ICDE A Rule-based Language for Deductive Object-Oriented Databases. A. M. Alashqur,Stanley Y. W. Su,Herman Lam 1990 A Rule-based Language for Deductive Object-Oriented Databases. ICDE Bill-of-Material Configuration Generation. Michael R. Blaha,William J. Premerlani,A. R. Bender,R. M. Salemme,M. M. Kornfein,C. K. Harkins 1990 Bill-of-Material Configuration Generation. ICDE "Title, General Chairman's Message, Program Chairman's Message, Reviewers, Table of Contents, Author Index." 1990 "Title, General Chairman's Message, Program Chairman's Message, Reviewers, Table of Contents, Author Index." ICDE The Generalized Grid File: Description and Performance Aspects. Henk M. Blanken,Alle IJbema,Paul Meek,Bert van den Akker 1990 The Generalized Grid File: Description and Performance Aspects. ICDE Join Index, Materialized View, and Hybrid-Hash Join: A Performance Analysis. José A. Blakeley,Nancy L. Martin 1990 Join Index, Materialized View, and Hybrid-Hash Join: A Performance Analysis. ICDE Compilation of Logic Programs to Implement Very Large Knowledge Base Systems - A Case Study: Educe*. Jorge B. Bocca 1990 Compilation of Logic Programs to Implement Very Large Knowledge Base Systems - A Case Study: Educe*. ICDE Attribute Inheritance Implemented on Top of a Relational Database System. Stefan Böttcher 1990 Attribute Inheritance Implemented on Top of a Relational Database System. ICDE Update Propagation in Distributed Memory Hierarchy. Matthew Bellew,Meichun Hsu,Va-On Tam 1990 Update Propagation in Distributed Memory Hierarchy. ICDE Modeling Design Object Relationships in PEGASUS. Alexandros Biliris 1990 Modeling Design Object Relationships in PEGASUS. ICDE Experimental Evaluation of Concurrency Checkpointing and Rollback-Recovery Algorithms. Bharat K. Bhargava,Shy-Renn Lian,Pei-Jyun Leu 1990 Experimental Evaluation of Concurrency Checkpointing and Rollback-Recovery Algorithms. ICDE An Attribute-Oriented Approach for Learning Classification Rules from Relational Databases. Yandong Cai,Nick Cercone,Jiawei Han 1990 An Attribute-Oriented Approach for Learning Classification Rules from Relational Databases. ICDE Generalization and a Framework for Query Modification. Surajit Chaudhuri 1990 Generalization and a Framework for Query Modification. ICDE Selectivity Estimation Using Homogeneity Measurement. Meng Chang Chen,Lawrence McNamee,Norman S. Matloff 1990 Selectivity Estimation Using Homogeneity Measurement. ICDE The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data. Shun Yan Cheung,Mostafa H. Ammar,Mustaque Ahamad 1990 A new protocol for maintaining replicated data that can provide both high data availability and low response time is presented. In the protocol, the nodes are organized in a logical grid. Existing protocols are designed primarily to achieve high availability by updating a large fraction of the copies, which provides some (although not significant) load sharing. In the new protocol, transaction processing is shared effectively among nodes storing copies of the data, and both the response time experienced by transactions and the system throughput are improved significantly. The authors analyze the availability of the new protocol and use simulation to study the effect of load sharing on the response time of transactions. They also compare the new protocol with a voting-based scheme. ICDE Cost of Distributed Deadlock Detection: A Performance Study. Alok N. Choudhary 1990 Cost of Distributed Deadlock Detection: A Performance Study. 
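The Grid Protocol entry above (Cheung, Ammar, and Ahamad) organizes replicas in a logical grid so that transaction processing is shared among the copies. A minimal sketch of a grid-style quorum construction follows, assuming read quorums take one replica from each column and write quorums add one full column on top of such a cover; the paper's exact quorum rules and its availability analysis are not reproduced here.

```python
from itertools import product

def grid(rows, cols):
    """All replica positions in the logical rows x cols grid."""
    return {(r, c) for r, c in product(range(rows), range(cols))}

def read_quorum(rows, cols, pick_row=0):
    """One replica from each column (here simply a whole row, for brevity)."""
    return {(pick_row, c) for c in range(cols)}

def write_quorum(rows, cols, pick_row=0, pick_col=0):
    """A one-per-column cover plus every replica in one column."""
    return read_quorum(rows, cols, pick_row) | {(r, pick_col) for r in range(rows)}

# Every read quorum intersects every write quorum (the cover meets the full
# column), and any two write quorums intersect for the same reason; that
# pairwise intersection is what makes quorum-based replica control sound.
replicas = grid(3, 4)
rq = read_quorum(3, 4, pick_row=1)
wq = write_quorum(3, 4, pick_row=2, pick_col=3)
assert rq <= replicas and wq <= replicas
assert rq & wq, "read and write quorums must overlap"
```

Because different transactions can pick different rows and columns, the work of servicing quorums spreads across the grid, which is the load-sharing effect the abstract describes.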
ICDE Extending Object-Oriented Concepts to Support Engineering Applications. Aloysius Cornelio,Shamkant B. Navathe,Keith L. Doty 1990 Extending Object-Oriented Concepts to Support Engineering Applications. ICDE Database Buffer Model for the Data Sharing Environment. Asit Dan,Daniel M. Dias,Philip S. Yu 1990 Database Buffer Model for the Data Sharing Environment. ICDE A Paradigm for Concurrency Control in Heterogeneous Distributed Database Systems. Ahmed K. Elmagarmid,Weimin Du 1990 A Paradigm for Concurrency Control in Heterogeneous Distributed Database Systems. ICDE A Temporal Model and Query Language for ER Databases. Ramez Elmasri,Gene T. J. Wuu 1990 A Temporal Model and Query Language for ER Databases. ICDE Access Invariance and Its Use in High Contention Environments. Peter A. Franaszek,John T. Robinson,Alexander Thomasian 1990 Access Invariance and Its Use in High Contention Environments. ICDE A Multiuser Performance Analysis of Alternative Declustering Strategies. Shahram Ghandeharizadeh,David J. DeWitt 1990 A Multiuser Performance Analysis of Alternative Declustering Strategies. ICDE Multi-User View Integration System (MUVIS): An Expert System for View Integration. Stephen Hayne,Sudha Ram 1990 Multi-User View Integration System (MUVIS): An Expert System for View Integration. ICDE Object Views: Extending the Vision. Sandra Heiler,Stanley B. Zdonik 1990 Object Views: Extending the Vision. ICDE A Shared Conceptual Schema for Four Medical Expert Systems. James P. Held,John V. Carlis 1990 A Shared Conceptual Schema for Four Medical Expert Systems. ICDE Using a Meta Model to Represent Object-Oriented Data Models. Shuguang Hong,Fred J. Maryanski 1990 Using a Meta Model to Represent Object-Oriented Data Models. ICDE Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines. Hui-I Hsiao,David J. DeWitt 1990 Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines. ICDE The R-File: An Efficient Access Structure for Proximity Queries. Andreas Hutflesz,Hans-Werner Six,Peter Widmayer 1990 The R-File: An Efficient Access Structure for Proximity Queries. ICDE Parallelism in Object-Oriented Query Processing. Kyung-Chang Kim 1990 Parallelism in Object-Oriented Query Processing. ICDE A Flexible Transaction Model for Software Engineering. Gail E. Kaiser 1990 A Flexible Transaction Model for Software Engineering. ICDE System Issues in Parallel Sorting for Database Systems. Balakrishna R. Iyer,Daniel M. Dias 1990 System Issues in Parallel Sorting for Database Systems. ICDE Spatial Search with Polyhedra. H. V. Jagadish 1990 Spatial Search with Polyhedra. ICDE ViewSystem: Integrating Heterogeneous Information Bases by Object-Oriented Views. Manfred Kaul,Klaus Drosten,Erich J. Neuhold 1990 ViewSystem: Integrating Heterogeneous Information Bases by Object-Oriented Views. ICDE Alternatives in Complex Object Representation: A Performance Perspective. Anant Jhingran,Michael Stonebraker 1990 Alternatives in Complex Object Representation: A Performance Perspective. ICDE A Suitable Algorithm for Computing Partial Transitive Closures in Databases. Bin Jiang 1990 A Suitable Algorithm for Computing Partial Transitive Closures in Databases. ICDE Multilevel Secure Database Concurrency Control. Thomas F. Keefe,Wei-Tek Tsai,Jaideep Srivastava 1990 Multilevel Secure Database Concurrency Control. ICDE Long-Duration Transactions in Software Design Projects. Henry F. Korth,Gregory D. 
Speegle 1990 Computer-assisted design applications pose many problems for database systems. Standard notions of correctness are insufficient, but some notion of correctness is still required. Formal transaction models have been proposed for such applications. However, the practicality of such models has not been clearly established. In this paper, we consider an example of a software development application and apply the NT/PV model to show how this example could be represented as a set of database transactions. We show that although the standard notion of correctness (serializability) is too strict, the notion of correctness in the NT/PV model allows sufficient concurrency with acceptable overhead. We extrapolate from this example to draw some conclusions regarding the potential usefulness of a formal approach to the management of long-duration design transactions. ICDE An Object-Oriented Approach to Data/Knowledge Modeling Based on Logic. Kyuchul Lee,Sukho Lee 1990 An Object-Oriented Approach to Data/Knowledge Modeling Based on Logic. ICDE A Partitioned Signature File Structure for Multiattribute and Text Retrieval. Dik Lun Lee,Chun-Wu Roger Leng 1990 A Partitioned Signature File Structure for Multiattribute and Text Retrieval. ICDE Buffer and Load Balancing in Locally Distributed Database Systems. Hongjun Lu,Kian-Lee Tan 1990 Buffer and Load Balancing in Locally Distributed Database Systems. ICDE Propagating Updates in a Highly Replicated Database. Tony P. Ng 1990 Propagating Updates in a Highly Replicated Database. ICDE An Analysis of Borrowing Policies for Escrow Transactions in a Replicated Data Environment. Akhil Kumar 1990 An Analysis of Borrowing Policies for Escrow Transactions in a Replicated Data Environment. ICDE Query Processing for Temporal Databases. T. Y. Cliff Leung,Richard R. Muntz 1990 Query Processing for Temporal Databases. ICDE Multimedia Object Models for Synchronisation and Databases. Thomas D. C. Little,Arif Ghafoor 1990 Multimedia Object Models for Synchronisation and Databases. ICDE On Representing Indefinite and Maybe Information in Relational Databases: A Generalization. Ken-Chih Liu,Rajshekhar Sunderraman 1990 On Representing Indefinite and Maybe Information in Relational Databases: A Generalization. ICDE A Brief Overview of LILOG-DB. Thomas Ludwig 1990 A Brief Overview of LILOG-DB. ICDE Representing Processes in the Extended Entity-Relationship Model. Victor M. Markowitz 1990 Representing Processes in the Extended Entity-Relationship Model. ICDE "A ``Greedy'' Approach to the Write Problem in Shadowed Disk Systems." Norman S. Matloff,Raymond Wai-Man Lo 1990 "A ``Greedy'' Approach to the Write Problem in Shadowed Disk Systems." ICDE Structuring Knowledge Bases Using Automatic Learning. Guy W. Mineau,Jan Gecsei,Robert Godin 1990 Structuring Knowledge Bases Using Automatic Learning. ICDE An Algorithmic Basis for Integrating Production Systems and Large Databases. Daniel P. Miranker,David A. Brant 1990 An Algorithmic Basis for Integrating Production Systems and Large Databases. ICDE Concurrency Control of Bulk Access Transactions on Shared Nothing Parallel Database Machines. Tadashi Ohmori,Masaru Kitsuregawa,Hidehiko Tanaka 1990 Concurrency Control of Bulk Access Transactions on Shared Nothing Parallel Database Machines. ICDE A New Tree Type Data Structure with Homogeneous Nodes Suitable for a Very Large Spatial Database. Yutaka Ohsawa,Masao Sakauchi 1990 A New Tree Type Data Structure with Homogeneous Nodes Suitable for a Very Large Spatial Database. 
ICDE Serializability in Object-Oriented Database Systems. Thomas C. Rakow,Junzhong Gu,Erich J. Neuhold 1990 Serializability in Object-Oriented Database Systems. ICDE A Cooperative Approach to Large Knowledge Based Systems. C. V. Ramamoorthy,Shashi Shekhar 1990 A Cooperative Approach to Large Knowledge Based Systems. ICDE How to Share Work on Shared Objects in Design Databases. Michael Ranft,Simone Rehm,Klaus R. Dittrich 1990 How to Share Work on Shared Objects in Design Databases. ICDE Scheduling Data Redistribution in Distributed Databases. Pedro I. Rivera-Vega,Ravi Varadarajan,Shamkant B. Navathe 1990 Scheduling Data Redistribution in Distributed Databases. ICDE A Query Algebra for Object-Oriented Databases. Gail M. Shaw,Stanley B. Zdonik 1990 We define an algebra that synthesizes relational query concepts with object-oriented databases. The algebra fully supports abstract data types and object identity while providing associative access to objects, including a unique join capability. The operations take an abstract view of objects and access typed collections of objects through the public interface defined for the type. The algebra supports access to relationships implied by the structure of the objects, as well as the definition and creation of new relationships between objects. The structure of the algebra and the abstract access to objects offer opportunities for query optimization. ICDE A Modular Query Optimizer Generator. Edward Sciore,John Sieg Jr. 1990 A Modular Query Optimizer Generator. ICDE Currency-Based Updates to Distributed Materialized Views. Arie Segev,Weiping Fang 1990 Currency-Based Updates to Distributed Materialized Views. ICDE Extended Relations. John Sieg Jr.,Edward Sciore 1990 Extended Relations. ICDE The Semantic Data Model for Security: Representing the Security Semantics of an Application. Gary W. Smith 1990 The Semantic Data Model for Security: Representing the Security Semantics of an Application. ICDE Performance Evaluation of Multiversion Database Systems. Sang Hyuk Son,Navid Haghighi 1990 Performance Evaluation of Multiversion Database Systems. ICDE Parallelism in Database Production Systems. Jaideep Srivastava,Kuo-Wei Hwang,Jack S. Eddy Tan 1990 Parallelism in Database Production Systems. ICDE Distributed RAID - A New Multiple Copy Algorithm. Michael Stonebraker,Gerhard A. Schloss 1990 Distributed RAID - A New Multiple Copy Algorithm. ICDE Supporting Universal Quantification in a Two-Dimensional Database Query Language. Kyu-Young Whang,Ashok Malhotra,Gary H. Sockut,Luanne M. Burns 1990 Supporting Universal Quantification in a Two-Dimensional Database Query Language. ICDE The Fingerprinted Database. Neal R. Wagner,Robert L. Fountain,Robert J. Hazy 1990 The Fingerprinted Database. ICDE Concurrency Control Using Locking with Deferred Blocking. Philip S. Yu,Daniel M. Dias 1990 Concurrency Control Using Locking with Deferred Blocking. ICDE Experiences with Distributed Query Processing. Clement T. Yu,Chengwen Liu 1990 Experiences with Distributed Query Processing. SIGMOD Conference Efficient Updates to Independent Schemes in the Weak Instance Model. Paolo Atzeni,Riccardo Torlone 1990 The weak instance model is a framework to consider the relations in a database as a whole, regardless of the way attributes are grouped in the individual relations. Queries and updates can be performed involving any set of attributes. 
The management of updates is based on a lattice structure on the set of legal states, and inconsistencies and ambiguities can arise. In the general case, the test for inconsistency and determinism may involve the application of the chase algorithm to the whole database. In this paper it is shown how, for the highly significant class of independent schemes, updates can be handled efficiently, considering only the relevant portion of the database. SIGMOD Conference OdeView: The Graphical Interface to Ode. Rakesh Agrawal,Narain H. Gehani,J. Srinivasan 1990 "OdeView is the graphical front end for Ode, an object-oriented database system and environment. Ode's data model supports data encapsulation, type inheritance, and complex objects. OdeView provides facilities for examining the database schema (i.e., the object type or class hierarchy), examining class definitions, browsing objects, following chains of references starting from an object, synchronized browsing, displaying selected portions of objects (projection), and retrieving objects with specific characteristics (selection). OdeView does not need to know about the internals of Ode objects. Consequently, the internals of specific classes are not hardwired into OdeView and new classes can be added to the Ode database without requiring any changes to or recompilation of OdeView. Just as OdeView does not know about the object internals, class functions (methods) for displaying objects are written without knowing about the specifics of the windowing software used by OdeView or the graphical user interface provided by it. In this paper, we present OdeView, and discuss its design and implementation." SIGMOD Conference OdeView: A User-Friendly Graphical Interface to Ode. Rakesh Agrawal,Narain H. Gehani,J. Srinivasan 1990 "OdeView is the graphical front end for Ode, an object-oriented database system and environment. It is intended for users who do not want to write programs in Ode's database programming language O++ to interact with Ode but instead want to use a friendlier interface to Ode. OdeView is based on the graphical direct manipulation paradigm that involves selection of items from pop-up menus and icons that can be clicked on and dragged. OdeView provides facilities for examining the database schema, examining class definitions, browsing objects, following chains of references, displaying selected portions of objects or selecting a subset of the ways in which an object can be displayed (projection), and retrieving specific objects (selection). Upon entering OdeView, the user is presented with a scrollable “database” window containing the names and iconified images of the current Ode databases. The user can select a database to interact with by using the mouse to click on the appropriate icon. OdeView then opens a “class relationship” window which displays the hierarchy relationship between the object classes in the database. The hierarchy relationship between classes is a set of dags. The user can zoom in and zoom out to examine this dag at various levels of detail. The user can also examine a class in detail by clicking at the node labeled with the class of interest. Clicking results in the opening of a “class information” window that has three scrollable subwindows, one showing its superclasses, the second its subclasses, and the third showing the meta data associated with this class. The class information window also has a button that, when clicked, shows the class definition.
The user may continue schema browsing by selecting another node in the schema graph, or may click on one of the superclasses or subclasses. Associated with each class in Ode is the set of persistent objects of that class, called a cluster. The class definition window has an “objects” button that allows users to browse through the objects in the cluster. Clicking this button opens the “object set” window which consists of two parts: the control and object panels. The control panel consists of reset, next, and previous buttons to sequence through the objects. The object panel has buttons to view the object, to view parts of the object (projection), and to specify the selection criteria. An Ode object can be displayed in one or more formats depending upon the semantics of the display function associated with the corresponding class. The object set window supplies one button for each of the object display formats. For example, an employee object can be displayed textually or in pictorial form; the object panel for employee provides appropriate buttons to see these displays. An object may contain embedded references to other objects. The object panel of an object set window provides buttons for viewing these referenced objects. The basic browsing paradigm encouraged by OdeView is to start from an object and then explore the related objects in the database by following the embedded chains of references. To speed up such repetitive navigations, OdeView supports synchronized browsing. Once the user has displayed a network of objects and applies a sequencing operation to any object in this network, the sequencing operation is automatically propagated over the network. OdeView is implemented using X-Windows and HP-Widgets on a SUN workstation running the UNIX system. The video takes the viewers on a tour of OdeView, showing how a user interacts with OdeView to examine the database schema and the objects in the database." SIGMOD Conference The Object-Oriented Database System Manifesto. Malcolm P. Atkinson,François Bancilhon,David J. DeWitt,Klaus R. Dittrich,David Maier,Stanley B. Zdonik 1990 The Object-Oriented Database System Manifesto. SIGMOD Conference Performance Evaluation of Semantics-based Multilevel Concurrency Control Protocols. B. R. Badrinath,Krithi Ramamritham 1990 For next generation information systems, concurrency control mechanisms are required to handle high level abstract operations and to meet high throughput demands. The currently available single level concurrency control mechanisms for reads and writes are inadequate for future complex information systems. In this paper, we will present a new multilevel concurrency protocol that uses a semantics-based notion of conflict, which is weaker than commutativity, called recoverability. Further, operations are scheduled according to relative conflict, a conflict notion based on the structure of operations. Performance evaluation via extensive simulation studies shows that with our multilevel concurrency control protocol, the performance improvement is significant when compared to that of a single level two-phase locking based concurrency control scheme or to that of a multilevel concurrency control scheme based on commutativity alone. Further, simulation studies show that our new multilevel concurrency control protocol performs better even with resource contention. SIGMOD Conference Implementing Recoverable Requests Using Queues. Philip A.
Bernstein,Meichun Hsu,Bruce Mann 1990 Transactions have been rigorously defined and extensively studied in the database and transaction processing literature, but little has been said about the handling of the requests for transaction execution. In commercial TP systems, especially distributed ones, managing the flow of requests is often as important as executing the transactions themselves. This paper studies fault-tolerant protocols for managing the flow of transaction requests between clients that issue requests and servers that process them. We discuss how to implement these protocols using transactions and recoverable queuing systems. Queuing systems are used to move requests reliably between clients and servers. The protocols use queuing systems to ensure that the server processes each request exactly once and that a client processes each reply at least once. We treat request-reply protocols for single-transaction requests, for multi-transaction requests, and for requests that require interaction with the display after the request is submitted. SIGMOD Conference The INA: A Simple Query Language with Only Attribute Names. Bruce I. Blum,Ralph D. Semmel 1990 Current query languages, such as SQL, assume that the user is familiar with the database schema including the attribute names, types, and relation associations. When a user has imperfect knowledge of this information (or when he balks at the data-processing orientation of the required statements), he normally asks an experienced analyst to perform his ad hoc query. The Intelligent Navigational Assistant (INA) was developed for the U.S. Army as a prototype query tool that permits users to specify requests using only domain terms familiar to them. Once a request is made, it is converted into SQL for processing [1,2]. To facilitate query formulation, the INA supports an interface that allows the user to identify attributes without relation associations (i.e., treats the data model as a universal relation). Because an attribute may appear in many relations, one of the principal tasks of the INA is the determination of the appropriate relation bindings. To aid in the selection of terms, the INA maintains a user vocabulary and provides facilities for browsing the vocabulary and examining term definitions. Thus, the INA has two primary functions: it provides an easy-to-use interface for query definition, and it converts a request into SQL. The INA prototype has been implemented as a PC-resident knowledge-based system linked to a host-based DBMS. Its knowledge base is the logical schema of the target database, and the query transformation relies on the dependencies implicit in that schema. Supporting the knowledge-processing functions are the query definition interface, various tools to manage the target data model description, and facilities for communicating with other computers. The system was developed using TEDIUM [3], and the user interface and query resolution mechanism are extensions of earlier work with Tequila [4] (which accessed the semantically richer TEDIUM data model). Work on the INA began in 1987 and was terminated in 1988. The system was demonstrated as a prototype with an Army-supplied logical model consisting of approximately 40 relations and 200 attributes. After query definition, reformation, and user acceptance, the SQL queries were submitted to the mainframe for processing. In those tests, the INA often produced better queries than those manually coded by analysts.
The INA currently is undergoing a beta test with a much larger database schema. Its algorithms are described in reference 5, and reference 3 contains details regarding its implementation and semantic data model. Current research includes the development of improved query resolution algorithms based on an enriched semantic data model. SIGMOD Conference The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. Norbert Beckmann,Hans-Peter Kriegel,Ralf Schneider,Bernhard Seeger 1990 "The R-tree, one of the most popular access methods for rectangles, is based on the heuristic optimization of the area of the enclosing rectangle in each inner node. By running numerous experiments in a standardized testbed under highly varying data, queries and operations, we were able to design the R*-tree which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory. Using our standardized testbed in an exhaustive performance comparison, it turned out that the R*-tree clearly outperforms the existing R-tree variants: Guttman's linear and quadratic R-trees and Greene's variant of the R-tree. This superiority of the R*-tree holds for different types of queries and operations, such as map overlay, for both rectangles and multidimensional points in all experiments. From a practical point of view the R*-tree is very attractive because of the following two reasons: (1) it efficiently supports point and spatial data at the same time, and (2) its implementation cost is only slightly higher than that of other R-trees." SIGMOD Conference Reliable Transaction Management in a Multidatabase System. Yuri Breitbart,Abraham Silberschatz,Glenn R. Thompson 1990 A model of a multidatabase system is defined in which each local DBMS uses the two-phase locking protocol. Locks are released by a global transaction only after the transaction commits or aborts at each local site. Failures may occur during the processing of transactions. We design a fault tolerant transaction management algorithm and recovery procedures that retain global database consistency. We also show that our algorithms ensure freedom from global deadlocks of any kind. SIGMOD Conference Integrating Object-Oriented Data Modeling with a Rule-Based Programming Paradigm. Filippo Cacace,Stefano Ceri,Stefano Crespi-Reghizzi,Letizia Tanca,Roberto Zicari 1990 LOGRES is a new project for the development of extended database systems which is based on the integration of the object-oriented data modelling paradigm and of the rule-based approach for the specification of queries and updates. The data model supports generalization hierarchies and object sharing; the rule-based language extends Datalog to support generalized type constructors (sets, multisets, and sequences); rule-based integrity constraints are automatically produced by analyzing schema definitions. Modularization is a fundamental feature: modules encapsulate queries and updates, and when modules are applied to a LOGRES database, their side effects can be controlled. The LOGRES project is a follow-up of the ALGRES project, and takes advantage of the ALGRES programming environment for the development of a fast prototype. SIGMOD Conference Kaleidoscope: A Cooperative Menu-Guided Query Interface. Sang Kyun Cha,Gio Wiederhold 1990 "Querying databases to obtain information requires the user's knowledge of query language and underlying data.
However, because the knowledge in human long-term memory is imprecise, incomplete, and often incorrect, user queries are subject to various types of failure. These may include spelling mistakes, the violation of the syntax and semantics of a query language, and the misconception of the entities and relationships in a database. Kaleidoscope is a cooperative query interface whose knowledge guides users to avoid most failures during query creation. We call this type of cooperative behavior intraquery guidance. To enable this early, active engagement in the user's process of query creation, Kaleidoscope reduces the granularity of user-system interaction via a context-sensitive menu. The system generates valid query constituents as menu choices step-by-step by interpreting a language grammar, and the user creates a query following this menu guidance [2]. For instance, it takes four steps to create the following query: [Q1] Who/1 authored/2 'AI'/3 journal papers/(3+) in 'Postquery COOP'/4. At each of these steps, as the user selects one of the menu choices, the system updates its partial query status window. If a choice is unique as in (3+), it is taken automatically. To guide the user's entry of values, the system provides a pop-up menu for each value domain. With Kaleidoscope's process of choice generation tightly controlled by the system's knowledge of query language and underlying data, users need not remember the query language and the underlying database structure but merely recognize or identify the constituents coming one after another that match their intended query. The system provides additional guidance for users to avoid creating semantically inconsistent queries. It informs the user of any derived predicates on the completion of a user-selected predicate. To illustrate this, consider a partially constructed SQL query: [Q2] SELECT * FROM professor p#1 WHERE p#1.dept = 'CS' AND p#1.salary < 40000. Suppose that the system has an integrity constraint [IC] FROM professor p IF p.dept = 'CS' AND p.salary < 45000 THEN p.rank = 'Assistant'. This rule states that a CS professor whose salary is less than 45000 is an assistant professor. With the replacement of rule variable p in IC by Q2's range variable p#1, IC's leading two predicates subsume Q2's query condition, producing p#1.rank = 'Assistant'. Because this derived predicate is not subsumed by Q2's query condition, the system suspects that the user may not know of it and presents it to the user. Derived predicates, together with user-selected ones, constrain the user's further conjunctive extension of the partial query condition. For example, the system prunes the field rank (as well as the field dept) in the conjunctive extension of Q2, because the derived condition restricts the value of this field to a constant. As shown in examples, we apply Kaleidoscope's approach to two linear-syntax languages at different levels of abstraction: SQL [1] and a query language whose syntax and semantics cover a subset of wh-queries. To implement the intraquery guidance, we extend a context-free grammar by associating context variables with each grammar symbol and attaching several types of procedural decorations to grammar rules. This extension enables the system to capture the semantic constraints and its user-guiding actions in a domain-independent grammar. As the grammar is interpreted, the database-specific information is fed from the system's lexicon and knowledge base.
The current implementation of Kaleidoscope runs on a XEROX-1186 LISP machine with a SUN server configured with a relational DBMS. The approach of Kaleidoscope is based on the normative system assumption. The system presents its capability transparently to the user in a context-dependent manner during the user's query creation. This makes the system usable even with a small amount of stored knowledge." SIGMOD Conference ACTA: A Framework for Specifying and Reasoning about Transaction Structure and Behavior. Panos K. Chrysanthis,Krithi Ramamritham 1990 "Recently, a number of extensions to the traditional transaction model have been proposed to support new information-intensive applications such as CAD/CAM and software development. However, these extended models capture only a subset of interactions that can be found in such applications, and represent only some of the points within the spectrum of interactions possible in competitive and cooperative environments. ACTA is a formalizable framework developed for characterizing the whole spectrum of interactions. The ACTA framework is not yet another transaction model, but is intended to unify the existing models. ACTA allows for specifying the structure and the behavior of transactions as well as for reasoning about the concurrency and recovery properties of the transactions. In ACTA, the semantics of interactions are expressed in terms of transactions' effects on the commit and abort of other transactions and on objects' state and concurrency status (i.e., synchronization state). Its ability to capture the semantics of previously proposed transaction models is indicative of its generality. The reasoning capabilities of this framework have also been tested by using the framework to study the properties of a new model that is derived by combining two existing transaction models." SIGMOD Conference The G+/GraphLog Visual Query System. Mariano P. Consens,Alberto O. Mendelzon 1990 The video presentation “The G+/GraphLog Visual Query System” gives an overview of the capabilities of the ongoing implementation of the G+ Visual Query System for visualizing both data and queries as graphs. The system provides an environment for expressing queries in GraphLog [Con89, CM89, CM90], as well as for browsing, displaying and editing graphs. The visual query system also supports displaying the answers in several different ways. Graphs are a very natural representation for data in many application domains, for example, transportation networks, project scheduling, parts hierarchies, family trees, concept hierarchies, and Hypertext. From a broader perspective, many databases can be naturally viewed as graphs. In particular, any relational database in which we can identify one or more sets of objects of interest and relationships between them can be represented by mapping these objects into nodes and relationships into edges. In the case of semantic and object-oriented databases, there is a natural mapping of objects to nodes and attributes to edges. GraphLog is a visual query language, based on a graph representation of both data and queries, that has evolved from the earlier language G+ [CMW87, CMW89, MW89]. GraphLog queries ask for patterns that must be present or absent in the database graph. Each such pattern, called a query graph, defines new edges that are added to the graph whenever the pattern is found. GraphLog queries are sets of query graphs, called graphical queries. 
If, when looking at a query graph in a graphical query, we do not find an edge label in the database, then there must exist another query graph in the graphical query defining that edge. The language also supports computing aggregate functions and summarizing along paths. The G+ Visual Query System is currently implemented in Smalltalk-80™, and runs on Sun 3, Sun 4 and Macintosh II workstations. A Graph Editor is available for editing query graphs and displaying database graphs. It supports graph “cutting and pasting”, as well as text editing of node and edge labels, node and edge repositioning and re-shaping, storage and retrieval of graphs as text files, etc. Automatic graph layout is also provided. For editing collections of graphs (such as graphical queries) a Graph Browser is available. The first answer mode supported by the G+ Visual Query System is to return as the result of a GraphLog query a graph with the new edges defined by the graphical query added to the database graph. An alternative way of visualizing answers is by high-lighting on the database graph, one at a time, the paths (or just the nodes) described by the query. This mode is particularly useful to locate interesting starting points for browsing. Rather than viewing the answers superimposed on the database graph, the user may choose to view them in a Graph Browser. The Graph Browser contains the set of subgraphs of the database graph that were found to satisfy the query. Finally, the user may select to collect all the subgraphs of the database graph that satisfy the query together into one new graph. This graph (as well as any other result graph from any of the above mentioned answer modes) in turn may be queried, providing a mechanism for iterative filtering of irrelevant information until a manageable subgraph is obtained. SIGMOD Conference Organizing Long-Running Activities with Triggers and Transactions. Umeshwar Dayal,Meichun Hsu,Rivka Ladin 1990 This paper addresses the problem of organising and controlling activities that involve multiple steps of processing and that typically are of long duration. We explore the use of triggers and transactions to specify and organize such long-running activities. Triggers offer data- or event-driven specification of control flow, and thus provide a flexible and modular framework with which the control structures of the activities can be extended or modified. We describe a model based on event-condition-action rules and coupling modes. The execution of these rules is governed by an extended nested transaction model. Through a detailed example, we illustrate the utility of the various features of the model for chaining related steps without sacrificing concurrency, for enforcing integrity constraints, and for providing flexible failure and exception handling. SIGMOD Conference Encapsulation of Parallelism in the Volcano Query Processing System. Goetz Graefe 1990 "Volcano is a new dataflow query processing system we have developed for database systems research and education. The uniform interface between operators makes Volcano extensible by new operators. All operators are designed and coded as if they were meant for a single-process system only. When attempting to parallelize Volcano, we had to choose between two models of parallelization, called here the bracket and operator models. We describe the reasons for not choosing the bracket model, introduce the novel operator model, and provide details of Volcano's exchange operator that parallelizes all other operators. 
It allows intra-operator parallelism on partitioned datasets and both vertical and horizontal inter-operator parallelism. The exchange operator encapsulates all parallelism issues and therefore makes implementation of parallel database algorithms significantly easier and more robust. Included in this encapsulation is the translation between demand-driven dataflow within processes and data-driven dataflow between processes. Since the interface between Volcano operators is similar to the one used in “real,” commercial systems, the techniques described here can be used to parallelize other query processing engines." SIGMOD Conference A Framework for the Parallel Processing of Datalog Queries. Sumit Ganguly,Abraham Silberschatz,Shalom Tsur 1990 This paper presents several complementary methods for the parallel, bottom-up evaluation of Datalog queries. We introduce the notion of a discriminating predicate, based on hash functions, that partitions the computation between the processors in order to achieve parallelism. A parallelization scheme with the property of non-redundant computation (no duplication of computation by processors) is then studied in detail. The mapping of Datalog programs onto a network of processors, such that the result is a non-redundant computation, is also studied. The methods reported in this paper clearly demonstrate the trade-offs between redundancy and interprocessor-communication for this class of problems. SIGMOD Conference A Graph-Oriented Object Model for Database End-User Interfaces. Marc Gyssens,Jan Paredaens,Dirk Van Gucht 1990 A Graph-Oriented Object Model for Database End-User Interfaces. SIGMOD Conference A Predicate Matching Algorithm for Database Rule Systems. Eric N. Hanson,Moez Chaabouni,Chang-Ho Kim,Yu-Wang Wang 1990 Forward-chaining rule systems must test each newly asserted fact against a collection of predicates to find those rules that match the fact. Expert system rule engines use a simple combination of hashing and sequential search for this matching. We introduce an algorithm for finding the matching predicates that is more efficient than the standard algorithm when the number of predicates is large. We focus on equality and inequality predicates on totally ordered domains. This algorithm is well-suited for database rule systems, where predicate-testing speed is critical. A key component of the algorithm is the interval binary search tree (IBS-tree). The IBS-tree is designed to allow efficient retrieval of all intervals (e.g. range predicates) that overlap a point, while allowing dynamic insertion and deletion of intervals. The algorithm could also be used to improve the performance of forward-chaining inference engines for large expert systems applications. SIGMOD Conference Randomized Algorithms for Optimizing Large Join Queries. Yannis E. Ioannidis,Younkyung Cha Kang 1990 "Query optimization for relational database systems is a combinatorial optimization problem, which makes exhaustive search unacceptable as the query size grows. Randomized algorithms, such as Simulated Annealing (SA) and Iterative Improvement (II), are viable alternatives to exhaustive search. We have adapted these algorithms to the optimization of project-select-join queries. We have tested them on large queries of various types with different databases, concluding that in most cases SA identifies a lower cost access plan than II.
To explain this result, we have studied the shape of the cost function over the solution space associated with such queries and we have conjectured that it resembles a 'cup' with relatively small variations at the bottom. This has inspired a new Two Phase Optimization algorithm, which is a combination of Simulated Annealing and Iterative Improvement. Experimental results show that Two Phase Optimization outperforms the original algorithms in terms of both output quality and running time." SIGMOD Conference Linear Clustering of Objects with Multiple Attributes. H. V. Jagadish 1990 "There is often a need to map a multi-dimensional space onto a one-dimensional space. For example, this kind of mapping has been proposed to permit the application of one-dimensional indexing techniques to a multi-dimensional index space such as in a spatial database. This kind of mapping is also of value in assigning physical storage, such as assigning buckets to records that have been indexed on multiple attributes, to minimize the disk access effort. In this paper, we discuss what the desired properties of such a mapping are, and evaluate, through analysis and simulation, several mappings that have been proposed in the past. We present a mapping based on Hilbert's space-filling curve, which out-performs previously proposed mappings on average over a variety of different operating conditions." SIGMOD Conference Access Support in Object Bases. Alfons Kemper,Guido Moerkotte 1990 In this work access support relations are introduced as a means for optimizing query processing in object-oriented database systems. The general idea is to maintain redundant separate structures (disassociated from the object representation) to store object references that are frequently traversed in database queries. The proposed access support relation technique is no longer restricted to relate an object (tuple) to an atomic value (attribute value) as in conventional indexing. Rather, access support relations relate objects with each other and can span over reference chains which may contain collection-valued components in order to support queries involving path expressions. We present several alternative extensions of access support relations for a given path expression, the best of which has to be determined according to the application-specific database usage profile. An analytical cost model for access support relations and their application is developed. This analytical cost model is, in particular, used to determine the best access support relation extension and decomposition with respect to the specific database configuration and application profile. SIGMOD Conference The Iris Database System. William Kent,Peter Lyngbæk,Samir Mathur,W. Kevin Wilkinson 1990 The Iris Database System. SIGMOD Conference Making Deductive Databases a Practical Technology: A Step Forward. Gerald Kiernan,Christophe de Maindreville,Eric Simon 1990 Deductive databases provide a formal framework to study rule-based query languages that are extensions of first-order logic. However, deductive database languages and their current implementations do not seem appropriate for improving the development of real applications or even samples of them. Our goal is to make deductive database technology practical. The design and implementation of the RDL1 system, presented in this paper, constitute a step toward this goal.
Our approach is based on the integration of a production rule language within a relational database system, the development of a rule-based programming environment, and the support of system extensibility using Abstract Data Types. We discuss important practical experience gained during the implementation of the system. Also, comparisons with related work such as LDL, STARBURST and POSTGRES are given. SIGMOD Conference Bayan: An Arabic Text Database Management System. Roger King,Ali Morfeq 1990 "Most existing databases lack features which allow for the convenient manipulation of text. It is even more difficult to use them if the text language is not based on the Roman alphabet. The Arabic language is a very good example of this case. Many projects have attempted to use conventional database systems for Arabic data manipulation (including text data), but because of Arabic's many differences from English, these projects have met with limited success. In the Bayan project, the approach has been different. Instead of simply trying to adapt an environment to Arabic, the properties of the Arabic language were the starting point and everything was designed to meet the needs of Arabic, thus avoiding the shortcomings of other projects. A text database management system was designed to overcome the shortcomings of conventional database management systems in manipulating text data. Bayan's data model is based on an object-oriented approach which helps the extensibility of the system for future use. In Bayan, we designed the database with the Arabic text properties in mind. We designed it to support the way Arabic words are derived, classified, and constructed. Furthermore, linguistic algorithms (for word generation and morphological decomposition of words) were designed, leading to a formalization of rules of Arabic language writing and sentence construction. A user interface was designed on top of this environment. A new representation of the Arabic characters was designed, a complete Arabic keyboard layout was created, and a window-based Arabic user interface was also designed." SIGMOD Conference Concurrency Control in Multilevel-Secure Databases Based on Replicated Architecture. Boris Kogan,Sushil Jajodia 1990 In a multilevel secure database management system based on the replicated architecture, there is a separate database management system to manage data at or below each security level, and lower level data are replicated in all databases containing higher level data. In this paper, we address the open issue of concurrency control in such a system. We give a secure protocol that guarantees one-copy serializability of concurrent transaction executions and can be implemented in such a way that the size of the trusted code (including the code required for concurrency and recovery) is small. SIGMOD Conference Pasta-3: A Graphical Direct Manipulation Interface for Knowledge Base Management Systems. Michel Kuntz 1990 "Pasta-3 is an end-user interface for D/KBMSs based on the graphical Direct Manipulation (DM) interaction paradigm, which relies on a bit-mapped, multi-window screen and a mouse to implement clickable icons as the main representation of information. This style of interaction enables end users to learn quickly and remember easily how the system works. Pasta-3 gives complete access to the D/KBMS, since its users can carry out all manipulation tasks through it: schema definition, schema and data browsing, query formulation, and updating.
These tasks can be freely mixed, combined, and switched. Pasta-3 interfaces to the KB2 knowledge base system, implemented in Prolog and built over the EDUCE system, which provides a tight coupling to a relational DBMS. KB2 uses the Entity-Relationship data model, extended with inheritance and deduction rules. KB2 was developed by the KB Group at ECRC. Pasta-3 uses Direct Manipulation in the strong sense of the term: DM of the actual graphical representations of the application data and not just DM of commands operating on that data. Besides the high degree of integration in the overall design, major innovations with respect to earlier work include enhanced schema browsing with active functionalities to facilitate correct user understanding of the KB structure, “synchronized” data browsing that exploits the underlying semantic data model to make browsing more powerful, and a graphical query language providing full expressive power (including certain recursive queries, nested subqueries, quantification). Pasta-3 provides interactive design support that has significant ergonomic advantages over the usual approach to this problem. In Pasta-3, different types of schema information (the basic E-R diagram, the inheritance lattices, and the properties of each E-R item) are displayed in separate windows, which makes accurate reading of such information much less difficult than in the usual case where all these layers are thrown together in a single graph, which makes misinterpretation hard to avoid. For schema and data browsing, Pasta-3 offers facilities that build more semantics into the browsing processes. One type of schema browsing tool is a subgraph computation capability which automatically finds and displays the paths that connect arbitrary E-R items. This helps end users to correctly perceive the schema structure. Data browsing includes “synchronised” browsing, a functionality which shows simultaneously data from several Entities all sharing the same Relationship and indicates which values from each Entity are associated with given values from the others. Pasta-3's DM query language replaces the textual language without loss of expressive power: it offers a new, sophisticated DM editing capability for the same formal constructs. Query specification takes place in a window containing icons representing the components of the query expression which can be created, destroyed, and modified all by clicking and dragging through the mouse. Queries can be recursive and involve logical variables, quantification, and subqueries. Expressions mixing both KB2 statements and Prolog predicates can also be formulated. The video shows Pasta-3 actually being used, in real time and under normal conditions. It includes sequences demonstrating all three major functionalities: schema design, browsing, and querying. It gives an example of the subgraph computation capability and builds a simple query from scratch, going through all the steps needed to do so. The demonstration also includes work with other types of Pasta-3 windows (e.g., property sheets). The video has an English-language sound track explaining everything that is seen on the screen. The camera zooms in and out in order to show full screen overviews (giving a good idea of the general “feel” of the interface) and close-ups of work with mouse and icons (allowing the viewer to see as much detail in the video as an actual user would, seated in front of the workstation)." SIGMOD Conference Extending Logic Programming.
Els Laenens,Domenico Saccà,Dirk Vermeir 1990 An extension of logic programming, called “ordered logic programming”, which includes some abstractions of the object-oriented paradigm, is presented. An ordered program consists of a number of modules (objects), where each module is composed of a number of rules, possibly with negated head predicates. A sort of “isa” hierarchy can be defined among the modules in order to allow for rule inheritance. Therefore, every module sees its own rules as local rules and the rules of the other modules to which it is connected by the “isa” hierarchy as global rules. In this way, as local rules may hide global rules, it is possible to deal with default properties and exceptions. This new approach represents a novel attempt to combine the logic paradigm with the object-oriented one in knowledge base systems. Moreover, this approach provides a new ground for explaining some recent proposals of semantics for classical logic programs with negation in the rule bodies and gives an interesting semantics to logic programs with negated rule heads. SIGMOD Conference A Starburst is Born. George Lapis,Guy M. Lohman,Hamid Pirahesh 1990 A Starburst is Born. SIGMOD Conference The Performance of a Multiversion Access Method. David B. Lomet,Betty Salzberg 1990 The Time-Split B-tree is an integrated index structure for a versioned timestamped database. It gradually migrates data from a current database to an historical database, records migrating when nodes split. Records valid at the split time are placed in both an historical node and a current node. This implies some redundancy. Using both analysis and simulation, we characterise the amount of redundancy, the space utilization, and the record addition (insert or update) performance for a spectrum of different rates of insertion versus update. Three splitting policies are studied which alter the conditions under which either time splits or key space splits are performed. SIGMOD Conference Practical Selectivity Estimation through Adaptive Sampling. Richard J. Lipton,Jeffrey F. Naughton,Donovan A. Schneider 1990 Recently we have proposed an adaptive, random sampling algorithm for general query size estimation. In earlier work we analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to selects and joins. First, we extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. Next, we provide “sanity bounds” to deal with queries for which the underlying data is extremely skewed or the query result is very small. Finally, we report on the performance of the estimation algorithm as implemented in a host language on a commercial relational system. The results are encouraging, even with this loose coupling between the estimation algorithm and the DBMS. SIGMOD Conference Querying Database Knowledge. Amihai Motro,Qiuhui Yuan 1990 The role of database knowledge is usually limited to the evaluation of data queries. In this paper we argue that when this knowledge is of substantial volume and complexity, there is a genuine need to query this repository of information. Moreover, since users of the database may not be able to distinguish between information that is data and information that is knowledge, access to knowledge and data should be provided with a single, coherent instrument. We provide an informal review of various kinds of knowledge queries, with possible syntax and semantics.
We then formalize a framework of knowledge-rich databases, and a simple query language consisting of a pair of retrieve and describe statements. The retrieve statement is for querying the data (it corresponds to the basic retrieval statement of various knowledge-rich database systems). The describe statement is for querying the knowledge. Essentially, it inquires about the meaning of a concept under specified circumstances. We provide algorithms for evaluating sound and finite knowledge answers to describe queries, and we demonstrate them with examples. SIGMOD Conference Magic is Relevant. Inderpal Singh Mumick,Sheldon J. Finkelstein,Hamid Pirahesh,Raghu Ramakrishnan 1990 We define the magic-sets transformation for traditional relational systems (with duplicates, aggregation and grouping), as well as for relational systems extended with recursion. We compare the magic-sets rewriting to traditional optimization techniques for nonrecursive queries, and use performance experiments to argue that the magic-sets transformation is often a better optimization technique. SIGMOD Conference Random Sampling from Hash Files. Frank Olken,Doron Rotem,Ping Xu 1990 In this paper we discuss simple random sampling from hash files on secondary storage. We consider both iterative and batch sampling algorithms from both static and dynamic hashing methods. The static methods considered are open addressing hash files and hash files with separate overflow chains. The dynamic hashing methods considered are Linear Hash files [Lit80] and Extendible Hash files [FNPS79]. We give the cost of sampling in terms of the cost of successfully searching a hash file and show how to exploit features of the dynamic hashing methods to improve sampling efficiency. SIGMOD Conference A Comparison of Spatial Query Processing Techniques for Native and Parameter Spaces. Jack A. Orenstein 1990 Spatial queries can be evaluated in native space or in a parameter space. In the latter case, data objects are transformed into points and query objects are transformed into search regions. The requirement for different data and query representations may prevent the use of parameter-space searching in some applications. Native-space and parameter-space searching are compared in the context of a z order-based spatial access method. Experimental results show that when there is a single query object, searching in parameter space can be faster than searching in native space, if the data and query objects are large enough, and if sufficient redundancy is used for the query representation. The result is, however, less accurate than the native space result. When there are multiple query objects, native-space searching is better initially, but as the number of query objects increases, parameter space searching with low redundancy is superior. Native-space searching is much more accurate for multiple-object queries. SIGMOD Conference Database Management Issues of the Human Genome Project. Robert M. Pecherer 1990 Database Management Issues of the Human Genome Project. SIGMOD Conference Query Graphs, Implementing Trees, and Freely-Reorderable Outerjoins. Arnon Rosenthal,César A. Galindo-Legaria 1990 We determine when a join/outerjoin query can be expressed unambiguously as a query graph, without an explicit specification of the order of evaluation. To do so, we first characterize the set of expression trees that implement a given join/outerjoin query graph, and investigate the existence of transformations among the various trees. 
Our main theorem is that a join/outerjoin query is freely reorderable if the query graph derived from it falls within a particular class, every tree that “implements” such a graph evaluates to the same result. The result has applications to language design and query optimization. Languages that generate queries within such a class do not require the user to indicate priority among join operations, and hence may present a simplified syntax. And it is unnecessary to add extensive analyses to a conventional query optimizer in order to generate legal reorderings for a freely-reorderable language. SIGMOD Conference FastSort: A Distributed Single-Input Single-Output External Sort. Betty Salzberg,Alex Tsukerman,Jim Gray,Michael Stewart,Susan Uren,Bonnie Vaughan 1990 External single-input single-output sorts can use multiple processors each with a large tournament replacement-selection in memory, and each with private disks to sort an input stream in linear elapsed time. Of course, increased numbers of processors, memories, and disks are required as the input file size grows. This paper analyzes the algorithm and reports the performance of an implementation. SIGMOD Conference Hard Problems for Simple Logic Programs. Yatin P. Saraiya 1990 A number of optimizations have been proposed for Datalog programs involving a single intensional predicate (“single-IDB programs”). Examples include the detection of commutativity and separability ([Naug88],[RSUV89], [Ioan89a]) in linear logic programs, and the detection of ZYT-linearizability ([ZYT88], [RSUV89], [Sara89], [Sara90]) in nonlinear programs. We show that the natural generalizations of the commutativity and ZYT-linearizability problems (respectively, the sequencability and base-case linearizability problems) are undecidable. Our constructions involve the simulation of context-free grammars using single-IDB programs that have a bounded number of initialisation rules. The constructions may be used to show that containment (or equivalence) is undecidable for such programs, even if the programs are linear, or if each program contains a single recursive rule. These results tighten those of [Shmu87] and [Abit89]. SIGMOD Conference A Performance Evaluation of Pointer-Based Joins. Eugene J. Shekita,Michael J. Carey 1990 In this paper we describe three pointer-based join algorithms that are simple variants of the nested-loops, sort-merge, and hybrid-hash join algorithms used in relational database systems. Each join algorithm is described and an analysis is carried out to compare the performance of the pointer-based algorithms to their standard, non-pointer-based counterparts. The results of the analysis show that the pointer-based algorithms can provide significant performance gains in many situations. The results also show that the pointer-based nested-loops join algorithm, which is perhaps the most natural pointer-based join algorithm to consider using in an object-oriented database system, performs quite poorly on most medium to large joins. SIGMOD Conference IDLOG: Extending the Expressive Power of Deductive Database Languages. Yeh-Heng Sheng 1990 The expressive power of pure deductive database languages, such as DATALOG and stratified DATALOGS, is limited in a sense that some useful queries such as functions involving aggregation are not definable in these languages. Our concern in this paper is to provide a uniform logic framework for deductive databases with greater expressive power. 
It has been shown that with a linear ordering on the domain of the database, the expressive power of some database languages can be enhanced so that some functions involving aggregation can be defined. Yet, a direct implementation of the linear ordering in deductive database languages may seem unintuitive, and may not be very efficient to use in practice. We propose a logic for deductive databases which employs the notion of “identifying each tuple in a relation”. Through the use of these tuple-identifications, different linear orderings are defined as a result. This intuitively explains the reason why our logic has greater expressive power. The proposed logic language is non-deterministic in nature. However, non-determinism is not the real reason for the enhanced expressive power. A deterministic subset of the programs in this language is computationally complete in the sense that it defines all the computable deterministic queries. Although the problem of deciding whether a program is in this subset is in general undecidable, we do provide a rather general sufficient test for identifying such programs. Also discussed in this paper is an extended notion of queries which allows both the input and the output of a query to contain interpreted constants of an infinite domain. We show that extended queries involving aggregation can also be defined in the language. SIGMOD Conference Write-Only Disk Caches. Jon A. Solworth,Cyril U. Orji 1990 With recent declines in the cost of semiconductor memory and the increasing need for high performance I/O disk systems, it makes sense to consider the design of large caches. In this paper, we consider the effect of caching writes. We show that cache sizes in the range of a few percent allow writes to be performed at negligible or no cost and independently of locality considerations. SIGMOD Conference The Postgres DBMS. Michael Stonebraker 1990 The Postgres DBMS. SIGMOD Conference On Rules, Procedures, Caching and Views in Data Base Systems. Michael Stonebraker,Anant Jhingran,Jeffrey Goh,Spyros Potamianos 1990 This paper demonstrates that a simple rule system can be constructed that supports a more powerful view system than available in current commercial systems. Not only can views be specified by using rules but also special semantics for resolving ambiguous view updates are simply additional rules. Moreover, procedural data types as proposed in POSTGRES are also efficiently simulated by the same rules system. Lastly, caching of the action part of certain rules is a possible performance enhancement and can be applied to materialize views as well as to cache procedural data items. Hence, we conclude that a rule system is a fundamental concept in a next generation DBMS, and it subsumes both views and procedures as special cases. SIGMOD Conference "The Committee for Advanced DBMS Function": Third Generation Data Base System Manifesto. Michael Stonebraker,Lawrence A. Rowe,Bruce G. Lindsay,Jim Gray,Michael J. Carey,David Beech 1990 "The Committee for Advanced DBMS Function": Third Generation Data Base System Manifesto. SIGMOD Conference The Input/Output Complexity of Transitive Closure. Jeffrey D. Ullman,Mihalis Yannakakis 1990 Suppose a directed graph has its arcs stored in secondary memory, and we wish to compute its transitive closure, also storing the result in secondary memory. We assume that an amount of main memory capable of holding s “values” is available, and that s lies between n, the number of nodes of the graph, and e, the number of arcs.
The cost measure we use for algorithms is the I/O complexity of Kung and Hong, where we count 1 every time a value is moved into main memory from secondary memory, or vice versa. In the dense case, where e is close to n², we show that I/O equal to O(n³ / √s) is sufficient to compute the transitive closure of an n-node graph, using main memory of size s. Moreover, it is necessary for any algorithm that is “standard,” in a sense to be defined precisely in the paper. Roughly, “standard” means that paths are constructed only by concatenating arcs and previously discovered paths. This class includes the usual algorithms that work for the generalization of transitive closure to semiring problems. For the sparse case, we show that I/O equal to O(n²√(e/s)) is sufficient, although the algorithm we propose meets our definition of “standard” only if the underlying graph is acyclic. We also show that Ω(n²√(e/s)) is necessary for any standard algorithm in the sparse case. That settles the I/O complexity of the sparse/acyclic case, for standard algorithms. It is unknown whether this complexity can be achieved in the sparse, cyclic case, by a standard algorithm, and it is unknown whether the bound can be beaten by nonstandard algorithms. We then consider a special kind of standard algorithm, in which paths are constructed only by concatenating arcs and old paths, never by concatenating two old paths. This restriction seems essential if we are to take advantage of sparseness. Unfortunately, we show that almost another factor of n I/O is necessary. That is, there is an algorithm in this class using I/O O(n³√(e/s)) for arbitrary sparse graphs, including cyclic ones. Moreover, every algorithm in the restricted class must use Ω(n³√(e/s)/log³ n) I/O, on some cyclic graphs. SIGMOD Conference Polynomial Time Designs toward Both BCNF and Efficient Data Manipulation. Ke Wang 1990 We define the independence-reducibility based on a modification of key dependencies, which has better computational properties and is more practically useful than the original one based on key dependencies. Using this modification as a tool, we design BCNF databases that are highly desirable with respect to updates and/or query answering. In particular, given a set U of attributes and a set F of functional dependencies over U, we characterize when F can be embedded in a database scheme over U that is independent and is BCNF with respect to F; a polynomial time algorithm that tests this characterization and produces such a database scheme whenever possible is presented. The produced database scheme contains the fewest possible number of relation schemes. Then we show that designs of embedding constant-time-maintainable BCNF schemes and of embedding independence-reducible schemes share exactly the same method with the above design. Finally, a simple modification of this method yields a polynomial time algorithm for designing embedding separable BCNF schemes. SIGMOD Conference Set-Oriented Production Rules in Relational Database Systems. Jennifer Widom,Sheldon J. Finkelstein 1990 We propose incorporating a production rules facility into a relational database system. Such a facility allows definition of database operations that are automatically executed whenever certain conditions are met. In keeping with the set-oriented approach of relational data manipulation languages, our production rules are also set-oriented—they are triggered by sets of changes to the database and may perform sets of changes.
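The “standard” algorithms analyzed in the Ullman-Yannakakis abstract above construct paths only by concatenating arcs with previously discovered paths. For concreteness, here is a minimal in-memory Python sketch of such a computation (a semi-naive iteration; the names are illustrative, and the out-of-core blocking that the I/O analysis actually concerns is omitted).

def transitive_closure(arcs):
    # Compute all reachability pairs from a set of arcs (u, v).
    # Paths are extended only by appending a single arc to a previously
    # discovered path, the "standard" style of computation.
    succ = {}                          # adjacency index: u -> set of direct successors
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
    closure = set(arcs)
    delta = set(arcs)                  # paths discovered in the previous round
    while delta:
        new = {(u, w) for (u, v) in delta for w in succ.get(v, ())}
        delta = new - closure          # keep only genuinely new paths
        closure |= delta
    return closure

# usage: transitive_closure({(1, 2), (2, 3), (3, 1)}) yields all nine pairs
# over {1, 2, 3}, since the three arcs form a cycle.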
The condition and action parts of our production rules may refer to the current state of the database as well as to the sets of changes triggering the rules. We define a syntax for production rule definition as an extension to SQL. A model of system behavior is used to give an exact semantics for production rule execution, taking into account externally-generated operations, self-triggering rules, and simultaneous triggering of multiple rules. SIGMOD Conference A New Paradigm for Parallel and Distributed Rule-Processing. Ouri Wolfson,Aya Ozeri 1990 This paper is concerned with the parallel evaluation of datalog rule programs, mainly by processors that are interconnected by a communication network. We introduce a paradigm, called data-reduction, for the parallel evaluation of a general datalog program. Several parallelization strategies discussed previously in [CW, GST, W, WS] are special cases of this paradigm. The paradigm parallelizes the evaluation by partitioning among the processors the instantiations of the rules. After presenting the paradigm, we discuss the following issues, which we see as fundamental for parallelization strategies derived from the paradigm: properties of the strategies that enable a reduction in the communication overhead, decomposability, load balancing, and application to programs with negation. We prove that decomposability, a concept introduced previously in [WS, CW], is undecidable. VLDB The C-based Database Programming Language Jasmine/C. Masaaki Aoshima,Yoshio Izumida,Akifumi Makinouchi,Fumio Suzuki,Yasuo Yamane 1990 The C-based Database Programming Language Jasmine/C. VLDB Hybrid Transitive Closure Algorithms. Rakesh Agrawal,H. V. Jagadish 1990 Hybrid Transitive Closure Algorithms. VLDB The Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data. Divyakant Agrawal,Amr El Abbadi 1990 The Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data. VLDB Adaptable Recovery Using Dynamic Quorum Assignments. Bharat K. Bhargava,Shirley Browne 1990 Adaptable Recovery Using Dynamic Quorum Assignments. VLDB An Incremental Join Attachment for Starburst. Michael J. Carey,Eugene J. Shekita,George Lapis,Bruce G. Lindsay,John McPherson 1990 An Incremental Join Attachment for Starburst. VLDB Concept Description Language for Statistical Data Modeling. Tiziana Catarci,Giovanna D'Angiolini,Maurizio Lenzerini 1990 Concept Description Language for Statistical Data Modeling. VLDB Consistency of Versions in Object-Oriented Databases. Wojciech Cellary,Geneviève Jomier 1990 Consistency of Versions in Object-Oriented Databases. VLDB Deriving Production Rules for Constraint Maintenance. Stefano Ceri,Jennifer Widom 1990 Deriving Production Rules for Constraint Maintenance. VLDB Indexing in a Hypertext Database. Chris Clifton,Hector Garcia-Molina 1990 Indexing in a Hypertext Database. VLDB A Parallel Strategy for Transitive Closure using Double Hash-Based Clustering. Jean-Pierre Cheiney,Christophe de Maindreville 1990 A Parallel Strategy for Transitive Closure using Double Hash-Based Clustering. VLDB The Effect of Skewed Data Access on Buffer Hits and Data Contention in a Data Sharing Environment. Asit Dan,Daniel M. Dias,Philip S. Yu 1990 The Effect of Skewed Data Access on Buffer Hits and Data Contention in a Data Sharing Environment. VLDB The Performance and Utility of the Cactis Implementation Algorithms. Pamela Drew,Roger King,Scott E. Hudson 1990 The Performance and Utility of the Cactis Implementation Algorithms.
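The set-oriented flavor of the Widom-Finkelstein proposal above can be illustrated with a toy rule engine in which each rule sees the whole set of changes produced in the previous round rather than one tuple at a time. The change representation and every name in this Python sketch are assumptions made purely for illustration, not the SQL-based syntax or semantics defined in the paper.

class Rule:
    def __init__(self, name, condition, action):
        self.name = name
        self.condition = condition   # condition(db, changes) -> the subset of changes it cares about
        self.action = action         # action(db, relevant_changes) -> a new set of changes

def run_rules(db, initial_changes, rules, max_rounds=100):
    # Apply set-oriented rules to successive rounds of changes until quiescence.
    changes = list(initial_changes)
    for _ in range(max_rounds):
        if not changes:
            break
        for op, rel, tup in changes:          # install this round's changes
            if op == "insert":
                db.setdefault(rel, set()).add(tup)
            elif op == "delete":
                db.get(rel, set()).discard(tup)
        produced = []
        for rule in rules:
            relevant = rule.condition(db, changes)   # a rule is triggered by a *set* of changes
            if relevant:
                produced.extend(rule.action(db, relevant))
        changes = produced
    return db

# example: whenever employees are inserted, insert one audit row per new employee
# audit = Rule("audit",
#              lambda db, ch: [c for c in ch if c[0] == "insert" and c[1] == "emp"],
#              lambda db, ch: [("insert", "audit", c[2]) for c in ch])
# run_rules({}, [("insert", "emp", ("alice", 50000))], [audit])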
VLDB A Study of Three Alternative Workstation-Server Architectures for Object Oriented Database Systems. David J. DeWitt,Philippe Futtersack,David Maier,Fernando Vélez 1990 A Study of Three Alternative Workstation-Server Architectures for Object Oriented Database Systems. VLDB A Multidatabase Transaction Model for InterBase. Ahmed K. Elmagarmid,Yungho Leu,Witold Litwin,Marek Rusinkiewicz 1990 A Multidatabase Transaction Model for InterBase. VLDB The Time Index: An Access Structure for Temporal Data. Ramez Elmasri,Gene T. J. Wuu,Yeong-Joon Kim 1990 The Time Index: An Access Structure for Temporal Data. VLDB Non-Monotonic Knowledge Evolution in VLKDBs. Christian Esculier 1990 Non-Monotonic Knowledge Evolution in VLKDBs. VLDB Two Epoch Algorithms for Disaster Recovery. Hector Garcia-Molina,Christos A. Polyzois,Robert B. Hagmann 1990 Two Epoch Algorithms for Disaster Recovery. VLDB Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines. Shahram Ghandeharizadeh,David J. DeWitt 1990 Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines. VLDB A Probabilistic Framework for Vague Queries and Imprecise Information in Databases. Norbert Fuhr 1990 A Probabilistic Framework for Vague Queries and Imprecise Information in Databases. VLDB Parity Striping of Disk Arrays: Low-Cost Reliable Storage with Acceptable Throughput. Jim Gray,Bob Horst,Mark Walker 1990 Parity Striping of Disk Arrays: Low-Cost Reliable Storage with Acceptable Throughput. VLDB Query Processing for Multi-Attribute Clustered Records. Lilian Harada,Miyuki Nakano,Masaru Kitsuregawa,Mikio Takagi 1990 Query Processing for Multi-Attribute Clustered Records. VLDB Search Key Substitution in the Encipherment of B-Trees. Thomas Hardjono,Jennifer Seberry 1990 Search Key Substitution in the Encipherment of B-Trees. VLDB Distributed Transitive Closure Computations: The Disconnection Set Approach. Maurice A. W. Houtsma,Peter M. G. Apers,Stefano Ceri 1990 Distributed Transitive Closure Computations: The Disconnection Set Approach. VLDB An Adaptive Data Placement Scheme for Parallel Database Computer Systems. Kien A. Hua,Chiang Lee 1990 An Adaptive Data Placement Scheme for Parallel Database Computer Systems. VLDB On Restructuring Nested Relations in Partitioned Normal Form. Guy Hulin 1990 On Restructuring Nested Relations in Partitioned Normal Form. VLDB ILOG: Declarative Creation and Manipulation of Object Identifiers. Richard Hull,Masatoshi Yoshikawa 1990 ILOG: Declarative Creation and Manipulation of Object Identifiers. VLDB On Indexing Line Segments. H. V. Jagadish 1990 On Indexing Line Segments. VLDB Priority-Hints: An Algorithm for Priority-Based Buffer Management. Rajiv Jauhari,Michael J. Carey,Miron Livny 1990 Priority-Hints: An Algorithm for Priority-Based Buffer Management. VLDB Database Application Development as an Object Modeling Activity. Manfred A. Jeusfeld,Michael Mertikas,Ingrid Wetzel,Matthias Jarke,Joachim W. Schmidt 1990 Database Application Development as an Object Modeling Activity. VLDB Support for Temporal Data by Complex Objects. Wolfgang Käfer,Norbert Ritter,Harald Schöning 1990 Support for Temporal Data by Complex Objects. VLDB Database Updates through Abduction. Antonis C. Kakas,Paolo Mancarella 1990 Database Updates through Abduction. VLDB Right-, left- and multi-linear rule transformations that maintain context information. David B. 
Kemp,Kotagiri Ramamohanarao,Zoltan Somogyi 1990 Right-, left- and multi-linear rule transformations that maintain context information. VLDB Advanced Query Processing in Object Bases Using Access Support Relations. Alfons Kemper,Guido Moerkotte 1990 Advanced Query Processing in Object Bases Using Access Support Relations. VLDB Bucket Spreading Parallel Hash: A New, Robust, Parallel Hash Join Method for Data Skew in the Super Database Computer (SDC). Masaru Kitsuregawa,Yasushi Ogawa 1990 Bucket Spreading Parallel Hash: A New, Robust, Parallel Hash Join Method for Data Skew in the Super Database Computer (SDC). VLDB A Formal Approach to Recovery by Compensating Transactions. Henry F. Korth,Eliezer Levy,Abraham Silberschatz 1990 Compensating transactions are intended to handle situations where it is required to undo either committed or uncommitted transactions that affect other transactions, without resorting to cascading aborts. This stands in sharp contrast to the standard approach to transaction recovery where cascading aborts are avoided by requiring transactions to read only committed data, and where committed transactions are treated as permanent and irreversible. We argue that this standard approach to recovery is not suitable for a wide range of advanced database applications, in particular those applications that incorporate long-duration or nested transactions. We show how compensating transactions can be effectively used to handle these types of applications. We present a model that allows the definition of a variety of types of correct compensation. These types of compensation range from traditional undo, at one extreme, to application-dependent, special-purpose compensating transactions, at the other extreme. VLDB Triggered Real-Time Databases with Consistency Constraints. Henry F. Korth,Nandit Soparkar,Abraham Silberschatz 1990 Triggered Real-Time Databases with Consistency Constraints. VLDB Efficient Implementation of Loops in Bottom-Up Evaluation of Logic Queries. Juhani Kuittinen,Otto Nurmi,Seppo Sippu,Eljas Soisalon-Soininen 1990 Efficient Implementation of Loops in Bottom-Up Evaluation of Logic Queries. VLDB Hash-Based Join Algorithms for Multiprocessor Computers. Hongjun Lu,Kian-Lee Tan,Ming-Chien Shan 1990 Hash-Based Join Algorithms for Multiprocessor Computers. VLDB Referential Integrity Revisited: An Object-Oriented Perspective. Victor M. Markowitz 1990 Referential Integrity Revisited: An Object-Oriented Perspective. VLDB ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes. C. Mohan 1990 ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes. VLDB Commit_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing Systems. C. Mohan 1990 Commit_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing Systems. VLDB The Magic of Duplicates and Aggregates. Inderpal Singh Mumick,Hamid Pirahesh,Raghu Ramakrishnan 1990 The Magic of Duplicates and Aggregates. VLDB Performance Analysis of Disk Arrays under Failure. Richard R. Muntz,John C. S. Lui 1990 Performance Analysis of Disk Arrays under Failure. VLDB How to Forget the Past Without Repeating It. Jeffrey F. Naughton,Raghu Ramakrishnan 1990 Bottom-up evaluation of deductive database programs has the advantage that it avoids repeated computations by storing all intermediate results and replacing recomputation by table lookup. 
However, in general, storing all intermediate results for the duration of a computation wastes space. In this paper, we propose an evaluation scheme that avoids recomputation, yet for a fairly general class of programs at any given time stores only a small subset of the facts generated. The results constitute a significant first step in compile-time garbage collection for bottom-up evaluation of deductive database programs. VLDB Cooperative Transaction Hierarchies: A Transaction Model to Support Design Applications. Marian H. Nodine,Stanley B. Zdonik 1990 Cooperative Transaction Hierarchies: A Transaction Model to Support Design Applications. VLDB Measuring the Complexity of Join Enumeration in Query Optimization. Kiyoshi Ono,Guy M. Lohman 1990 Measuring the Complexity of Join Enumeration in Query Optimization. VLDB Efficient Main Memory Data Management Using the DBGraph Storage Model. Philippe Pucheral,Jean-Marc Thévenin,Patrick Valduriez 1990 Efficient Main Memory Data Management Using the DBGraph Storage Model. VLDB Synthesizing Database Transactions. Xiaolei Qian 1990 Synthesizing Database Transactions. VLDB Rule Ordering in Bottom-Up Fixpoint Evaluation of Logic Programs. Raghu Ramakrishnan,Divesh Srivastava,S. Sudarshan 1990 Logic programs can be evaluated bottom-up by repeatedly applying all rules, in “iterations”, until the fixpoint is reached. However, it is often desirable (and, in some cases, e.g. programs with stratified negation, even necessary to guarantee the semantics) to apply the rules in some order. We present two algorithms that apply rules in a specified order without repeating inferences. One of them (GSN) is capable of dealing with a wide range of rule orderings, but with a little more overhead than the well-known seminaive algorithm (which we call BSN). The other (PSN) handles a smaller class of rule orderings, but with no overheads beyond those in BSN. We also demonstrate that by choosing a good ordering, we can reduce the number of rule applications (and thus the number of joins). We present a theoretical analysis of rule orderings and identify orderings that minimize the number of rule applications (for all possible instances of the base relations) with respect to a class of orderings called fair orderings. We also show that though nonfair orderings may do a little better on some data sets, they can do much worse on others. The analysis is supplemented by performance results. VLDB Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines. Donovan A. Schneider,David J. DeWitt 1990 Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines. VLDB The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems. Bernhard Seeger,Hans-Peter Kriegel 1990 The Buddy-Tree: An Efficient and Robust Access Method for Spatial Data Base Systems. VLDB Transaction Support in Read Optimized and Write Optimized File Systems. Margo I. Seltzer,Michael Stonebraker 1990 Transaction Support in Read Optimized and Write Optimized File Systems. VLDB Distributed Linear Hashing and Parallel Projection in Main Memory Databases. Charles Severance,Sakti Pramanik,P. Wolberg 1990 Distributed Linear Hashing and Parallel Projection in Main Memory Databases. VLDB Elimination of View and Redundant Variables in a SQL-like Database Language for Extended NF2 Structures.
Norbert Südkamp,Volker Linnemann 1990 Elimination of View and Redundant Variables in a SQL-like Database Language for Extended NF2 Structures. VLDB A Temporal Relational Algebra as Basis for Temporal Relational Completeness. Alexander Tuzhilin,James Clifford 1990 A Temporal Relational Algebra as Basis for Temporal Relational Completeness. VLDB A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Y. Richard Wang,Stuart E. Madnick 1990 A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. VLDB Query Processing for Distance Metrics. Jason Tsong-Li Wang,Dennis Shasha 1990 Query Processing for Distance Metrics. VLDB Maintaining Consistency of Client-Cached Data. W. Kevin Wilkinson,Marie-Anne Neimat 1990 Maintaining Consistency of Client-Cached Data. VLDB Factoring Augmented Regular Chain Programs. Peter T. Wood 1990 Factoring Augmented Regular Chain Programs. VLDB An Adaptive Hash Join Algorithm for Multiuser Environments. Hansjörg Zeller,Jim Gray 1990 An Adaptive Hash Join Algorithm for Multiuser Environments. SIGMOD Record An SQL-Based Query Language For Networks of Relations. Amit Basu,Rafiul Ahad 1990 A set of relations can be modeled as a network through the use of image attributes, which are attributes defined on the domain of relation names. Such networks of relations can effectively meet many of the modeling requirements of advanced database applications such as engineering design and knowledge base systems. In this paper, we describe the features of ESQL, a novel query language that is an extension of SQL [Cham74] to exploit the added semantics of image attributes. Details of ESQL and its underlying principles can be found in [Ahad88] and [Ahad89]. SIGMOD Record New Hope on Data Models and Types: Report of an NSF-INRIA Workshop. Serge Abiteboul,Peter Buneman,Claude Delobel,Richard Hull,Paris C. Kanellakis,Victor Vianu 1990 In May, 1990, a small workshop was held in New Hope, Pennsylvania, to discuss the fundamental issues raised by continuing work on the interface between databases and programming languages. Four topics were addressed: new directions stemming from object-oriented data models, contributions of type theory to database programming languages (DBPLs), applications of logic to DBPL issues, and DBPL implementations. This workshop was organized under the auspices of the INRIA-NSF program, Languages for Databases and Knowledge Bases. SIGMOD Record Object-Oriented Database Systems: In Transition. François Bancilhon,Won Kim 1990 Object-Oriented Database Systems: In Transition. SIGMOD Record Announcements and Calls for Papers: POS 90, Management of Replicated Data 90, IFIP WG8.4 90, ICDT 90, Multimedia Inf. Systems 91, JCIT-5, Software Eng. J., ER 90, DEXA 90, ... 1990 Announcements and Calls for Papers: POS 90, Management of Replicated Data 90, IFIP WG8.4 90, ICDT 90, Multimedia Inf. Systems 91, JCIT-5, Software Eng. J., ER 90, DEXA 90, ... SIGMOD Record Multidatabase Interoperability. Yuri Breitbart 1990 Multidatabase Interoperability. SIGMOD Record Announcements and Calls for Papers: VLDB Journal, Critical Issues 90, CR, SIGIR Forum, IEEE Computer, CCW 91, ICDCS 91, JCIT-5, ... 1990 Announcements and Calls for Papers: VLDB Journal, Critical Issues 90, CR, SIGIR Forum, IEEE Computer, CCW 91, ICDCS 91, JCIT-5, ... 
SIGMOD Record Announcements and Calls for Papers: SIGMOD 91, PODS 91, VLDB 91, ER 91, SIGIR 91, SSD 91, BNCOD-9, ECOOP 91, DEXA 91, WADS 91, SIGSOFT 91 1990 Announcements and Calls for Papers: SIGMOD 91, PODS 91, VLDB 91, ER 91, SIGIR 91, SSD 91, BNCOD-9, ECOOP 91, DEXA 91, WADS 91, SIGSOFT 91 SIGMOD Record Extensible Database Management Systems. Michael J. Carey,Laura M. Haas 1990 Extensible Database Management Systems. SIGMOD Record "Response to R. T. Snodgrass's Letter in SIGMOD Record." C. J. Date 1990 "Response to R. T. Snodgrass's Letter in SIGMOD Record." SIGMOD Record Parallel Database Systems: The Future of Database Processing or a Passing Fad? David J. DeWitt,Jim Gray 1990 The concept of parallel database machines consisting of exotic hardware has been replaced by a fairly conventional shared-nothing hardware base along with a highly parallel dataflow software architecture. Such a design provides speedup and scaleup in processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems. SIGMOD Record Summary of the Final Report of the NSF Workshop on Scientific Database Management. James C. French,Anita K. Jones,John L. Pfaltz 1990 The National Science Foundation sponsored a two day workshop hosted by the University of Virginia on March 12-13, 1990 at which representatives from the earth, life, and space sciences met with computer scientists to discuss the issues facing the scientific community in the area of database management. The workshop1 participants concluded that initiatives by the National Science Foundation and other funding agencies, as well as specific discipline professional societies are urgently needed to address the problems facing scientists with respect to data management. This article presents a condensed version of the workshop final report emphasizing the technical research issues. SIGMOD Record Research Directions for Distributed Databases. Hector Garcia-Molina,Bruce G. Lindsay 1990 Research Directions for Distributed Databases. SIGMOD Record Four Valued Logic for Relational Database Systems. G. H. Gessert 1990 This paper proposes a specific four-valued logic (4VL) as a means of handling missing data in Relational Data Base Management systems. The proposed 4VL is a minor variant of the standard 4VL in which the “least true” condition is interpreted as “inapplicable” rather than “false”. The use of this 4VL is defended on the grounds that by defining several additional unary operators, the 4VL can be rendered intuitively manageable. The proposed unary operators contribute to the conceptual utility of the 4VL by providing an explicit way for users to relate familiar results of two-valued logic to analogous results in 4VL. SIGMOD Record A Note on the Translation of SQL to Tuple Calculus. Martin Gogolla 1990 This note presents a translation of a subset of the relational query language SQL into the well known tuple calculus. Roughly speaking, tuple calculus corresponds to first order predicate calculus. The SQL subset is relationally complete and represents a “relational core” of the language. Nevertheless, our translation is simple and elegant. Therefore it is especially well suited as a beginners course into the principles of a formal definition of SQL. SIGMOD Record Directions For Future Database Research & Development - Letter from the Issue Editor. Won Kim 1990 Directions For Future Database Research & Development - Letter from the Issue Editor. 
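To make the kind of correspondence described in Gogolla's note above concrete, here is one schematic instance of the translation; the particular query and the notation are illustrative and are not taken from the note. The SQL query SELECT e.NAME FROM EMP e WHERE e.SAL > 50000 corresponds to the tuple-calculus expression

\{\, t \mid \exists e \in \mathrm{EMP}\ (\, t.\mathrm{NAME} = e.\mathrm{NAME} \wedge e.\mathrm{SAL} > 50000 \,)\,\}

and the relationally complete SQL core treated in the note maps in this way clause by clause: the FROM clause introduces the existentially quantified tuple variables, the WHERE clause becomes the body of the quantified formula, and the SELECT clause determines the components of the result tuple t.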
SIGMOD Record Research Issues in Spatial Databases. Oliver Günther,Alejandro P. Buchmann 1990 Research Issues in Spatial Databases. SIGMOD Record Database Security: Current Status and Key Issues. Sushil Jajodia,Ravi S. Sandhu 1990 Database Security: Current Status and Key Issues. SIGMOD Record Semantic Modeling through Identification and Characterization of Objects. Jan Jonsson 1990 Semantic Modeling through Identification and Characterization of Objects. SIGMOD Record Selected Database Research at Stanford. Arthur M. Keller,Peter Rathmann,Jeffrey D. Ullman,Gio Wiederhold 1990 This report describes seven projects at the Computer Science Department of Stanford University that may be relevant to the SIGMOD community. SIGMOD Record Chair's Message. Won Kim 1990 Chair's Message. SIGMOD Record Chair's Message. Won Kim 1990 Chair's Message. SIGMOD Record Chair's Message. Won Kim 1990 Chair's Message. SIGMOD Record Database Security. Teresa F. Lunt,Eduardo B. Fernández 1990 Database Security. SIGMOD Record A Deductive Database Architecture Based on Partial Evaluation. Li Lei,Georges-Henri Moll,Jacques Kouloumdjian 1990 The implementation of a logic programming language for database management systems is a possible way to build a knowledge base management system. It allows the know-how in the fields of inference engines and of data management to be reused. However, the strategy of each component is very different: SLD resolution leads to one-tuple-at-a-time access to facts, as opposed to the set-oriented approach of databases. To address this problem, a new strategy has been designed to keep the advantages of each part while avoiding the drawbacks. The kernel of the system is a Partial Evaluator, designed as a metaprogram, which bridges the impedance mismatch gap. SIGMOD Record The Implication Problem for Inclusion Dependencies: A Graph Approach. Rokia Missaoui,Robert Godin 1990 In this paper, we propose a graph theoretic approach to deal with the implication problem for inclusion dependencies. By analogy with functional dependencies, we define and present algorithms for computing the following concepts: the closure of a relation scheme R for X according to a set of inclusion dependencies and the minimal cover for inclusion dependencies. SIGMOD Record Accommodating Imprecision in Database Systems: Issues and Solutions. Amihai Motro 1990 Most database systems are designed under assumptions of precision of both the data stored in their databases, and the requests to retrieve data. In reality, however, these assumptions are often invalid, and in recent years considerable attention has been given to issues of imprecision in database systems. In this paper we review the major solutions for accommodating imprecision, and we describe issues that have yet to be addressed, offering possible research directions. SIGMOD Record ACM News 1990 ACM News SIGMOD Record News 1990 News SIGMOD Record Yet Another Note on Minimal Covers. Jyrki Nummenmaa,Peter Thanisch 1990 In [Atk88] Atkins corrects a widely spread error in the algorithm for finding a minimal cover for a given set of functional dependencies. The erroneous form of the algorithm has been presented in [Sal86,StW83,Ull82,Yan88]. Unfortunately, though, there is an error also in the corrected algorithm. Atkins proposed the following algorithm for determining a minimal cover for a given set of functional dependencies F. SIGMOD Record Extending the Transaction Model to Capture more Meaning. Marek Rusinkiewicz,Ahmed K.
Elmagarmid,Yungho Leu,Witold Litwin 1990 Extending the Transaction Model to Capture more Meaning. SIGMOD Record Congresses on Databases. Fèlix Saltor 1990 Congresses on Databases. SIGMOD Record Report on the Workshop on Heterogeneous Database Systems held at Northwestern University, Evanston, Illinois, December 11-13, 1989, Sponsored by NSF. Peter Scheuermann,Ahmed K. Elmagarmid,Hector Garcia-Molina,Frank Manola,Dennis McLeod,Arnon Rosenthal,Marjorie Templeton 1990 Report on the Workshop on Heterogeneous Database Systems held at Northwestern University, Evanston, Illinois, December 11-13, 1989, Sponsored by NSF. SIGMOD Record Editor's Notes. Arie Segev 1990 Editor's Notes. SIGMOD Record Editor's Notes. Arie Segev 1990 Editor's Notes. SIGMOD Record Editor's Notes. Arie Segev 1990 Editor's Notes. SIGMOD Record SIGMOD Institutional Sponsors 1990 SIGMOD Institutional Sponsors SIGMOD Record SIGMOD Institutional Sponsors 1990 SIGMOD Institutional Sponsors SIGMOD Record Database Research at Bellcore. Amit P. Sheth 1990 Database Research at Bellcore. SIGMOD Record Database Systems: Achievements and Opportunities - The "Lagunita" Report of the NSF Invitational Workshop on the Future of Database System Research held in Palo Alto, California, February 22-23, 1990. Abraham Silberschatz,Michael Stonebraker,Jeffrey D. Ullman 1990 Database Systems: Achievements and Opportunities - The "Lagunita" Report of the NSF Invitational Workshop on the Future of Database System Research held in Palo Alto, California, February 22-23, 1990. SIGMOD Record Data Base Research at Berkeley. Michael Stonebraker 1990 Data Base Research at Berkeley. SIGMOD Record Third-Generation Database System Manifesto - The Committee for Advanced DBMS Function. Michael Stonebraker,Lawrence A. Rowe,Bruce G. Lindsay,Jim Gray,Michael J. Carey,Michael L. Brodie,Philip A. Bernstein,David Beech 1990 Third-Generation Database System Manifesto - The Committee for Advanced DBMS Function. SIGMOD Record Temporal Databases - Status and Research Directions. Richard T. Snodgrass 1990 It seems somehow fitting to begin this paper on databases that store historical information with a chronology, touching briefly on all work that I am aware of in this area. I discuss in some detail what I consider to be the ten most important papers and events in terms of their impact on the discipline of temporal databases. These are emphatically not meant to detract from the other excellent papers in temporal databases. My goal is to characterize the evolution of this field, as an introduction to the approximately 350 papers specifically relating time to databases that have appeared thus far. I then identify and discuss areas where more work is needed. SIGMOD Record Correction. Toby J. Teorey 1990 Correction. SIGMOD Record Computing Transitive Closures of Multilevel Relations. Bhavani M. Thuraisingham 1990 Recently many attempts have been made to implement recursive queries. Such queries are essential for the new generation intelligent database system applications. Much of the effort in recursive queries is focussed on transitive closure queries which are of practical significance. None of the work described investigates recursive query processing in secure database management systems. In this paper we propose centralized and distributed algorithms for implementing transitive closure queries for multilevel secure relational database management systems. SIGMOD Record Deductive Databases: Achievements and Future Directions. Jeffrey D.
Ullman,Carlo Zaniolo 1990 In the recent years, Deductive Databases have been the focus of intense research, which has brought dramatic advances in theory, systems and applications. A salient feature of deductive databases is their capability of supporting a declarative, rule-based style of expressing queries and applications on databases. As such, they find applications in disparate areas, such as knowledge mining from databases, and computer-aided design and manufacturing systems. In this paper, we briefly review the key concepts behind deductive databases and their newly developed enabling technology. Then, we describe current research on extending the functionality and usability of deductive databases and on providing a synthesis of deductive databases with procedural and object-oriented approaches. SIGMOD Record Incomplete Information in Object-Oriented Databases. Roberto Zicari 1990 We present a way to handle incomplete information both at schema and object instance level in an object-oriented database. Incremental schema design becomes possible with the introduction of generic classes. Incomplete data in an object instance is handled with the introduction of explicit null values in a similar way as in the relational and nested relations data models. ICDE Data Sharing in a Large Heterogeneous Environment. Rafael Alonso,Daniel Barbará,Steve Chon 1991 The issues involved in sharing information among a large collection of independent databases is explored. Some of the distinguishing features that characterize such large-scale environments (such as size, autonomy, and heterogeneity) are discussed. A multistep information sharing process for those systems is outlined and an architecture supporting that exchange is presented. A detailed description of a working prototype based on this architecture and some measurements of its performance are provided ICDE Performance Characteristics of Protocols With Ordered Shared Locks. Divyakant Agrawal,Amr El Abbadi,A. E. Lang 1991 A family of locking-based protocols is analyzed that use a novel mode of locks called ordered sharing. Using a centralized database simulation model, it is demonstrated that these protocols exhibit comparable performance to that of traditional locking-based protocols when data contention is low and exhibit superior performance when data contention is high. It is shown that the performance of these protocols improves as physical resources become more plentiful. This is particularly significant since two-phase locking degrades due to data and not resource contention. Thus, introducing additional resources improves the performance of the proposed protocols while it does not benefit two-phase locking significantly ICDE Object Versioning in Ode. Rakesh Agrawal,S. Buroff,Narain H. Gehani,Dennis Shasha 1991 "In designing the versioning facility in Ode, a few but semantically sound and powerful concepts are introduced that allow implementation of a wide variety of paradigms. Some of the salient features of these versioning facilities are the following: (1) object versioning is orthogonal to type; (2) reference to an object can be bound statically to a specific version of the object or dynamically to whatever is its latest version; and (3) both temporal as well as derived-from relationships between versions of an object are maintained automatically. These facilities have been incorporated seamlessly into Ode's database programming language, O++. 
The new language constructs are powerful enough to make O++ a suitable platform for implementing a variety of versioning paradigms and application-specific systems" ICDE ESQL: A Query Language for the Relation Model Supporting Image Domains. Rafiul Ahad,Amit Basu 1991 It is shown that by simply extending the relational data model to support image domains, the model becomes rich enough for many advanced applications. Extensions to the relational algebra and the relational calculus are developed to exploit the semantics of image domains. The extended calculus is incorporated in the query language ESQL, which is a strict superset of the structured query language (SQL). A technique to implement ESQL using a preprocessor to SQL is shown. A noteworthy observation is that it is possible to incorporate multirelations without any modifications to the storage structures and data definition language of SQL. It is also possible to support queries of multirelations in ESQL without modifying the features of SQL in any way as far as traditional data manipulation is concerned ICDE "Title, General Co-Chairpersons' Message, Program Co-Chairpersons' Message, Committees, Reviewers, Table of Contents, Author Index." 1991 "Title, General Co-Chairpersons' Message, Program Co-Chairpersons' Message, Committees, Reviewers, Table of Contents, Author Index." ICDE An Indexing Technique for Object-Oriented Databases. Elisa Bertino 1991 An Indexing Technique for Object-Oriented Databases. ICDE Constraint-Based Reasoning in Deductive Databases. Jiawei Han 1991 Constraint-based reasoning in deductive databases is studied, with the focus on set-oriented, constraint-based processing of functional linear recursions. A technique is developed which compiles a functional linear recursion into chain or bounded forms and analyzes efficient processing of the compiled chains based on different kinds of constraints. It is shown that rule constraints should be compiled together with the rectified recursions; finiteness constraints and monotonicity constraints should be used in the analysis of finite evaluability and termination; and query constraints should be pushed into the compiled chains, when possible, for efficient set-oriented evaluation. Constraint-based processing can be enhanced by dynamic constraint enforcement in query evaluation. The method is illustrated using a typical traversal recursion problem. It is concluded that the principles developed are useful for a large set of deductive database application problems ICDE Divide and Conquer: A Basis for Augmenting a Conventional Query Optimizer with Multiple Query Proceesing Capabilities. Sharma Chakravarthy 1991 Divide and Conquer: A Basis for Augmenting a Conventional Query Optimizer with Multiple Query Proceesing Capabilities. ICDE Inferential Modelling Technique for Constructing Second Generation Knowledge Based Systems. Christine W. Chan,Raymond E. Jennings,Paitoon Tontiwachwuthikul 1991 A novel inferential modeling technique (IMT) which aims to first clarify the domain ontology and inferences before proceeding onto the dynamic processing aspects of expertise is presented. The proposed model consists of the four levels of strategy, task, inference, and domain layers. The former two levels constitute the dynamic component and the latter two the static component. 
The static representation consists of the domain primitives and their interrelations, while the dynamic component represents a variety of tasks or activities which manipulate the domain entities in order to accomplish objectives. It is argued that the technique is useful as a tool in the elicitation and analysis phases of the development of a knowledge-based system. A preliminary application of the technique for constructing a knowledge-based system in the chemical engineering domain is described ICDE Determining Beneficial Semijoins for a Join Sequence in Distributed Query Processing. Ming-Syan Chen,Philip S. Yu 1991 The problem of combining join and semijoin reducers for distributed query processing is studied. An approach, of interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree that provides an efficient way to identify for each semijoin its correlated semijoins as well as its reducible relations under the join sequence. In light of these properties, an algorithm is developed to determine an effective sequence of join reducers. Examples are also given to illustrate the results, which show that the approach of interleaving a join sequence with beneficial semijoins is effective in reducing the amount of data transmission to process distributed queries ICDE An Efficient Hybrid Join Algorithm: A DB2 Prototype. Josephine M. Cheng,Donald J. Haderle,Richard Hedges,Balakrishna R. Iyer,Ted Messinger,C. Mohan,Yun Wang 1991 A new join method, called hybrid join, is proposed which uses the join-index filtering and the skip sequential prefetch mechanism for efficient data access. With this method, the outer table is sorted on the join column. Then, the outer is joined with the index on the join column of the inner. The inner tuple is represented by its surrogate, equivalent of its physical disk address, which is carried in the index. The partial join result is sorted on the surrogate and then the inner table is accessed sequentially to complete the join result. Local predicate filtering can also be applied before the access of the inner relation through the index AND/ORing. Efficient methods for skip sequential access and prefetching of logically discontiguous leaf pages of B+-tree indexes are also presented ICDE Using Type Inference and Induced Rules to Provide Intensional Answers. Wesley W. Chu,Rei-Chi Lee,Qiming Chen 1991 A new approach is presented that uses knowledge induction and type inference to provide intensional answers. Machine learning techniques are used to analyze database contents and to induce a set of if-then rules. Type inference which is based on forward inference and backward inference is developed that uses database type hierarchies to derive the intensional answers for a query. It is shown that more precise intensional answers can be derived by properly merging the type inference results from multiple type hierarchies. A prototype intensional query-processing system which uses the proposed approach has been implemented. Using a ship database as a testbed, the effectiveness of the use of type interference and induced rules to derive specific intensional answers is demonstrated ICDE Locking Granularity in Multiprocessor Database Systems. Sivarama P. Dandamudi,Siu-Lun Au 1991 The effects of locking granularity in a shared-nothing multiprocessor database system are analyzed. 
The analysis shows that when the system is lightly loaded fine granularity (one lock per database entity) is needed when transactions access the database randomly. However, when transactions access the database sequentially, coarse granularity is desired. When the system is heavily loaded, coarse granularity is desirable. The results also indicate that horizontal partitioning results in better performance than random partitioning ICDE Optimization of Generalized Transitive Closure Queries. Shaul Dar,Rakesh Agrawal,H. V. Jagadish 1991 Two complementary techniques for optimizing generalized transitive closure queries are presented: (i) selections on paths are applied during the closure computation, so that paths that are not in the result and that are not needed to compute the result are pruned as early as possible and (ii) paths that are in the result, or needed to compute the result, are represented in a condensed form. The condensed representation holds the minimal information that is necessary for the specified label computations and selections to be performed. The combined impact of these techniques is that the number of paths generated during the closure computation and the storage required for each such path are both greatly reduced ICDE Object-Centered Constraints. Lois M. L. Delcambre,Billy B. L. Lim,Susan Darling Urban 1991 The notion of object-centered (O-C) constraints, a subset of range-restricted Horn clauses with existential variables is presented. O-C constraints correspond, intuitively, to sets of clauses that involve connected objects or objects that can be located through navigation. Other types of constraints, e.g. transition or other temporal constraints are not specifically addressed. The research philosophy adopted is that such constraints can be expressed by introducing additional constructs into the schema, e.g. old-Student or new-Student and then expressing constraints in the same manner as the O-C constraints. This approach is conceptually accurate; it provides a uniform method for expressing and enforcing constraints using generalized constraint analysis. It allows the user to precisely describe the intended semantics, e.g. when to delete the old values after an update. This philosophy basically states that database constraints necessarily constrain database objects; in order to address additional issues like old/new values, the additional structures should be brought into the purview of the database by defining them in the database schema ICDE Maintaining Quasi Serializability in Multidatabase Systems. Weimin Du,Ahmed K. Elmagarmid,Won Kim 1991 A scheduler producing quasi-serializable executions for concurrency control in multidatabase systems (MDBSs) is presented. An algorithm is proposed which ensures quasi-serializability by controlling submissions of global transactions. The algorithm groups global transactions in such a way that transactions in a group affect each other in a partial order. Transaction groups are executed separately and in a consistent order at all local sites. The algorithm differs from the others in that it does not violate local autonomy, provides a high degree of concurrency, and is globally deadlock-free ICDE A Polymorphic Relational Algebra and Its Optimization. David Eichmann,D. Alton 1991 The notion of a polymorphic database and the optimization of polymorphic queries-specifically, optimization of queries under the Morpheus data model-is addressed. 
The notion of query optimization through type inference, applicable both to polymorphic databases and traditional monomorphic databases, is introduced. The Morpheus data model and its type inference rules are reviewed and a polymorphic relational algebra is characterized. It is shown how the inference rules can be used for static optimization of a few sample queries. It is concluded that type inference provides a formal mechanism for optimizing a very rich extension to the relational algebra. The approach retains the basic framework that lead to the wide acceptance of the relational model, while enriching it with the structural expressiveness of the object-oriented approaches of recent years ICDE Efficient Implementation Techniques For the Time Index. Ramez Elmasri,Yeong-Joon Kim,Gene T. J. Wuu 1991 A new indexing technique, time index, for improving the performance of certain classes of temporal queries was previously described by the author (see The 16th Conference on Very Large Databases (1990)). Three variations for implementing the time index efficiently are proposed and the performance of these three variations is compared with the performance of the original time index. Various parameters such as average lifetime of a version, average number of versions per object, block size and block clustering, and query time interval length are discussed. Simulation results show how these parameters affect the performance of the various implementation variations for the time index ICDE DOT: A Spatial Access Method Using Fractals. Christos Faloutsos,Yi Rong 1991 DOT: A Spatial Access Method Using Fractals. ICDE A Rule-Based Query Rewriter in an Extensible DBMS. Béatrice Finance,Georges Gardarin 1991 An integrated approach to query rewriting in an extensible database server supporting ADTs, objects, deductive capabilities and integrity constraints is described. The approach is extensible through a uniform high level rule language used by the database implementor to specify optimization techniques. This rule language is compiled to enrich the strategy component and the knowledge base of the rewriter. Rules can be added to specify various aspects of query rewriting, including operation permutation, recursive query processing, integrity constraint addition, predicate simplification and method call simplification ICDE Wait Depth Limited Concurrency Control. Peter A. Franaszek,John T. Robinson,Alexander Thomasian 1991 A new class of wait depth limited (WDL) concurrency control (CC) methods is described. The WDL policy is shown by simulations to be effective both in systems with proportionately large I/O latencies as well as in systems with large numbers of processors. WDL is also attractive in terms of implementation. In many applications little or no system modification may be required other than the CC code. Since it is a lock-based method, unlike optimistic methods, there is no need for private copies of modified data for each transaction, nor is there any need for snapshot, timestamp, or versioning mechanisms to guarantee that transactions are always provided with a fully consistent database image ICDE An Analysis Technique for Transitive Closure Algorithms: A Statistical Approach. Sumit Ganguly,Ravi Krishnamurthy,Abraham Silberschatz 1991 A novel experimental procedure, based on a standard statistical estimation procedure, is presented to estimate the performance of transitive closure algorithms. 
This experimental procedure has been exemplified in three contexts: (1) comparison of a suite of algorithms; (2) analysis of one particular algorithm; and (3) analysis of the transitive closure problem itself. It is shown that the number of duplicate edges generated (by most algorithms) can be more than ten times the size of the transitive closure, even for small graphs. The majority of these duplicates are due to the existence of strongly connected components in the graph. This experimental approach can be generalized to estimate various performance metrics for a large class of database queries. It is both simple and general and provides the necessary ingredients for a guess-and-verify paradigm of testing hypotheses. ICDE On Serializability of Multidatabase Transactions Through Forced Local Conflicts. Dimitrios Georgakopoulos,Marek Rusinkiewicz,Amit P. Sheth 1991 A multidatabase transaction management mechanism called the optimistic ticket method (OTM) is introduced for enforcing global serializability. It permits the commitment of multidatabase transactions only if their relative serialization order is the same in all participating local database systems (LDBSs). OTM requires the LDBSs to guarantee only local serializability. The basic idea in OTM is to create direct conflicts between multidatabase transactions at each LDBS in order to determine the relative serialization order of their subtransactions. A refinement of OTM, called the implicit ticket method (ITM), is also introduced that uses implicit tickets and eliminates ticket conflicts but works only when the participating LDBSs use rigorous transaction scheduling mechanisms. ITM uses the local commitment order of each subtransaction to determine its implicit ticket value. It achieves global serializability by controlling the commitment (execution order) and thus the serialization order of multidatabase transactions. Neither OTM nor ITM violates the autonomy of the LDBSs, and they can be combined in a single comprehensive mechanism. ICDE Real Time Retrieval and Update of Materialized Transitive Closure. Keh-Chang Guh,Chengyu Sun,Clement T. Yu 1991 A data structure is used to store materialized transitive closure such that the evaluation of transitive closure, deletions and insertions of tuples can be performed efficiently. Experiments have been carried out on a Sun/3/180 system. It is verified experimentally and theoretically that it takes on the average O(m″) to retrieve the ancestors/descendants of the given node, where m″ is the number of ancestors/descendants of the given node, and it takes on the average O(m*m') to perform an insertion or a deletion of a tuple (a,b), where m is the number of ancestors of a+1 and m' is the number of descendants of b+1. It is shown that, when the data type is integer, retrieval of the ancestors/descendants of a given node takes no more than 0.0001 s; insertion/deletion of a tuple and the corresponding update involving m*m″ elements in the data structure takes approximately 0.07 s. When the data type is a string of length 20, the corresponding retrieval time and insertion/deletion times are 0.0008 s and 1.5 s respectively. ICDE Query Processing Algorithms for Temporal Intersection Joins. Himawan Gunadhi,Arie Segev 1991 Intersection join processing in temporal relational databases is investigated.
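The insertion cost O(m*m') quoted in the Guh-Sun-Yu abstract above reflects the standard incremental-maintenance identity for a materialized transitive closure: after an edge (a, b) is added, the only pairs that can become newly reachable are (ancestor of a or a itself, descendant of b or b itself). A minimal Python sketch of the insertion case follows; the set-of-pairs representation and the names are illustrative, and deletions, which require checking for alternative paths, are not shown.

def insert_edge(closure, a, b):
    # closure is a set of (x, y) pairs meaning "x reaches y"; update it in place
    # for a newly inserted edge (a, b) and return the pairs that were added.
    ancestors = {x for (x, y) in closure if y == a} | {a}
    descendants = {y for (x, y) in closure if x == b} | {b}
    # every ancestor of a now reaches every descendant of b
    new_pairs = {(x, y) for x in ancestors for y in descendants} - closure
    closure |= new_pairs
    return new_pairs

# usage: closure = {(1, 2)}; insert_edge(closure, 2, 3)
# adds {(1, 3), (2, 3)}, so that both 1 and 2 now reach 3.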
An analysis is presented of the characteristics and processing requirements of three types of intersection join operators: the time-join, temporal equijoin on the surrogate, and temporal equijoin on the temporal attribute. Based on the physical organization of the database and on the semantics of the operators, several algorithms were developed to process these joins efficiently. The primary cost variables were identified for each algorithm and their performance is compared to that of a conventional nested-loop join procedure. It is shown that the algorithms developed can reduce processing costs significantly ICDE Spatial Database Indices for Large Extended Objects. Oliver Günther,Hartmut Noltemeier 1991 An analytic model is given for the behavior of dynamic spatial index structures under insertions and deletions. Based on this model, a new tool called the oversize shelf, which improves the performance of tree-based indices by minimizing redundancy, is optimized and evaluated. Oversize shelves are extra disk pages that are attached to the interior nodes of a tree-based spatial index structure. These pages are used to accommodate very large data objects in order to avoid their excessive fragmentation. When inserting a new object into the tree, it must be decided whether to store it on an oversize shelf or to insert it into the subtrees. From the analytic model a threshold q* for the object size is obtained. If the object is larger than q*, it is more favorable to put it on the oversize shelf; otherwise, insertion into the subtrees is preferable ICDE An Association Algebra For Processing Object-Oriented Databases. Mingsen Guo,Stanley Y. W. Su,Herman Lam 1991 An association algebra (A-algebra) is presented for manipulating object-oriented (O-O) databases, which is analogous to the relational algebra for relational databases. In this algebra, objects and their associations in an O-O database are uniformly represented by association patterns and are manipulated by a number of operators. These operators are defined to operate on association patterns of both heterogeneous and homogeneous structures. Very complex structures (e.g. network structures of object associations across several classes) can be directly manipulated by these operators. Therefore, the association algebra has greater expressive power than the relational algebra, which operates on relations of compatible structures. Some mathematical properties of these operators are described together with their application in query decomposition and optimization. The algebra has been used as the basis for the design and implementation of an O-O query language called OQL and a knowledge rule specification language ICDE Modeling Transition. Gary Hall,Ranabir Gupta 1991 It is argued that the dynamics of an application domain is best modeled as patterns of change in the entities that make up the domain. An abstraction mechanism for semantic data models is described which represents the transition of domain entities among entity classes. The model of transitions is related to a general computational formalism with well-understood properties.
It is shown that the transition abstraction mechanism facilitates the accurate conceptual modeling of the static nature of the domain, assists in the design of database transactions, enables certain kinds of inference, and leads to the ability of a database to actively respond at a high level to low-level updates of the data it contains ICDE Preserving and Generating Objects in the LIVING IN A LATTICE Rule Language. Andreas Heuer,Peter Sander 1991 "LIVING IN A LATTICE is presented as a rule-based query language for an object-oriented database model. The model supports complex objects, object identity, and is-a-relationships. The instances are described by object relations, which are functions from a set of objects to value sets and other object sets. The rule language is based on object-terms which provide access to objects via is-a-relationships. Rules are divided into two classes: object-preserving rules manipulating existing objects and object-generating ones creating objects with properties derived from existing objects. The derived object sets are included in a lattice of object types. Some conditions are given under which the instances of the rule's heads are consistent, i.e., where the properties of the derived objects are functionally determined by the objects" ICDE Unilateral Commit: A New Paradigm for Reliable Distributed Transaction Processing. Meichun Hsu,Abraham Silberschatz 1991 "An alternative approach to distributed transaction processing based on the unilateral commit paradigm (UCP) and on persistent transmission is proposed. Instead of executing a unit of work as a single distributed transaction, as in the traditional transaction execution paradigm, opportunities are sought to execute it as a structured set or a sequence of smaller, possibly single-site atomic transactions. Each such transaction, once executed, is committed independently of other transactions in the task. A method for rigorously maintaining the linkage between the steps is provided by a persistent transmission mechanism. It is argued that UCP is especially attractive since it relies on a site's ability to execute conventional flat local transactions and does not require additional capabilities such as the ability to execute nested transactions" ICDE Parallel Computation of Direct Transitive Closures. Yan-Nong Huang,Jean-Pierre Cheiney 1991 "To efficiently process recursive queries in a DBMS (database management system), a parallel, direct transitive closure algorithm is proposed. Efficiency is obtained by reorganizing the computation order of Warren's algorithm. The number of transfers among processors depends only on the number of processors and does not depend on the depth of the longest path. The evaluation shows an improvement due to the parallelism and the superiority of the proposed algorithm over recent proposals. The speed of the production of new tuples is very high and the volume of transfers between the sites is reduced" ICDE Navigation and Schema Transformations for Producing Nested Relations from Networks. Mizuho Iwaihara,Tetsuya Furukawa,Yahiko Kambayashi 1991 Navigation and Schema Transformations for Producing Nested Relations from Networks. ICDE Object/Behavior Diagrams. Gerti Kappel,Michael Schrefl 1991 A novel diagram technique is presented to depict the structure as well as the behavior of objects. One of its distinguishing characteristics is its strict adherence to the object-oriented paradigm.
A first prototype of an editor for object/behavior diagrams has been developed and is running on SUN-workstations. To assist the user the editor provides hypertext style facilities for navigating through different diagrams. For example, by clicking on an activity in a life cycle diagram one moves to the activity specification diagram of that activity ICDE Precomputation in a Complex Object Environment. Anant Jhingran 1991 "Certain analytical results are established about precomputation in a complex object environment. The concept of `intervals' is introduced and it is shown that object-identifier caching might be beneficial provided the frequency of updates to the set of subobjects associated with an object is not too high. However, in most cases, procedures with value caching outperformed other forms of representations. Certain performance characteristics of a knapsack-based algorithm are established which can be used to optimally decide which of the precomputed results to cache. Simulation results demonstrate how well this scheme performs. In particular, a binary search strategy seems ideally suited for caching" ICDE A Knowledge-Based Subsystem for a Natural Language Interface to a Database that Predicts and Explains Query Failures. Stefan W. Joseph,Romas Aleliunas 1991 "A practical method is presented for eliminating unnecessary database operations that often arise from poorly posed natural language queries. This subsystem consists mainly of a knowledge-base whose rules are semantic constraints of the database. The inference procedure's actions are, unlike that of similar inference engines, strictly controlled by the structure of the query. Because of this it is easy to implement and it is relatively fast. It is believed that this subsystem represents a practical incremental improvement that can be made to any relational database front end, not just those that must cope with natural language queries. The procedure also has value as part of a practical semantic query optimizer" ICDE Compiling a Rule Database Program into a C/SQL Application. Gerald Kiernan,Christophe de Maindreville 1991 The design and the implementation of a rule database language (RDL) compiler is presented. In this design, the RDL/C language supports both declarative programming based on a production rule language and C-based procedural programming. The data model is relational. This implies that all rule programs can be solved without having to download data from the database management system (DBMS) into some working memory. The language supports domain variables which can appear in rules. These variables are monitored by the inference engine and included in the semantics of rule firing. A partial ordering among rules is available to the user. The RDL/C compiler translates RDL/C source code into C code with embedded structured query language (SQL) statements. Its implementation is compared to fully integrated deductive databases and to loosely coupled systems. It is shown how the rule-based paradigm for a database can be used as a framework for a general-purpose database application generator ICDE Performance Evaluation of Functional Disk System (FDS-R2). Masaru Kitsuregawa,Miyuki Nakano,Mikio Takagi 1991 The performance of the functional disk system with relational database engine (FDS-R2) is evaluated in detail in view of two points. 
First, the performance evaluation of the combined hash algorithm on FDS-R2 is reported using the projection and aggregation operations in addition to the join operation, and the results are analyzed in order to verify the effectiveness of the proposed processing method. Second, several measured results of performance evaluations with the expanded version of the Wisconsin Benchmark are given and analyzed. FDS-R2 attained higher performance for very large relations as compared to other large database systems such as Gamma and Teradata. In this evaluation, it is also shown that the performance of the relational operations can be improved substantially by using an efficient hashing strategy for large relations on FDS-R2 ICDE The Software Architecture of a Parallel Processing System for Advanced Database Applications. Yasushi Kiyoki,Takahiro Kurosawa,Kazuhiko Kato,Takashi Masuda 1991 A parallel processing scheme and software architecture of SMASH, a parallel processing system for supporting a wide variety of database applications, is presented. The main feature of this system is that functional programming concepts are applied to define new database operations and data types and to exploit parallelism inherent in an arbitrary set of database operations. A primitive set (SMASH primitive set) of the software architecture is presented that defines an abstract machine interface between high-level database languages and general-purpose hardware systems for parallel processing. The primitive set is used to implement functional computation systems for executing arbitrary database operations in parallel. A previously proposed stream-oriented parallel processing scheme for relational database operations is extended to support more complex database operations which deal with complex data structures. Several experimental results of parallel processing for database operations are shown to clarify the feasibility of the proposed architecture ICDE Semantic Query Reformulation in Deductive Databases. Sang-goo Lee,Lawrence J. Henschen,Ghassan Z. Qadah 1991 A method is proposed for identifying relevant integrity constraints (ICs) for queries involving joins/unions of base relations and defined relations by use of graphs. The method does not rely on heavy preprocessing or redundancy. To effectively select those ICs that are relevant to a given query, the relationship between the predicates in the query is identified using an AND/OR tree where an AND node represents a join operation and an OR node represents a union operation. Ways of collecting ICs are described that are not directly related to the query but can be useful in query optimization ICDE Processing of Multiple Queries in Distributed Databases. A. Y. Lu,Phillip C.-Y. Sheu 1991 Processing of Multiple Queries in Distributed Databases. ICDE Performance Measurement of Some Main Memory Database Recovery Algorithms. Vijay Kumar,Albert Burger 1991 "The performance of two main-memory database system (MDBS) algorithms is studied. M.H. Eich's (Proc. of the Fifth Int. Workshop on Database Machines, Oct. 1987) and T.J. Lehman's (Ph. D. Thesis, Univ. of Wisconsin-Madison, Aug 1986) algorithms are selected for this study. The investigation indicates that the shadow approach, as used by Eich, has some advantages over the update-in-place strategy of Lehman. The shadow approach has faster response for normal transactions even though transaction commit is slower compared to the update-in-place approach.
It is concluded that irrespective of the recovery algorithm an efficient load balancing plays an important part in the performance of the system. It is suggested that processors should be allocated to any kind of activity on a demand basis. It is discovered that individual recovery of lockable units as used in Lehman's algorithm is a better choice since it increases the system availability after a system failure. It also allows background recovery to go in parallel to normal transaction processing" ICDE Incremental Restart. Eliezer Levy 1991 Incremental Restart. ICDE Natural Joins in Relational Databases with Indefinite and Maybe Information. Ken-Chih Liu,Lu Zhang 1991 Natural Joins in Relational Databases with Indefinite and Maybe Information. ICDE Interactive Manipulation of Object-oriented Views. Jean-Claude Mamou,Claudia Bauzer Medeiros 1991 An approach to providing view support in object-oriented (O2 ) databases is presented. The approach uses what is called a hyper-view and combines results from research on database design and man-machine interfaces. A mechanism that allows interacting with databases through hyper-views and built on top of the O2 system is described. The approach is based on conciliating stored data and their visualization. Not only views corresponding to one class, but also multiclass views are considered. Hyper-views are supported by ToonMaker, a user interface generator system implemented for the O2 database management system ICDE Problems Underlying the Use of Referential Integrity in Relational Database Management Systems. Victor M. Markowitz 1991 Problems Underlying the Use of Referential Integrity in Relational Database Management Systems. ICDE Conflict-driven Load Control for the Avoidance of Data-Contention Thrashing. Axel Mönkeberg,Gerhard Weikum 1991 A conflict-driven approach to automatic load control is presented. Various definitions of conflict rate are investigated as to whether they are suitable as a control metric. Evidence is provided that there exists at least one suitable metric and a single value, called the critical conflict rate, that indicates data-contention (DC) thrashing regardless of the number or types of transactions in the system. Based on this observation, an algorithm is developed that admits new transactions and/or cancels running transactions depending on the current conflict rate. The algorithm and its various substrategies for transaction admission and transaction cancellation are evaluated under several sorts of overload situations. Simulation experiments with this algorithm have shown fairly good results, i.e. DC thrashing was prevented in overload situations without overly limiting the achievable throughput under regular conditions. Load control is fully automated, i.e., it does not require any manual tuning parameters ICDE ARIES-RRH: Restricted Repeating of History in the ARIES Transaction Recovery Method. C. Mohan,Hamid Pirahesh 1991 "A method called ARIES-RRH (algorithm for recovery and isolation exploiting semantics with restricted repeating of history) is presented which is a modified version of the ARIES transaction recovery and concurrency control method implemented to varying degrees in Starburst, QuickSilver, the OS/2 Extended Edition Database Manager, DB2 V2, Workstation Data Save Facility/VM and the Gamma database machine. 
The repeating history paradigm of ARIES is analyzed to propose a more efficient handling of redos, especially when the smallest granularity of locking is not less than a page, by combining the paradigm of selective redo from DB2 V1. Even with fine-granularity locking, it is not always the case that all the unapplied but logged changes need to be redone. ARIES-RRH, which incorporates these changes, still retains all the good properties of ARIES: avoiding undo of undos, single-pass media recovery, nested top actions, etc. The fundamentals behind why DB2 V1's selective redo works, in spite of failures during restart recovery, are also explained" ICDE Exploiting Parallelism in the Implementation of Agna, a Persistent Programming System. Rishiyur S. Nikhil,Michael L. Heytens 1991 A design for AGNA, a persistent object system that utilizes parallelism in a fundamental way to enhance performance, is presented. The underlying thesis is that fine-grained parallelism is essential for achieving scalable performance on parallel multiple instruction/multiple data (MIMD) machines. This, in turn, implies a data-driven model of computation for efficiency. The complete design based on these principles starts with a declarative source language because such languages reveal the most fine-grained parallelism. It is described how transactions are compiled into an abstract, fine-grained parallel machine called P-RISC. The P-RISC virtual heap is implemented in the memory and disk of a parallel machine in such a way that paging is overlapped with useful computation. The current implementation status is described, some preliminary performance results are reported and the approach presented is compared to several recent parallel database system projects ICDE Execution Plan Balancing. Marguerite C. Murphy,Ming-Chien Shan 1991 A novel relational query optimization technique for use in shared memory multiprocessor database systems is described. A collection of practical algorithms for allocating computational resources to parallel select-project-join (SPJ) query execution plans is presented. The computational resources considered include disk bandwidth, memory buffers and general-purpose processors. The goal of the allocation algorithms is to produce minimum-duration execution strategies with computational resource requirements that are less than the given system bounds. Preliminary experimental results indicate that the algorithms can be realized and are effective in producing good execution plans. Disk bandwidth appears to be the critical system resource. The most effective means to decrease complex query response time appears to be reducing disk contention. This can be achieved by increasing the total number of disks and/or rearranging the placement of data on disks ICDE Scheduling Batch Transactions on Shared-Nothing Parallel Database Machines: Effects of Concurrency and Parallelism. Tadashi Ohmori,Masaru Kitsuregawa,Hidehiko Tanaka 1991 Concurrency-control scheduling of batch transactions on shared-nothing (or loosely-coupled) multiprocessor database machines is discussed. Various schedulers are tested for these batch transactions to examine how well they perform when both intertransaction parallelism and intratransaction parallelism are limited. New schedulers designed for batch transaction processing are outlined which use a new tool called a weighted transaction-precedence graph (WTPG).
Simulation results show that two new schedulers (globally and locally optimized WTPG schedulers) are the best performers under various workloads ICDE Atomic Commitment for Integrated Database Systems. Peter Muth,Thomas C. Rakow 1991 A systematic discussion of atomic commitment for heterogeneous database systems is presented. An analysis is given of two alternative protocols for atomic commitment: commitment of local transactions either after or before the global commit or abort decision is made. The impact of the protocols on recovery and concurrency control is shown. Atomicity, consistency, isolation, and durability properties are achieved for global transactions. It is demonstrated that commitment before fits best with multilevel transactions. In this case, the commitment protocol causes no additional overhead and a higher degree of concurrency can be achieved ICDE Interval Assignment for Periodic Transactions in Real-Time Database Systems. Hidenori Nakazato,Kwei-Jay Lin 1991 The problem of assigning execution intervals for periodic transactions in real-time databases is discussed. The object value evolution rate and the importance of the object are used as two factors for deciding transaction periods. Two different objective functions are defined to reflect different system design goals. Algorithms for optimizing each objective function are presented. The principle behind these algorithms is to allow the transactions which have higher weights to be executed more often. It is assumed that systems use the rate monotonic algorithm for scheduling transactions. Many other scheduling algorithms, like the earliest deadline first algorithm, can also be used. Some examples using the proposed algorithms are given ICDE A Methodology for Benchmarking Distributed Database Management Systems. Cyril U. Orji 1991 A methodology for benchmarking distributed database management systems is proposed. A distributed environment is characterized in terms of the communication costs incurred in data movement between sites, the number of nodes that participate in processing a query and the data distribution scheme used in the network. These three major characteristics form a basis for eight query types that capture the query performance characteristics in the network. It is demonstrated that the performance characteristics of any distributed database management system can be captured by running queries based on these query types. Preliminary results obtained in applying the methodology in a single-user LAN environment are presented ICDE An Efficient Semantic Query Optimization Algorithm. HweeHwa Pang,Hongjun Lu,Beng Chin Ooi 1991 An efficient semantic query optimization algorithm is proposed, in which all possible transformations are tentatively applied to the query. Instead of physically modifying the query, the transformation process classifies the predicates into imperative, optional or redundant. At the end of the transformation process, all the imperative predicates are retained while the redundant predicates are eliminated. Optional predicates are retained or discarded based on the estimated cost/benefit of retaining them. The issue of the grouping of semantic constraints to reduce the overhead of retrieving constraints and checking whether each constraint is relevant to the current query is also addressed.
Based on the proposed algorithm, a prototype semantic query optimizer has been built and preliminary experiments show that the optimizer performs well for large databases ICDE Voting with Regenerable Volatile Witnesses. Jehan-François Pâris,Darrell D. E. Long 1991 Voting protocols ensure the consistency of replicated objects by requiring all read and write requests to collect an appropriate quorum of replicas. It is proposed to replace some of these replicas with volatile witnesses that have no data and require no stable storage, and to regenerate them instead of waiting for recovery. The small size of volatile witnesses allows them to be regenerated much easier than full replicas. Regeneration attempts are also much more likely to succeed since volatile witnesses can be stored on diskless sites. It is shown that under standard Markovian assumptions two full replicas and one regenerable volatile witness managed by a two-tier dynamic voting protocol provide a higher data availability than three full replicas managed by majority consensus voting or optimistic dynamic voting provided site failures can be detected significantly faster than they can be repaired ICDE Request Order Linked List (ROLL): A Concurrency Control Object for Centralized and Distributed Database Systems. William Perrizo 1991 A database concurrency control object called ROLL (request order linked list), which is a linked list of bit vectors, is introduced together with three simple operations available to transactions: POST, CHECK and RELEASE. POST is used to establish serialization order. CHECK is used to determine current resource availability. RELEASE is used to relinquish resources. ROLL is based on the serialization graph testing method, but no system scheduler module is involved. Using ROLL, waiting, restarting, deadlock and livelock are minimized and almost all operations can be invoked in parallel by individual transaction manager modules. The ROLL object, performance, problems and desirable extensions are discussed ICDE Domain Vector Accelerator for Relational Operations. William Perrizo,James Gustafson,Daniel Thureen,David Wenberg 1991 Domain Vector Accelerator for Relational Operations. ICDE Perfect Hashing Functions for Hardware Applications. M. V. Ramakrishna,G. A. Portice 1991 Perfect hashing functions are determined that are suitable for hardware implementations. A trial-and-error method of finding perfect hashing functions is proposed using a simple universal2 class (H3) of hashing functions. The results show that the relative frequency of perfect hashing functions within the class H3 is the same as predicted by the analysis for the set of all functions. Extensions of the basic scheme can handle dynamic key sets and large key sets. Perfect hashing functions can be found using software, and then loaded into the hardware hash address generator. Inexpensive associative memory can be used as a general memory construct offered by the system services of high-performance (super) computers. It has a potential application for storing operating system tables or internal tables for software development tools, such as compilers, assemblers and linkers. Perfect hashing in hardware may find a number of other applications, such as high speed event counting and text searching ICDE Spatial Join Indices. Doron Rotem 1991 Algorithms based on grid files as the underlying spatial index are presented for spatial joins in databases which store images, pictures, maps and drawings. 
For typical data distributions, it is shown that the size of the index and its maintenance cost are relatively small. The effect of diagonal distributions and different densities of the two grid files on the size of the index is also studied. It is expected that similar algorithms can be employed with other types of multidimensional data structures ICDE A Semantic Integrity Framework: Set Restrictions for Semantic Groupings. Elke A. Rundensteiner,Lubomir Bic,Jonathan P. Gilbert,Meng-Lai Yin 1991 Three of the most common fundamental groupings that are utilized in semantic database models are considered: set groupings, power set groupings, and Cartesian aggregation groupings. For each, useful restrictions that control its structure and composition are defined. This permits each grouping to capture more subtle distinctions of the concepts or situations in the application environment. The resulting set of restrictions forms a framework for integrity constraints in semantic data models. This framework is targeted towards advanced applications, such as computer-aided design, office automation, and artificial intelligence, which require the support of more sophisticated relationships among data than traditional database domains ICDE Modeling Uncertainty in Databases. Fereidoon Sadri 1991 "Relational algebra operations were extended to produce, together with answers to queries, information regarding sources that contributed to the answers. The author's previous model is reviewed and the semantic interpretation is presented. It is shown that extended relational algebra operations are precise, that is, they produce exactly the same answers that are expected under the semantic interpretation. Algorithms for computing the reliability of answers to a query are also reviewed and their correctness under the semantic interpretation proposed is proved" ICDE Meta-reasoning: An Incremental Compilation Approach. Abdul Sattar,Randy Goebel 1991 "An incremental compilation approach to meta-reasoning is presented together with a method to update dynamically changing knowledge bases. The compilation process translates meta-level specification of facts and hypotheses into sentences of clausal logic. It then incrementally computes inconsistent sets of instances of hypotheses and records potential crucial literals. The extra information computed during compilation enables the theorem prover to avoid redundant computations and to efficiently update the compiled knowledge. Whenever a new fact is learned the effects of the fact are computed incrementally, without recompiling. A relationship between potential crucial literals and Reiter and de Kleer's prime implicants shows that this approach may be useful in incrementally computing and maintaining the prime implicants, as well" ICDE The Architecture of BrAID: A System for Bridging AI/DB Systems. "Amit P. Sheth,Anthony B. O'Hare" 1991 The design of BrAID (a bridge between artificial intelligence and database management systems), an experimental system for the efficient integration of logic-based artificial intelligence (AI) and databases (DB) technologies, is described. 
Features provided by BrAID include (a) access to conventional DBMSs, (b) support for multiple inferencing strategies, (c) a powerful caching subsystem that manages views and uses subsumption to facilitate the reuse of previously cached data, (d) lazy or eager evaluation of queries submitted by the AI system, and (e) the generation of advice by the AI system to aid in cache management and query execution planning. Some of the key aspects of the BrAID architecture are discussed, focusing on the generation of advice by the AI system and its use by a cache management system to increase efficiency in accessing remote DBMSs through the selective application of such techniques as prefetching, query generalization, result caching, attribute indexing, and lazy evaluation ICDE Lk: A Language for Capturing Real World Meanings of the Stored Data. D. G. Shin 1991 "A knowledge representation language Lk is introduced that is tailored for expressing the real-world meanings of stored data. Lk is developed to achieve a tight coupling between a knowledge base component and the database system. Lk offers (1) a flexible descriptive power which facilitates concepts to be expressed at different levels of granularity; (2) a versatile association mechanism which is capable of linking partially related concepts; and (3) a set of specialization and generalization operators that enable inexact reasoning in a heuristically controlled environment. Examples are provided to illustrate the language's expressive power, its associability, and the inference operations. An example of processing an inference query is given to show the application of various utilities of Lk" ICDE Evaluation of Rule Processing Strategies In Expert Databases. Arie Segev,J. Leon Zhao 1991 Rule processing strategies in expert database systems which involve rules conditional on join results of base relations are studied. In particular, those rules that require very fast response time in their evaluation are considered. It is proposed to materialize the results of firing a rule in a relation, the rule relation. Performance evaluation of several strategies shows that under the clustered B-trees, strategies using pattern relations perform better than those without pattern relations. The strategy with skinny pattern relations performs poorly in comparison to that with bulky pattern relations. The selective bulky pattern strategy performs better than the bulky pattern strategy. The selective pattern strategy outperforms other strategies in terms of expected total cost. However, it always uses more storage space than the direct materialization ICDE Read Optimized File System Designs: A Performance Evaluation. Margo I. Seltzer,Michael Stonebraker 1991 A performance comparison is presented of several file system allocation policies. The file systems are designed to provide high bandwidth between disks and main memory by taking advantage of parallelism in an underlying disk array catering to large units of transfer, and minimizing the bandwidth dedicated to the transfer of metadata. All of the file systems described use a multiblock allocation strategy which allows both large and small files to be allocated efficiently. Simulation results show that these multiblock policies result in systems that are able to utilize a large percentage of the underlying disk bandwidth (more than 90% in sequential cases). 
As general-purpose systems are called upon to support more data-intensive applications such as databases and supercomputing, these policies offer an opportunity to provide superior performance to a larger class of users ICDE Query Pairs as Hypertext Links. Katsumi Tanaka,N. Nishikawa,S. Hirayama,K. Nanba 1991 A new idea is proposed for constructing object-oriented hypertext database systems: query pairs as hypertext links, where each query is defined over a collection of document objects that are classified using a class hierarchy. With this idea, users need not modify their hypertext links in response to insertions, deletions and updates of document objects. Also, when a database schema (here, a class hierarchy) evolves, a systematic method can be considered to modify user-defined query-pair links. The TextLink-III system that was developed based on this idea and that is currently running is described. The notable features of TextLink-III are the following: document objects are classified by class hierarchy and the notion of class expression is used for formulating a query for a class hierarchy; multiple viewpoint support; and reduced link maintenance under data updates ICDE Performance Limits of Two-Phase Locking. Alexander Thomasian 1991 "A novel mean-value analysis method for two-phase locking (2PL) is presented which extends previous work to the important case of variable-size transactions. The system performance expressed as the fraction of blocked transactions (β) is determined by solving a cubic equation in β whose coefficients are functions of a single parameter (α), which determines the degree of lock contention in the system. In fact, α is proportional to the mean number of lock requests per transaction (ηc) and to the mean waiting time (W1) for a lock held by an active transaction. For α < 0.226, the performance of the system is determined by the smallest root of the cubic and the system is thrashing otherwise, i.e., a large fraction of the transactions in the system are blocked. Validation of the analytic solution against simulation results shows that the analysis is quite accurate up to the point beyond which the system thrashes. It is shown that the variability of transaction size has a major effect on the degree of lock contention, since both ηc and W1 are affected by this distribution. A theoretical justification for Tay's rule of thumb that ηc should be smaller than 0.7 to avoid thrashing is provided. It is shown that 2PL is susceptible to a cusp catastrophe. Sources of instability are identified, and methods for load control to avoid thrashing are suggested" ICDE Efficiently Maintaining Availability in the Presence of Partitionings in Distributed Systems. Peter Triantafillou,David J. Taylor 1991 A new approach is presented for handling partitionings in replicated distributed databases. Mechanisms are developed through which transactions can access replicated data objects and observe delays similar to those in nonreplicated systems while enjoying the availability benefits of replication. The replication control protocol, called VELOS, achieves optimal availability, according to a well-known metric, while ensuring one-copy serializability. It is shown to provide better availability than other methods which meet the same optimality criterion.
It offers these availability characteristics without relying on system transactions that must execute to restore availability when failures and recoveries occur but which introduce significant delays to user transactions ICDE Implementation and Evaluation of a Browsing Algorithm for Design Applications. Yosihisa Udagawa 1991 The implementation and evaluation of an extended relational database called ADAM (advanced database abstraction mechanism) are discussed. A browsing algorithm is developed for composite objects to support reuse of the objects. The algorithm takes advantage of aggregation hierarchies to select related composite objects and to order them based on two measures, i.e., the relative difference of matching components (RDMC) and the number of nonmatching components (NNC). The algorithm has been implemented and various kinds of experiments have been carried out with a standard integrity constraint database. The results show that the CPU time depends roughly linearly on the number of selected objects. Around 90% of the CPU time is consumed in calculating RDMC and NNC. RDMC is more sensitive to retrieval conditions than NNC ICDE Design Overview of the Aditi Deductive Database System. Jayen Vaghani,Kotagiri Ramamohanarao,David B. Kemp,Zoltan Somogyi,Peter J. Stuckey 1991 An overview of the structure of Aditi, a disk-based deductive database system under continuous development at the University of Melbourne, is presented. The aim of the project is to find out what implementation methods and optimization techniques would make deductive databases competitive with current commercial relational databases. The structure of the Aditi prototype is based on a variant of the client-server model. The front end of Aditi interacts with the user exclusively in a logical language that has more expressive power than relational query languages. The back end uses relational technology for efficiency in the management of disk-based data and uses some optimization algorithms especially developed for the bottom-up evaluation of logical queries involving recursion. The system has been functional for almost two years now, and has already proven its worth as a research tool ICDE How Spacey Can They Get? Space Overhead for Storage and Indexing with Object-Oriented Databases. Mary Jane Willshire 1991 "The impact of physical storage model choice on the performance of an object-oriented database is studied. It is examined how the space overhead of each of six physical storage models reacts to changes in database parameters such as directed acyclic graph (DAG) shape, number of instances per class, and distribution of instances over the DAG. Home-class, leaf-overlap, and split-instance physical models consistently require the least storage, whereas repeat-classes, universal-class, and value-triple models require the most. For all models, the depth of the DAG has the strongest impact on space overhead. A set of analytic formulas is developed that allows a database designer to estimate database size for each physical model" ICDE An Object-Oriented Query Processor that Produces Monotonically Improving Approximate Answers. Susan V. Vrbsky,Jane W.-S. Liu 1991 An object-oriented query processor is described that makes approximate answers available if there is not enough time to produce an exact answer or if part of the database is unavailable. The accuracy of the approximate result produced improves monotonically with the amount of data retrieved to produce the result.
The query-processing algorithm is based on an approximate relational data model and works within a standard relational algebra framework. The query processor maintains an object-oriented view on an underlying level and can be implemented on a relational database system with little change to the relational architecture. It is shown how a monotone query-processing strategy can be implemented, making effective use of semantic information presented by the object-oriented view ICDE An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. Joel L. Wolf,Daniel M. Dias,Philip S. Yu,John Turek 1991 An Effective Algorithm for Parallelizing Hash Joins in the Presence of Data Skew. ICDE Optimal Buffer Partitioning for the Nested Block Join Algorithm. Joel L. Wolf,Balakrishna R. Iyer,Krishna R. Pattipati,John Turek 1991 An efficient, exact algorithm is developed for optimizing the performance of nested block joins. The method uses both dynamic programming and branch-and-bound. In the process of deriving the algorithm, the class of resource allocation problems for which the greedy algorithm applies has been extended. Experiments with this algorithm on extremely large problems show that it is superior to all other known algorithms by a wide margin ICDE Distributed Query Optimization by One-Shot Fixed-Precision Semi-Join Execution. Chihping Wang,Victor O. K. Li,Arbee L. P. Chen 1991 A novel semijoin execution strategy is proposed which allows parallelism and processes multiple semijoins simultaneously. In practice, most of the parameters needed for query optimization, such as relation cardinality and selectivity, are of fixed precision. Imposing this fixed-precision constraint, an efficient distributed query processing algorithm is developed. For situations where the fixed-precision constraint does not apply, a method to truncate the parameters and to use the same algorithm to find near-optimal solutions is proposed. By analyzing the truncation errors, a quantitative comparison between the near-optimal solutions and the optimal ones is provided ICDE First-Order Logic Reducible Programs. Ke Wang,Li-Yan Yuan 1991 "Programs for which the least fixed point exists are considered. A program is first-order logic reducible (FOL-reducible) with respect to a set of integrity constraints if all its valid fixed points are least fixed points. For an FOL-reducible program, a logical assertion about least fixed points is reduced to a logical assertion about all first-order logic models. This makes it possible to characterize, in the first-order logic, some important `all states' properties of programs for which no proof procedures exist in general. This method is applied to the following properties: containment of programs, independence of updates with respect to queries and integrity constraints, and characterization and implication of integrity constraints in programs. It is shown that the transitive closure of a graph is FOL-reducible with respect to the constraint of acyclicity. The `all states' framework requires a modification of the standard treatment of fixed points and completed programs" ICDE A Framework for Schema Updates In An Object-Oriented Database System. Roberto Zicari 1991 "A `reasonable' minimal set of primitives for updating an object-oriented (O2) database schema is defined and the problems which need to be solved in order to obtain a usable schema update mechanism are shown.
The distinction between structural and behavioral consistency for the O2 system is described in some detail and it is demonstrated how updates could be performed by invoking an interactive tool. Updates are classified into three categories. Each category is explained in detail" ICDE Optimal Buffer Allocation in A Multi-Query Environment. Philip S. Yu,Douglas W. Cornell 1991 The concepts of memory consumption and return on consumption (ROC) are used as the basis of memory allocations. A global optimization strategy using simulated annealing is developed which minimizes the average response time over all queries under the constraint that the total memory consumption rate has to be less than the buffer size. It selects the optimal join method and memory allocation for all queries simultaneously. By analyzing the way that the optimal strategy makes memory allocations, a heuristic threshold strategy is proposed. The threshold strategy is based on the concept of ROC. As the memory consumption rate by all queries is limited by the buffer size, the strategy tries to allocate the memory so as to make sure that a certain level of ROC is achieved. A simulation model is developed to demonstrate that the heuristic strategy yields performance that is very close to that of the optimal strategy and is far superior to the conventional allocation strategy ICDE An Evaluation Framework for Algebraic Object-Oriented Query Models. Li Yu,Sylvia L. Osborn 1991 "An evaluation framework consisting of five categories of criteria is developed for evaluating the relative merits of object algebras, namely, object-orientedness, expressiveness, formalness, performance and database issues. Four recently proposed object algebras are evaluated against these criteria. It is shown that there exists no object algebra that satisfies all the criteria. It is argued that, since some of the criteria may not be compatible, a feasible object algebra has to make some tradeoffs to suit domain-specific needs. It is possible to identify a minimal subset of the criteria. The criterion that an object algebra should support encapsulation seems to be the most important. If an object algebra fails to support this criterion, its semantics is inconsistent with the concept of `data abstraction', which makes a language object-oriented" SIGMOD Conference Database Programming Languages: A Functional Approach. Jurgen Annevelink 1991 Database Programming Languages: A Functional Approach. SIGMOD Conference Objects and Views. Serge Abiteboul,Anthony J. Bonner 1991 Objects and Views. SIGMOD Conference Using Multiversion Data for Non-interfering Execution of Write-only Transactions. Divyakant Agrawal,V. Krishnamurthy 1991 Using Multiversion Data for Non-interfering Execution of Write-only Transactions. SIGMOD Conference Version Management of Composite Objects in CAD Databases. Rafi Ahmed,Shamkant B. Navathe 1991 Version Management of Composite Objects in CAD Databases. SIGMOD Conference Updating Relational Databases through Object-Based Views. Thierry Barsalou,Arthur M. Keller,Niki Siambela,Gio Wiederhold 1991 Updating Relational Databases through Object-Based Views. SIGMOD Conference Spatial Priority Search: An Access Technique for Scaleless Maps. Bruno Becker,Hans-Werner Six,Peter Widmayer 1991 Spatial Priority Search: An Access Technique for Scaleless Maps. SIGMOD Conference Data Caching Tradeoffs in Client-Server DBMS Architectures. Michael J. Carey,Michael J. Franklin,Miron Livny,Eugene J.
Shekita 1991 Data Caching Tradeoffs in Client-Server DBMS Architectures. SIGMOD Conference Nested Relation Based Database Knowledge Representation. Qiming Chen,Yahiko Kambayashi 1991 Nested Relation Based Database Knowledge Representation. SIGMOD Conference Effective Clustering of Complex Objects in Object-Oriented Databases. Jia-bing R. Cheng,Ali R. Hurson 1991 Effective Clustering of Complex Objects in Object-Oriented Databases. SIGMOD Conference Trait: An Attribute Management System for VLSI Design Objects. Tzi-cker Chiueh,Randy H. Katz 1991 Trait: An Attribute Management System for VLSI Design Objects. SIGMOD Conference Extracting Concurrency from Objects: A Methodology. Panos K. Chrysanthis,S. Raghuram,Krithi Ramamritham 1991 Extracting Concurrency from Objects: A Methodology. SIGMOD Conference Replica Control in Distributed Systems: An Asynchronous Approach. Calton Pu,Avraham Leff 1991 Replica Control in Distributed Systems: An Asynchronous Approach. SIGMOD Conference Set-Oriented Constructs: From Rete Rule Bases to Database Systems. Douglas N. Gordin,Alexander J. Pasik 1991 Set-Oriented Constructs: From Rete Rule Bases to Database Systems. SIGMOD Conference MMDB Reload Algorithms. Le Gruenwald,Margaret H. Eich 1991 MMDB Reload Algorithms. SIGMOD Conference New Directions For Uncertainty Reasoning In Deductive Databases. Ulrich Güntzer,Werner Kießling,Helmut Thöne 1991 New Directions For Uncertainty Reasoning In Deductive Databases. SIGMOD Conference An Extended Memoryless Inference Control Method: Accounting for Dependence in Table-level Controls. S. C. Hansen,E. A. Unger 1991 An Extended Memoryless Inference Control Method: Accounting for Dependence in Table-level Controls. SIGMOD Conference Error-Constraint COUNT Query Evaluation in Relational Databases. Wen-Chi Hou,Gultekin Özsoyoglu,Erdogan Dogdu 1991 Error-Constraint COUNT Query Evaluation in Relational Databases. SIGMOD Conference Incomplete Objects - A Data Model for Design and Planning Applications. Tomasz Imielinski,Shamim A. Naqvi,Kumar V. Vadaparty 1991 Incomplete Objects - A Data Model for Design and Planning Applications. SIGMOD Conference On the Propagation of Errors in the Size of Join Results. Yannis E. Ioannidis,Stavros Christodoulakis 1991 On the Propagation of Errors in the Size of Join Results. SIGMOD Conference Left-Deep vs. Bushy Trees: An Analysis of Strategy Spaces and its Implications for Query Optimization. Yannis E. Ioannidis,Younkyung Cha Kang 1991 Left-Deep vs. Bushy Trees: An Analysis of Strategy Spaces and its Implications for Query Optimization. SIGMOD Conference A Retrieval Technique for Similar Shapes. H. V. Jagadish 1991 A Retrieval Technique for Similar Shapes. SIGMOD Conference Towards a Multilevel Secure Relational Data Model. Sushil Jajodia,Ravi S. Sandhu 1991 Towards a Multilevel Secure Relational Data Model. SIGMOD Conference Efficient Assembly of Complex Objects. Thomas Keller,Goetz Graefe,David Maier 1991 Efficient Assembly of Complex Objects. SIGMOD Conference Function Materialization in Object Bases. Alfons Kemper,Christoph Kilger,Guido Moerkotte 1991 Function Materialization in Object Bases. SIGMOD Conference Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data. Curtis P. Kolovson,Michael Stonebraker 1991 Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data. SIGMOD Conference Language Features for Interoperability of Databases with Schematic Discrepancies. 
Ravi Krishnamurthy,Witold Litwin,William Kent 1991 Language Features for Interoperability of Databases with Schematic Discrepancies. SIGMOD Conference Fully Persistent B+-trees. Sitaram Lanka,Eric Mays 1991 Fully Persistent B+-trees. SIGMOD Conference Computers versus Common Sense. Douglas B. Lenat 1991 Computers versus Common Sense. SIGMOD Conference Starburst II: The Extender Strikes Back! Guy M. Lohman,George Lapis,Tobin J. Lehman,Rakesh Agrawal,Roberta Cochrane,John McPherson,C. Mohan,Hamid Pirahesh,Jennifer Widom 1991 Starburst II: The Extender Strikes Back! SIGMOD Conference An Optimistic Commit Protocol for Distributed Transaction Management. Eliezer Levy,Henry F. Korth,Abraham Silberschatz 1991 An Optimistic Commit Protocol for Distributed Transaction Management. SIGMOD Conference LLO: An Object-Oriented Deductive Language with Methods and Method Inheritance. Yanjun Lou,Z. Meral Özsoyoglu 1991 LLO: An Object-Oriented Deductive Language with Methods and Method Inheritance. SIGMOD Conference Optimization and Evaluation of Database Queries Including Embedded Interpolation Procedures. Leonore Neugebauer 1991 Optimization and Evaluation of Database Queries Including Embedded Interpolation Procedures. SIGMOD Conference Flexible Buffer Allocation Based on Marginal Gains. Raymond T. Ng,Christos Faloutsos,Timos K. Sellis 1991 Flexible Buffer Allocation Based on Marginal Gains. SIGMOD Conference HYDRO: A Heterogeneous Distributed Database System. William Perrizo,Joseph Rajkumar,Prabhu Ram 1991 HYDRO: A Heterogeneous Distributed Database System. SIGMOD Conference Glue-Nail: A Deductive Database System. Geoffrey Phipps,Marcia A. Derr,Kenneth A. Ross 1991 Glue-Nail: A Deductive Database System. SIGMOD Conference Aspects: Extending Objects to Support Multiple, Independent Roles. Joel E. Richardson,Peter M. Schwarz 1991 Aspects: Extending Objects to Support Multiple, Independent Roles. SIGMOD Conference Multi-Disk B-trees. Bernhard Seeger,Per-Åke Larson 1991 Multi-Disk B-trees. SIGMOD Conference A Non-deterministic Deductive Database Language. Yeh-Heng Sheng 1991 A Non-deterministic Deductive Database Language. SIGMOD Conference K: A High-Level Knowledge Base Programming Language for Advanced Database Applications. Yuh-Ming Shyy,Stanley Y. W. Su 1991 K: A High-Level Knowledge Base Programming Language for Advanced Database Applications. SIGMOD Conference Performance of B-Tree Concurrency Algorithms. V. Srinivasan,Michael J. Carey 1991 Performance of B-Tree Concurrency Algorithms. SIGMOD Conference Are Standards the Panacea for Heterogeneous Distributed DBMSs? Glenn Thompson 1991 Are Standards the Panacea for Heterogeneous Distributed DBMSs? SIGMOD Conference Managing Persistent Objects in a Multi-Level Store. Michael Stonebraker 1991 This paper presents an architecture for a persistent object store in which multi-level storage is explicitly included. Traditionally, DBMSs have assumed that all accessible data resides on magnetic disk, and recently several researchers have begun to consider the possibility that significant amounts of data will occupy space in a main memory cache. We feel that object bases will emerge in which time-critical objects reside in main memory, other objects are disk resident, and the remainder occupy tertiary memory. Moreover, it is possible that more than three levels will be present, and that some of these levels will be on remote hardware. This paper contains an architectural proposal addressing these needs along with a sketch of the required query optimizer.
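The multi-level storage idea summarized in the preceding abstract can be pictured with a small sketch. The following toy Python class is purely illustrative and is not taken from the paper; all class and method names are hypothetical. It models three storage levels (main memory, disk, tertiary) as dictionaries and promotes an object one level toward main memory on each access, which is one simple way such a store could cache hot objects; the real proposal also covers additional and remote levels plus query optimization, which the sketch omits.

```python
# Minimal sketch of a three-level object store (assumed design, not the paper's).
class MultiLevelStore:
    def __init__(self):
        # Fastest to slowest; each level is modeled here as a plain dict.
        self.levels = [dict(), dict(), dict()]  # memory, disk, tertiary

    def put(self, oid, obj, level=2):
        """Store an object at a given level (default: tertiary)."""
        self.levels[level][oid] = obj

    def get(self, oid):
        """Search levels from fastest to slowest; on a hit below the top,
        promote the object one level up to mimic caching across levels."""
        for i, level in enumerate(self.levels):
            if oid in level:
                obj = level[oid]
                if i > 0:
                    del level[oid]
                    self.levels[i - 1][oid] = obj
                return obj
        raise KeyError(oid)


if __name__ == "__main__":
    store = MultiLevelStore()
    store.put("design#42", {"kind": "gate", "pins": 3})  # starts on tertiary
    print(store.get("design#42"))  # hit on tertiary, promoted to disk
    print(store.get("design#42"))  # hit on disk, promoted to main memory
```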
SIGMOD Conference Space Optimization in the Bottom-Up Evaluation of Logic Programs. S. Sudarshan,Divesh Srivastava,Raghu Ramakrishnan,Jeffrey F. Naughton 1991 Space Optimization in the Bottom-Up Evaluation of Logic Programs. SIGMOD Conference A Stochastic Approach for Clustering in Object Bases. Manolis M. Tsangaris,Jeffrey F. Naughton 1991 A Stochastic Approach for Clustering in Object Bases. SIGMOD Conference Algebraic Support for Complex Objects with Arrays, Identity, and Inheritance. Scott L. Vandenberg,David J. DeWitt 1991 Algebraic Support for Complex Objects with Arrays, Identity, and Inheritance. SIGMOD Conference Cache Consistency and Concurrency Control in a Client/Server DBMS Architecture. Yongdong Wang,Lawrence A. Rowe 1991 Cache Consistency and Concurrency Control in a Client/Server DBMS Architecture. SIGMOD Conference Dynamic File Allocation in Disk Arrays. Gerhard Weikum,Peter Zabback,Peter Scheuermann 1991 Dynamic File Allocation in Disk Arrays. SIGMOD Conference Incremental Evaluation of Rules and its Relationship to Parallelism. Ouri Wolfson,Hasanat M. Dewan,Salvatore J. Stolfo,Yechiam Yemini 1991 Incremental Evaluation of Rules and its Relationship to Parallelism. VLDB Management Of Schema Evolution In Databases. José Andany,Michel Léonard,Carole Palisser 1991 Management Of Schema Evolution In Databases. VLDB Optimization for Spatial Query Processing. Walid G. Aref,Hanan Samet 1991 Optimization for Spatial Query Processing. VLDB A Relationship Mechanism for a Strongly Typed Object-Oriented Database Programming Language. Antonio Albano,Giorgio Ghelli,Renzo Orsini 1991 A Relationship Mechanism for a Strongly Typed Object-Oriented Database Programming Language. VLDB On Maintaining Priorities in a Production Rule System. Rakesh Agrawal,Roberta Cochrane,Bruce G. Lindsay 1991 On Maintaining Priorities in a Production Rule System. VLDB Algebraic Properties of Bag Data Types. Joseph Albert 1991 Algebraic Properties of Bag Data Types. VLDB A Model for Active Object Oriented Databases. Catriel Beeri,Tova Milo 1991 A Model for Active Object Oriented Databases. VLDB An Iterative Method for Distributed Database Design. Rex Blankinship,Alan R. Hevner,S. Bing Yao 1991 An Iterative Method for Distributed Database Design. VLDB Logic Programming Environments for Large Knowledge Bases: A Practical Perspective (Abstract). Jorge B. Bocca 1991 Logic Programming Environments for Large Knowledge Bases: A Practical Perspective (Abstract). VLDB Semantic Modeling of Object Oriented Databases. Mokrane Bouzeghoub,Elisabeth Métais 1991 Semantic Modeling of Object Oriented Databases. VLDB Effects of Database Size on Rule System Performance: Five Case Studies. David A. Brant,Timothy Grose,Bernie J. Lofaso,Daniel P. Miranker 1991 Effects of Database Size on Rule System Performance: Five Case Studies. VLDB Interoperability In Multidatabases: Semantic and System Issues (Panel). Yuri Breitbart,Hector Garcia-Molina,Witold Litwin,Nick Roussopoulos,Hans-Jörg Schek,Gio Wiederhold 1991 Interoperability In Multidatabases: Semantic and System Issues (Panel). VLDB Deriving Production Rules for Incremental View Maintenance. Stefano Ceri,Jennifer Widom 1991 Deriving Production Rules for Incremental View Maintenance. VLDB Kaleidoscope Data Model for An English-like Query Language. Sang Kyun Cha,Gio Wiederhold 1991 Kaleidoscope Data Model for An English-like Query Language. VLDB Dynamic Constraints and Object Migration. Jianwen Su 1991 Dynamic Constraints and Object Migration. 
VLDB A Temporal Knowledge Representation Model OSAM*/T and Its Query Language OQL/T. Stanley Y. W. Su,Hsin-Hsing M. Chen 1991 A Temporal Knowledge Representation Model OSAM*/T and Its Query Language OQL/T. VLDB A Formalism for Extended Transaction Model. Panos K. Chrysanthis,Krithi Ramamritham 1991 A Formalism for Extended Transaction Model. VLDB Rule Management in Object Oriented Databases: A Uniform Approach. Oscar Díaz,Norman W. Paton,Peter M. D. Gray 1991 Rule Management in Object Oriented Databases: A Uniform Approach. VLDB Active Database Systems (Abstract). Umeshwar Dayal,Klaus R. Dittrich 1991 Active Database Systems (Abstract). VLDB A Transactional Model for Long-Running Activities. Umeshwar Dayal,Meichun Hsu,Rivka Ladin 1991 A Transactional Model for Long-Running Activities. VLDB A Methodology for the Design and Transformation of Conceptual Schemas. Christoph F. Eick 1991 A Methodology for the Design and Transformation of Conceptual Schemas. VLDB An Evaluation of Non-Equijoin Algorithms. David J. DeWitt,Jeffrey F. Naughton,Donovan A. Schneider 1991 An Evaluation of Non-Equijoin Algorithms. VLDB Conceptual Modeling Using an Extended E-R Model (Abstract). Ramez Elmasri 1991 Conceptual Modeling Using an Extended E-R Model (Abstract). VLDB Cooperative Access to Data and Knowledge Bases (Abstract). Robert Demolombe 1991 Cooperative Access to Data and Knowledge Bases (Abstract). VLDB The Power of Methods With Parallel Semantics. Karl Denninghoff,Victor Vianu 1991 The Power of Methods With Parallel Semantics. VLDB Predictive Load Control for Flexible Buffer Allocation. Christos Faloutsos,Raymond T. Ng,Timos K. Sellis 1991 Predictive Load Control for Flexible Buffer Allocation. VLDB Ode as an Active Database: Constraints and Triggers. Narain H. Gehani,H. V. Jagadish 1991 Ode as an Active Database: Constraints and Triggers. VLDB Optimizing Random Retrievals from CLV format Optical Disks. Daniel Alexander Ford,Stavros Christodoulakis 1991 Optimizing Random Retrievals from CLV format Optical Disks. VLDB Object Placement in Parallel Hypermedia Systems. Shahram Ghandeharizadeh,Luis Ramos,Zubair Asad,Waheed Qureshi 1991 Object Placement in Parallel Hypermedia Systems. VLDB Temporal Logic & Historical Databases. Dov M. Gabbay,Peter McBrien 1991 Temporal Logic & Historical Databases. VLDB Semantic Queries with Pictures: The VIMSYS Model. Amarnath Gupta,Terry E. Weymouth,Ramesh Jain 1991 Semantic Queries with Pictures: The VIMSYS Model. VLDB A Performance Evaluation of Multi-Level Transaction Management. Christof Hasse,Gerhard Weikum 1991 A Performance Evaluation of Multi-Level Transaction Management. VLDB Adaptive Load Control in Transaction Processing Systems. Hans-Ulrich Heiss,Roger Wagner 1991 Adaptive Load Control in Transaction Processing Systems. VLDB Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning. Kien A. Hua,Chiang Lee 1991 Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning. VLDB Experimental Evaluation of Real-Time Optimistic Concurrency Control Schemes. Jiandong Huang,John A. Stankovic,Krithi Ramamritham,Donald F. Towsley 1991 Owing to its potential for a high degree of parallelism, optimistic concurrency control is expected to perform better than two-phase locking when integrated with priority-driven CPU scheduling in real-time database systems. In this paper, we examine the overall effects and the impact of the overheads involved in implementing real-time optimistic concurrency control.
Using a locking mechanism to ensure the correctness of the implementation, we develop a set of optimistic concurrency control protocols which possess the properties of deadlock freedom and a high degree of parallelism. Through experiments, we investigate, in depth, the effect of the locking mechanism on the performance of optimistic concurrency control protocols. We show that due to blocking, the performance of the protocols is sensitive to priority inversions but not to resource utilization. Further, in contrast to recent simulation studies, our experimental results show that with respect to meeting transaction deadlines, the optimistic approach may not always outperform the two-phase locking scheme which aborts the lower priority transaction to resolve a conflict. We also show that integrated with a weighted priority scheduling algorithm, optimistic concurrency control exhibits greater flexibility in coping with the starvation problem (for longer transactions) than two-phase locking. Our performance studies indicate that the physical implementation has a significant impact on the performance of real-time concurrency control protocols and is hence an important aspect in the study of concurrency control. VLDB Language Constructs for Programming Active Databases. Richard Hull,Dean Jacobs 1991 Language Constructs for Programming Active Databases. VLDB Adaptive Locking Strategies in a Multi-node Data Sharing Environment. Ashok M. Joshi 1991 Adaptive Locking Strategies in a Multi-node Data Sharing Environment. VLDB "Database Technologies for the 90's and Beyond (Panel)." Magdi N. Kamel,Umeshwar Dayal,Rakesh Agrawal,Douglas Tolbert,Gilbert Vidal 1991 "Database Technologies for the 90's and Beyond (Panel)." VLDB Data and Knowledge Bases for Genome Mapping: What Lies Ahead? (Panel). Nabil Kamel,M. Delobel,Thomas G. Marr,Robert Robbins,Jean Thierry-Mieg,Akira Tsugita 1991 Data and Knowledge Bases for Genome Mapping: What Lies Ahead? (Panel). VLDB Solving Domain Mismatch and Schema Mismatch Problems with an Object-Oriented Database Programming Language. William Kent 1991 Solving Domain Mismatch and Schema Mismatch Problems with an Object-Oriented Database Programming Language. VLDB Extending the Search Strategy in a Query Optimizer. Rosana S. G. Lanzelotte,Patrick Valduriez 1991 Extending the Search Strategy in a Query Optimizer. VLDB Optimization of Multi-Way Join Queries for Parallel Execution. Hongjun Lu,Ming-Chien Shan,Kian-Lee Tan 1991 Optimization of Multi-Way Join Queries for Parallel Execution. VLDB Safe Referential Structures in Relational Databases. Victor M. Markowitz 1991 Safe Referential Structures in Relational Databases. VLDB Recovery and Coherency-Control Protocols for Fast Intersystem Page Transfer and Fine-Granularity Locking in a Shared Disks Transaction Environment. C. Mohan,Inderpal Narang 1991 Recovery and Coherency-Control Protocols for Fast Intersystem Page Transfer and Fine-Granularity Locking in a Shared Disks Transaction Environment. VLDB Integrity Constraints Checking In Deductive Databases. Antoni Olivé 1991 Integrity Constraints Checking In Deductive Databases. VLDB Performance Analysis of a Load Balancing Hash-Join Algorithm for a Shared Memory Multiprocessor. Edward Omiecinski 1991 Performance Analysis of a Load Balancing Hash-Join Algorithm for a Shared Memory Multiprocessor. VLDB Distributed Database Management: Current State-of-the-Art, Unsolved Problems, New Issues (Abstract). M.
Tamer Özsu 1991 Distributed Database Management: Current State-of-the-Art, Unsolved Problems, New Issues (Abstract). VLDB Fido: A Cache That Learns to Fetch. Mark Palmer,Stanley B. Zdonik 1991 "Accurately fetching data objects or pages in advance of their use is a powerful means of improving performance, but this capability has been difficult to realize. Current OODBs maintain object caches that employ fetch and replacement policies derived from those used for virtual-memory demand paging. These policies usually assume no knowledge of the future. Object cache managers often employ demand fetching combined with data clustering to effect prefetching, but cluster prefetching can be ineffective when the access patterns serviced are incompatible. This paper describes FIDO, an experimental predictive cache that predicts access for individuals during a session by employing an associative memory to assimilate regularities in the access pattern of an individual over time. By dint of continual training, the associative memory adapts to changes in the database and in the user's access pattern, enabling on-line access predictions for prefetching. We discuss two salient components of Fido: (1) MLP, a replacement policy for managing pre-fetched objects; and (2) Estimating Prophet, an associative memory that recognizes patterns in access sequences adaptively over time and provides on-line predictions used for prefetching. We then present some early simulation results which suggest that predictive caching works well, especially for sequential access patterns, and conclude that predictive caching holds great promise." VLDB A Functional Programming Approach to Deductive Databases. Alexandra Poulovassilis,Carol Small 1991 A Functional Programming Approach to Deductive Databases. VLDB Real-Time Databases (Panel). Krithi Ramamritham,Sang Hyuk Son,Alejandro P. Buchmann,Klaus R. Dittrich,C. Mohan 1991 Real-Time Databases (Panel). VLDB A Framework for Automating Physical Database Design. Steve Rozen,Dennis Shasha 1991 A Framework for Automating Physical Database Design. VLDB Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS. Ulf Schreier,Hamid Pirahesh,Rakesh Agrawal,C. Mohan 1991 Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS. VLDB Data Management for Large Rule Systems. Arie Segev,J. Leon Zhao 1991 Data Management for Large Rule Systems. VLDB Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. Amit P. Sheth 1991 Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. VLDB A Metadata Approach to Resolving Semantic Conflicts. Michael Siegel,Stuart E. Madnick 1991 A Metadata Approach to Resolving Semantic Conflicts. VLDB Integrating Implicit Answers with Object-Oriented Queries. Hava T. Siegelmann,B. R. Badrinath 1991 Integrating Implicit Answers with Object-Oriented Queries. VLDB Cooperative Database Design (Panel). Stefano Spaccapietra,Shamkant B. Navathe,Erich J. Neuhold,Amit P. Sheth 1991 Cooperative Database Design (Panel). VLDB Aggregation and Relevance in Deductive Databases. S. Sudarshan,Raghu Ramakrishnan 1991 Aggregation and Relevance in Deductive Databases. VLDB Using Write Protected Data Structures To Improve Software Fault Tolerance in Highly Available Database Management Systems.
Mark Sullivan,Michael Stonebraker 1991 Using Write Protected Data Structures To Improve Software Fault Tolerance in Highly Available Database Management Systems. VLDB A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. Christopher B. Walton,Alfred G. Dale,Roy M. Jenevein 1991 A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. VLDB Implementing Set-Oriented Production Rules as an Extension to Starburst. Jennifer Widom,Roberta Cochrane,Bruce G. Lindsay 1991 Implementing Set-Oriented Production Rules as an Extension to Starburst. VLDB Efficiency of Nested Relational Document Database Systems. Justin Zobel,James A. Thom,Ron Sacks-Davis 1991 Efficiency of Nested Relational Document Database Systems. SIGMOD Record Temporal Relations in Geographic Information Systems: A Workshop at the University of Maine. Renato Barrera,Andrew U. Frank,Khaled K. Al-Taha 1991 "A workshop on temporal relations in Geographic Information Systems (GIS) was held on October 12-13, 1990, at the University of Maine. The meeting, sponsored by the National Center for Geographic Information and Analysis (NCGIA), gathered specialists on Geography, GIS, and Computer Science to discuss users' requirements of temporal GIS and to identify the corresponding research issues." SIGMOD Record A Complete Identity Set for Codd Algebras. H. W. Buff 1991 A Complete Identity Set for Codd Algebras. SIGMOD Record Data Manipulation in Heterogeneous Databases. Abhirup Chatterjee,Arie Segev 1991 Many important information systems applications require access to data stored in multiple heterogeneous databases. This paper examines a problem in interdatabase data manipulation within a heterogeneous environment, where conventional techniques are no longer useful. To solve the problem, a broader definition for the join operator is proposed. Also, a method to probabilistically estimate the accuracy of the join is discussed. SIGMOD Record The Indiana Center for Database Systems. Judith Copler 1991 The Indiana Center for Database Systems. SIGMOD Record Minimal Covers Revisited: Correct and Efficient Algorithms. Jim Diederich 1991 In [1] Nummenmaa and Tanisch show that the algorithm in [2] for computing minimal covers is incorrect even though it purports to correct the algorithms in [3-6]. As they illustrate with F = {AB → C}, the algorithm in [2] allows B to be eliminated as an extraneous attribute since the dependency AB → C is implied by A → C using augmentation. Thus F is replaced by F′ = {A → C}, which is clearly not equivalent to F. The problematic step of the algorithm in [2] that allows this to occur is: Consider each dependency X → A in some order. If Z is a subset of X such that F is contained in the closure of (F - {X → A}) ∪ {Z → A}, then immediately replace X → A by Z → A in F. This step continues until no left side of any dependency in F can be reduced. SIGMOD Record Interoperability and Object Identity. Frank Eliassen,Randi Karlsen 1991 Data model transparency can be achieved by providing a canonical language format for the definition and seamless manipulation of multiple autonomous information bases. In this paper we assume a canonical data and computational model combining the functional and object-oriented paradigms. We investigate the concept of identity as a property of an object and the various ways this property is supported in existing databases, in relation to the object-oriented canonical data model.
The canonical data model is the tool for combining and integrating preexisting syntactically homogeneous, but semantically heterogeneous data types into generalized unifying data types. We identify requirements for object identity in federated systems, and discuss problems of object identity and semantic object replication arising from this new abstraction level. We argue that a strong notion of identity at the federated level can only be achieved by weakening strict autonomy requirements of the component information bases. Finally we discuss various solutions to this problem that differ in their requirements with respect to giving up autonomy. SIGMOD Record Technique for Universal Quantification in SQL. Claudio Fratarcangeli 1991 Universal quantification is expressed in ANSI SQL with negated existential quantification because there is no direct support for universal quantification in ANSI SQL. However, the lack of explicit support for universal quantification diminishes the user-friendliness of the language because some queries are expressed more naturally using universal quantification than they are using negated existential quantification. It is the intent of this paper to describe a technique to facilitate the construction of universal quantification queries in ANSI SQL. The technique is based upon a proposed extension to ANSI SQL to incorporate explicit general support for universal quantification. SIGMOD Record The Metadatabase Project at Rensselaer. Cheng Hsu 1991 "The Metadatabase project is a multi-year research effort at Rensselaer Polytechnic Institute. Sponsored by industry (ALCOA, DEC, GE, GM, IBM and others) through Rensselaer's Computer Integrated Manufacturing Program, this project seeks to develop novel concepts, methods and techniques for achieving information integration across major functional systems pertaining to computerized manufacturing enterprises. Thus, the metadatabase model encompasses the generic tasks of heterogeneous, distributed, and autonomous database administration, but also includes information resources management and integration of concurrent (functional) systems. The model entails (1) an integrated data and knowledge modeling and representation method; (2) an online kernel (the metadatabase) for information modeling and management; (3) metadatabase assisted global query formulation and processing; (4) a concurrent architecture whereby global synergies are achieved through (distributed) metadata management rather than synchronization of (distributed) database processing; and (5) a theory of information requirements for integration. A metadatabase prototype was recently demonstrated to the industrial sponsors. The basic concept of the metadatabase model is discussed in this paper." SIGMOD Record Semantic vs. Structural Resemblance of Classes. Peter Fankhauser,Martin Kracker,Erich J. Neuhold 1991 We present an approach to determine the similarity of classes which utilizes fuzzy and incomplete terminological knowledge together with schema knowledge. We clearly distinguish between semantic similarity determining the degree of resemblance according to real world semantics, and structural correspondence explaining how classes can actually be interrelated. To compute the semantic similarity we introduce the notion of semantic relevance and apply fuzzy set theory to reason about both terminological knowledge and schema knowledge. SIGMOD Record On the Semantic Equivalence of Heterogeneous Representations in Multimodel Multidatabase Systems.
Dipayan Gangopadhyay,Thierry Barsalou 1991 On the Semantic Equivalence of Heterogeneous Representations in Multimodel Multidatabase Systems. SIGMOD Record The Design of the Triton Nested Relational Database System. Tina M. Harvey,Craig W. Schnepf,Mark A. Roth 1991 Unique database requirements of applications such as computer-aided design (CAD), computer-aided software engineering (CASE), and office information systems (OIS) have driven the development of new data models and database systems based on these new models. In particular, the goal of these new database systems is to exploit the advantages of complex data models that are more efficient (in terms of time and space) than their relational counterparts. In this paper, we describe the design and implementation of the Triton nested relational database system, a prototype system based on the nested relational data model. Triton is intended to be used as the backend storage and access component of the aforementioned applications. This paper describes the architecture of the Triton system, and compares the performance of the nested relational model versus the relational model using Triton. In addition, this paper evaluates the EXODUS extensible database toolkit used in the development of the Triton system including key features of the persistent programming language E and the EXODUS storage manager. SIGMOD Record Structure and Semantics in OODB Class Specifications. James Geller,Yehoshua Perl,Erich J. Neuhold 1991 A class specification contains both structural aspects and semantic aspects. We introduce a mathematically based distinction between structural and semantic aspects. We show how this distinction is used to identify all structural aspects of a class specification to be included in the object type of a class. The model obtained is called the Dual Model due to the separation of structure and semantics in the class specification. Advantages of the separation of structure and semantics have been discussed in previous papers and include separate hierarchies for structural and semantic aspects, refined inheritance mechanisms, support of physical database design and structural integration which is impossible in other models. SIGMOD Record Handling Missing Data by Using Stored Truth Values. G. H. Gessert 1991 This paper proposes a method for handling inapplicable and unknown missing data. The method is based on: (1) storing default values (instead of null values) in place of missing data, (2) storing truth values that describe the logical status of the default values in corresponding fields of corresponding tables. Four valued logic is used so that the logical status of the default data values can be described as, not just true or false, but also as inapplicable or unknown. This method, in contrast to the “hidden byte” approach, has two important advantages: (1) Because the logical status of all data is represented explicitly in tables, all 4-valued operations can be handled via a 2-valued data manipulation language, such as SQL. Language extensions for handling missing data (e.g., “IS NULL”) are not necessary. (2) Because data fields always contain a default value (as opposed to a null value or mark), it is possible to do arithmetic across missing data and to interpret the logical status of the result by means of logical operations on the corresponding stored truth values. SIGMOD Record Database Research at the IBM Almaden Research Center. Laura M. Haas,Patricia G. Selinger 1991 Database Research at the IBM Almaden Research Center.
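The stored-truth-values scheme in the Gessert entry above pairs each defaulted field with a truth value recording its logical status, so that ordinary arithmetic can run over the defaults and the truth values are consulted afterwards to interpret the result. The small Python sketch below illustrates only that general idea; the three truth codes, the employees rows, and the helper total_known_salary are assumptions made for illustration, and the paper's full four-valued logic is collapsed here to a much simpler rule.

# Hedged sketch of the stored-truth-values idea: each nullable column
# carries a default value plus a companion truth value describing its
# logical status. Codes and layout are illustrative assumptions only.
TRUE, UNKNOWN, INAPPLICABLE = "t", "unk", "na"

employees = [
    # (name, salary, salary_tv) -- salary_tv is the stored truth value
    ("Ada",   5200, TRUE),
    ("Ben",      0, UNKNOWN),        # salary not known; 0 is a stored default
    ("Cleo",     0, INAPPLICABLE),   # e.g. an unpaid volunteer
]

def total_known_salary(rows):
    # Arithmetic runs over the stored default values; the truth values
    # are then consulted to interpret the logical status of the result.
    total = sum(salary for _, salary, _ in rows)
    status = TRUE if all(tv == TRUE for _, _, tv in rows) else UNKNOWN
    return total, status

print(total_known_salary(employees))   # -> (5200, 'unk')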
SIGMOD Record An Anatomy of the Information Resource Semantic Abstraction. Leonid A. Kalinichenko 1991 Semantic abstraction mapping establishing a correspondence between an information resource and an application is considered to be a basic notion providing for the study of semantic interoperability of heterogeneous information resources. The structure and necessary properties of a resource class application abstraction are considered. Intensional model-based properties of the abstraction mapping are introduced as provable conditions of a class assertion abstraction and an operation concretization. SIGMOD Record Continuous Media Data Management. Kyoji Kawagoe 1991 Continuous Media Data Management. SIGMOD Record The Breakdown of the Information Model in Multi-Database Systems. William Kent 1991 The Breakdown of the Information Model in Multi-Database Systems. SIGMOD Record First Order Normal Form for Relational Databases and Multidatabases. Witold Litwin,Mohammad A. Ketabchi,Ravi Krishnamurthy 1991 First Order Normal Form for Relational Databases and Multidatabases. SIGMOD Record Research Directions in Knowledge Discovery. Ravi Krishnamurthy,Tomasz Imielinski 1991 Research Directions in Knowledge Discovery. SIGMOD Record Bibliography on Temporal Databases. Michael D. Soo 1991 Bibliography on Temporal Databases. SIGMOD Record The Third-Generation/OODBMS Manifesto, Commercial version. Frank Manola 1991 The Third-Generation/OODBMS Manifesto, Commercial version. SIGMOD Record Database Research at HP Labs. Marie-Anne Neimat,Ming-Chien Shan 1991 Database Research at HP Labs. SIGMOD Record A Functional Model for Macro-Databases. Maurizio Rafanelli,Fabrizio L. Ricci 1991 Recently there have been numerous proposals aimed at correcting the deficiency in existing database models to manipulate macro data (such as summary tables). The authors propose a new functional model, Mefisto, based on the definition of a new data structure, the “statistical entity”, and on a set of operations capable of manipulating this data structure by operating at metadata level. SIGMOD Record Research Areas Related to Practical Problems in Automated Database Design. David S. Reiner 1991 Research Areas Related to Practical Problems in Automated Database Design. SIGMOD Record Research at Altaïr. Philippe Richard 1991 Research at Altaïr. SIGMOD Record Some Further Analysis of the Essential Blocking Recurrence. John T. Robinson 1991 In previous work [1], a random graph model of concurrency control was developed, in which there are n concurrent transactions (represented as vertices of a graph), and where for each pair of transactions (vertices), a conflict (edge between the vertices) initially occurs independently with probability p. Given such a graph, a scheduling function selects a subset of transactions to complete based on the conflicts and possibly on an ordering of the transactions. At the end of each unit of time, the transactions selected by the scheduling function to complete are removed from the graph, and are replaced by new transactions from a transaction sequence. For each pair of transactions in the new graph, if the transactions were both in the previous graph then a conflict occurs between them if and only if there was a conflict in the previous graph; otherwise one or both of the transactions are new and a conflict occurs independently with probability p. SIGMOD Record Modern Client-Server DBMS Architectures. Nick Roussopoulos,Alex Delis 1991 In this paper, we describe three Client-Server DBMS architectures.
We discuss their functional components and provide an overview of their performance characteristics. SIGMOD Record Suitability of Data Models as Canonical Models for Federated Databases. Fèlix Saltor,Malú Castellanos,Manuel García-Solaco 1991 Suitability of Data Models as Canonical Models for Federated Databases. SIGMOD Record Spatial Database Access Methods. Betty Salzberg,David B. Lomet 1991 In the discussion of research issues in spatial databases (SIGMOD Record vol. 19, no. 4, Dec 1990) we stated the need for a robust framework for analytical comparison of a broad range of spatial access methods. The utility of such a comparison, even of very closely related access methods, was shown in [FALO87]. A necessary precondition for a meaningful analytical comparison is the existence of strong analytical results for individual access methods. In the following paper, Salzberg and Lomet take the worst case analytical results on fan-out and average storage utilization they obtained for their hB-tree [LOME89,LOME90] and extend the analysis to another robust method, Z-order encoding [OREN84]. We think this paper is a start on the comparative assessment of access methods based on analytical results. We hope to see future work extend the framework beyond worst case analysis, and to other access methods as well. SIGMOD Record Conflicts and Correspondence Assertions in Interoperable Databases. Stefano Spaccapietra,Christine Parent 1991 Conflicts and Correspondence Assertions in Interoperable Databases. SIGMOD Record Semantic Issues in Multidatabase Systems - Preface by the Special Issue Editor. Amit P. Sheth 1991 Semantic Issues in Multidatabase Systems - Preface by the Special Issue Editor. SIGMOD Record Context Interchange: Sharing the Meaning of Data. Michael Siegel,Stuart E. Madnick 1991 Context Interchange: Sharing the Meaning of Data. SIGMOD Record An Outline of MQL. Victor J. Streeter 1991 An Outline of MQL. SIGMOD Record A Note on the Strategy Space of Multiway Join Query Optimization Problem in Parallel Systems. Kian-Lee Tan,Hongjun Lu 1991 In this short note, we estimate the search space of optimizing multiway join queries in multiprocessor computer systems, i.e. the number of possible query execution plans that need to be considered. SIGMOD Record A Modular and Open Object-Oriented Database System. Satish M. Thatte 1991 "On 30 August 1990, Texas Instruments Incorporated, Dallas, TX was awarded a three year contract (Contract No. DAAB07-90-C-B920) to develop a modular and open object-oriented database system. The contract is funded by DARPA/ISTO and is being managed by the U.S. Army, CECOM, Fort Monmouth, N.J. The contract is being executed at TI's Information Technologies Laboratory, Computer Science Center, Dallas, Texas. So far, we have received an outstanding response from interested parties (database research community, OODB application developers, OODB builders) to our contract award announcement. This communication is a collection of most commonly asked questions and answers to them." SIGMOD Record Practitioner Problems in Need of Database Research. Gomer Thomas 1991 The bottlenecks between research and product development are well known. It typically takes a very long time for ideas coming out of research labs to make their way into products, and developers often face practical problems which do not seem to be addressed by currently available research results. The term “technology transfer” is often used to describe the process of overcoming these bottlenecks. 
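The Tan and Lu note above estimates the size of the strategy space for multiway join optimization in parallel systems. As a rough companion to that discussion, the standard combinatorial counts of join orderings can be computed directly: n! left-deep trees and (2n-2)!/(n-1)! bushy trees over n relations. These are textbook formulas for the unpruned plan space, not necessarily the exact counting used in the note, as the short Python sketch below illustrates.

# Rough combinatorial estimates of the multiway-join plan space
# (standard formulas, not necessarily the note's exact counting):
# n! left-deep join trees and (2n-2)!/(n-1)! bushy join trees.
from math import factorial

def left_deep_plans(n: int) -> int:
    return factorial(n)

def bushy_plans(n: int) -> int:
    return factorial(2 * n - 2) // factorial(n - 1)

for n in (3, 5, 10):
    print(n, left_deep_plans(n), bushy_plans(n))
# 3 6 12
# 5 120 1680
# 10 3628800 17643225600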
SIGMOD Record Centralized Concurrency Control Methods for High-End TP. Alexander Thomasian 1991 Centralized Concurrency Control Methods for High-End TP. SIGMOD Record Some Recent Developments in Deductive Databases. Shalom Tsur 1991 Some Recent Developments in Deductive Databases. SIGMOD Record Resolving Semantic Heterogeneity Through the Explicit Representation of Data Model Semantics. Susan Darling Urban,Jian Wu 1991 Resolving Semantic Heterogeneity Through the Explicit Representation of Data Model Semantics. SIGMOD Record Semantic Heterogeneity as a Result of Domain Evolution. Vincent Ventrone,Sandra Heiler 1991 Semantic Heterogeneity as a Result of Domain Evolution. SIGMOD Record Bibliography on Object-Oriented Database Management. Gottfried Vossen 1991 Bibliography on Object-Oriented Database Management. SIGMOD Record Data/Knowledge Packets as a Means of Supporting Semantic Heterogeneity in Multidatabase Systems. Doyle Weishar,Larry Kerschberg 1991 Semantic heterogeneity in heterogeneous autonomous databases poses problems in instance matches, units conversion (value interpretation), contextual and structural mismatches, etc. In this work we examine some of the research issues in semantic heterogeneity and propose a novel architecture for resolving such problems. The approach involves the use of Artificial Intelligence tools and techniques to construct “domain models,” that is, data and knowledge representations of the constituent databases and an overall domain model of the semantic interactions among the databases. These domain models are represented as knowledge sources (KSs) in a blackboard architecture. This architecture lends itself to an opportunistic approach to query processing and goal-directed problem solving. We introduce the notion of Data/Knowledge Packets as a means of supporting both operational and structural semantic heterogeneity. SIGMOD Record Funding for Small US Businesses and from DARPA and NASA. Marianne Winslett 1991 This column is the first of a regular series describing database funding programs in the United States and abroad. I plan to include profiles of major funding agencies and programs, major efforts underway, new calls for proposals, and the outcomes of funding initiatives. I welcome submissions of relevant material; they can be sent to winslett@cs.uiuc.edu. In this issue, I describe funding in the database area for innovative research in small US businesses, and new requests for proposals from DARPA and from NASA. SIGMOD Record User Surveys on Database Research Needs in Finland. Antoni Wolski 1991 User Surveys on Database Research Needs in Finland. SIGMOD Record Semantic Heterogeneity in Distributed Geographic Databases. Michael F. Worboys,S. Misbah Deen 1991 This paper considers the special problems of semantic heterogeneity in a distributed system of databases containing spatially referenced information. Two forms of semantic heterogeneity are defined. Generic semantic heterogeneity arises when nodes are using different generic conceptual models of the spatial information. Contextual semantic heterogeneity is caused by the particular local environmental conditions at nodes. It is contextual heterogeneity which is of special concern for geographic databases and to which the paper devotes most attention. Two possible solutions are proposed, one founded on transforming processors between models and a second using a canonical model which is a generalization of existing generic spatial models.
SIGMOD Record Determining Relationships among Names in Heterogeneous Databases. Clement T. Yu,Biao Jia,Wei Sun,Son Dao 1991 Determining Relationships among Names in Heterogeneous Databases. SIGMOD Record A More General Model For Handling Missing Information In Relational Databases Using A 3-Valued Logic. Kwok-bun Yue 1991 Codd proposed the use of two interpretations of nulls to handle missing information in relational databases that may lead to a 4-valued logic [Codd86, Codd87]. In a more general model, three interpretations of nulls are necessary [Roth, Zani]. Without simplification, this may lead to a 7-valued logic, which is too complicated to be adopted in relational databases. For such a model, there is no satisfactory simplification to a 4-valued logic. However, by making a straightforward simplification and using some proposed logical functions, a 3-valued logic can handle all three interpretations. ICDE Knowledge Mining by Imprecise Querying: A Classification-Based Approach. Tarek M. Anwar,Howard W. Beck,Shamkant B. Navathe 1992 Knowledge mining is the process of discovering knowledge that is hitherto unknown. An approach to knowledge mining by imprecise querying that utilizes conceptual clustering techniques is presented. The query processor has both a deductive and an inductive component. The deductive component finds precise matches in the traditional sense, and the inductive component identifies ways in which imprecise matches may be considered similar. Ranking on similarity is done by using the database taxonomy, by which similar instances become members of the same class. Relative similarity is determined by depth in the taxonomy. The conceptual clustering algorithm, its use in query processing, and an example are presented ICDE Fast Read-Only Transactions in Replicated Databases. P. C. Aristides,Amr El Abbadi 1992 The authors present a propagation mechanism, called the commit propagation mechanism (CPM), which increases the availability of data for read-only transactions. The proposed mechanism is piggy-backed on the messages used in the two-phase commit protocol. The CPM was combined with the standard quorum protocol in two different replicated database systems. In a fully replicated database, CPM allows any read-only transaction to execute locally at a single site without the need for any communication overhead. In a partially replicated database, CPM either ensures that the set of copies residing at a site are mutually consistent, or indicates which copies violate such consistency ICDE M(DM): An Open Framework for Interoperation of Multimodel Multidatabase Systems. Thierry Barsalou,Dipayan Gangopadhyay 1992 "The authors present M(DM), an extensible metalevel system in which the syntax and the semantics of data models, schemas, and databases can be uniformly represented. M(DM) consists primarily of a set of metatypes that capture and express data-model constructs in second-order logic: a data model is represented as a collection of M(DM) metatypes. To achieve extensibility, M(DM)'s metatypes are organized into an inheritance lattice. The robustness and openness of the approach are demonstrated by expressing a variety of data models in M(DM), and the authors show how to exploit M(DM)'s metalevel capabilities for hiding representational heterogeneities in multimodel multidatabase systems" ICDE On Mixing Queries and Transactions via Multiversion Locking. Paul M. Bober,Michael J.
Carey 1992 The authors discuss a novel approach to multiversion concurrency control that allows high-performance transaction systems to support long-running queries. The approach extends the multiversion locking algorithm developed by Computer Corporation of America by using record-level versioning and reserving a portion of each data page for caching prior versions that are potentially needed for the serializable execution of queries; on-page caching also enables an efficient approach to garbage collection of old versions. In addition, view sharing is introduced, which has the potential for reducing the cost of versioning by grouping together queries to run against the same transaction-consistent view of the database. Results from a simulation study that indicate that the approach is a viable alternative to level-one and level-two consistency locking when the portion of each data reserved for prior versions is chosen appropriately are presented ICDE Transactions in Distributed Shared Memory Systems. Peter Bodorik,F. I. Smith,D. J-Lewis 1992 The authors propose a distributed shared memory model based on a paged segmented two-level address space and an extended set of memory operations. In addition to the traditional read and write operations, the memory model includes operations which support mapping between local and global address spaces and mapping of processes to transactions. An architecture and associated algorithm are outlined for a virtual memory management unit to provide concurrency control for transactions. Although the traditional concept of the transaction is assumed, only the aspects of concurrency control and coherence are addressed ICDE An Exploratory Study of Ad Hoc Query Languages to Databases. John E. Bell,Lawrence A. Rowe 1992 The authors describe an exploratory study performed to compare three different interface styles for ad hoc query to a database. Subjects with wide-ranging computer experience performed queries of varying difficulty using either an artificial, a graphical, or a natural language interface. All three interfaces were commercial products. The study revealed strengths and weaknesses of each interface and showed that interaction with the natural language interface was qualitatively different than interaction with either the graphical or artificial language systems ICDE Title, Message from the General Chairperson, Message from the Program Chairperson, Committees, Referees, Table of Contents, Author Index. 1992 Title, Message from the General Chairperson, Message from the Program Chairperson, Committees, Referees, Table of Contents, Author Index. ICDE On Interoperability for KBMS Applications - The Horizontal Integration Task. Wolfgang Benn,Christian Kortenbreer,Gunter Schlageter,Xinglin Wu 1992 On Interoperability for KBMS Applications - The Horizontal Integration Task. ICDE An Efficient Database Storage Structure for Large Dynamic Objects. Alexandros Biliris 1992 The author presents storage structures and algorithms for the efficient manipulation of general-purpose large unstructured objects in a database system. The large object is stored in a sequence of variable-size segments, each of which consists of a large number of physically contiguous disk blocks. A tree structure indexes byte positions within the object. Disk space management is based on the binary buddy system. 
The scheme supports operations that replace, insert, delete bytes at arbitrary positions within the object, and append bytes at the end of the object ICDE Data Hiding and Security in Object-Oriented Databases. Elisa Bertino 1992 Data Hiding and Security in Object-Oriented Databases. ICDE Performance Comparisons of Distributed Deadlock Detection Algorithms. Omran A. Bukhres 1992 Performance Comparisons of Distributed Deadlock Detection Algorithms. ICDE A Model for Optimizing Deductive and Object-Oriented DB Requests. Jean-Pierre Cheiney,Rosana S. G. Lanzelotte 1992 A Model for Optimizing Deductive and Object-Oriented DB Requests. ICDE A Declarative Approach to Active Databases. Stefano Ceri 1992 A Declarative Approach to Active Databases. ICDE Chain-Split Evaluation in Deductive Databases. Jiawei Han 1992 Many popularly studied recursions in deductive databases can be compiled into one or a set of highly regular chain generating paths, each of which consists of one or a set of connected predicates. Previous studies on chain-based query evaluation in deductive databases take a chain generating path as an inseparable unit in the evaluation. However, some recursions, especially many functional recursions whose compiled chain consists of infinitely evaluable function(s), should be evaluated by chain-split evaluation, which splits a chain generating path into two portions in the evaluation: an immediately evaluable portion and a delayed-evaluation portion. In this paper, the necessity of chain-split evaluation is examined from the points of view of both efficiency and finite evaluation, and three chain-split evaluation techniques: magic sets, buffered evaluation, and partial evaluation are developed. Our study shows that chain-split evaluation is a primitive recursive query evaluation technique for different kinds of recursions, and it can be implemented efficiently in deductive databases by extensions to the existing recursive query evaluation methods. ICDE Scheduling and Processor Allocation for Parallel Execution of Multi-Join Queries. Ming-Syan Chen,Philip S. Yu,Kun-Lung Wu 1992 Scheduling and Processor Allocation for Parallel Execution of Multi-Join Queries. ICDE History-less Checking of Dynamic Integrity Constraints. Jan Chomicki 1992 An efficient implementation method is described for dynamic integrity constraints formulated in past temporal logic. Although the constraints can refer to past states of the database, their checking does not require that the entire database history be stored. Instead, every database state is extended with auxiliary relations that contain the historical information necessary for checking constraints. Auxiliary relations can be implemented as materialized relational views. The author analyzes the computational cost of the method and outlines how it can be implemented by using existing database technology. Related work on dynamic integrity constraints is surveyed ICDE Object Allocation in Distributed Systems with Virtual Replication. Wesley W. Chu,Berthier A. Ribeiro-Neto,Patrick H. Ngai 1992 Object Allocation in Distributed Systems with Virtual Replication. ICDE A Spanning Tree Transitive Closure Algorithm. Shaul Dar,H. V. Jagadish 1992 The authors present a transitive closure algorithm that maintains a spanning tree of successors for each node rather than a simple successor list. This spanning tree structure promotes sharing of information across multiple nodes and leads to more efficient algorithms. 
An effective relational implementation of the spanning tree storage structure is suggested, and it is shown how blocking can be applied to reduce the input/output cost of the algorithm. The algorithm can handle path problems also. Analytical and experimental evidence is presented that demonstrates the utility of the algorithm, especially in a graph with many alternate paths between the nodes. The spanning tree storage structure can be compressed and updated incrementally in response to changes in the underlying graph ICDE An Object-Oriented Model for Capturing Data Semantics. G. Decorte,A. Eiger,D. Kroenke,T. Kyte 1992 An Object-Oriented Model for Capturing Data Semantics. ICDE The Design and Implementation of a Parallel Join Algorithm for Nested Relations on Shared-Memory Multiprocessors. Vinay Deshpande,Per-Åke Larson 1992 The Design and Implementation of a Parallel Join Algorithm for Nested Relations on Shared-Memory Multiprocessors. ICDE Partitioning of Time Index for Optical Disks. Ramez Elmasri,Muhammad Jaseemuddin,Vram Kouramajian 1992 The authors present a storage model for temporal databases that accommodates large amounts of temporal data. The model supports efficient search for object versions based on temporal conditions, using a time index. They define an access structure, the monotonic B+ -tree, that is suitable for implementing a time index for append-only temporal databases. The storage model uses a combination of magnetic disks and write-once optical disks to keep current, past, and even future states of a database online and readily accessible. It provides an automatic archiving of both object versions and time index blocks to optical disks ICDE A Time-based Distributed Optimistic Recovery and Concurrency Control Mechanism. Anat Gafni,K. V. Bapa Rao 1992 The authors describe a time-based approach to distributed concurrency control and recovery that alleviates the high cost of optimistic methods by combining the solutions to concurrency control, recovery management, and localized control into a single flexible yet powerful and efficient mechanism. The approach adapts the object-oriented Timewarp mechanism to handle competing processes rather than the cooperating processes for which it was originally intended. The result is a completely decentralized, nonblocking concurrency and recovery protocol that supports more general features of other desirable aspects of distributed applications, such as versioning and active objects ICDE How to Extend a Conventional Optimizer to Handle One- and Two-Sided Outerjoin. César A. Galindo-Legaria,Arnon Rosenthal 1992 The authors provide a nearly complete theory for reordering join/outerjoin queries. The theory is used to describe modular extensions that strengthen a conventional optimizer to handle nearly all select/project/join/outerjoin queries. Unlike previous work, these results are not limited to queries possessing a nice structure, or queries that are nicely represented in relational calculus. The theoretical results concern query simplification and reassociation using a generalized outerjoin ICDE ESQL2: An Object-Oriented SQL with F-Logic Semantics. Georges Gardarin,Patrick Valduriez 1992 ESQL2: An Object-Oriented SQL with F-Logic Semantics. ICDE Quorum-oriented Multicast Protocols for Data Replication. Richard A. Golding,Darrell D. E. Long 1992 Many wide-area distributed applications use replicated data to improve the availability of the data, and to improve access latency by locating copies of the data near to their use. 
This paper presents a new family of communication protocols, called *quorum multicasts*, that provide efficient communication services for widely replicated data. Quorum multicasts are similar to ordinary multicasts, which deliver a message to a set of destinations. The new protocols extend this model by allowing delivery to a subset of the destinations, selected according to distance or expected data currency. These protocols provide well-defined failure semantics, and can distinguish between communication failure and replica failure with high probability. We have evaluated their performance, which required taking several traces of the Internet to determine distributions for communication latency and failure. A simulation study of quorum multicasts, based on these measurements, shows that these protocols provide low latency and require few messages. A second study that measured a test application running at several sites confirmed these results. ICDE Research Directions in Image Database Management (Panel). William I. Grosky,Rajiv Mehrotra 1992 Research Directions in Image Database Management (Panel). ICDE An Abstraction Mechanism for Modeling Generation. Ranabir Gupta,Gary Hall 1992 An abstraction mechanism for modeling the generation of entities from other entities is presented. It is shown to be useful for representing a wide range of generative processes, including those which are reversible, spontaneous, and involve multiple inputs and outputs. The mechanism is related to an earlier defined abstraction mechanism which represents the transitional behavior of existing entities. The model of generation is described in terms of an operational formalism with well-understood properties ICDE A Run-Time Execution Model for Referential Integrity Maintenance. Bruce M. Horowitz 1992 The author explores anomalous situations which can arise during the processing of database operations on schemas which include referential integrity constraints. A more declarative (as opposed to operational) approach to referential integrity is proposed. He reviews some of the contributions which ANSI Technical Committee X3H2 on Database has made in this area, and which have been reflected in the referential integrity specifications of the forthcoming SQL2 and SQL3 standards. A proposal that replaced the recent compile-time paradigm with a run-time paradigm is included. A correctness proof for this paradigm is provided ICDE Distributed Rule Processing in Active Databases. Ing-Miin Hsu,Mukesh Singhal,Ming T. Liu 1992 Processing rules in a distributed active database involves three design issues: how to decompose rules, how to distribute rules to sites, and how to evaluate distributed rules correctly. The authors study these three issues for complicated rules, which are complex and time-consuming to evaluate. They propose a relational operator, AND, and the associated algebraic manipulations of this operator to find independent parts of a rule query, which can be distributed among sites. Due to geographical dispersion in a distributed system, correct evaluation of distributed rules is not trivial. A distributed evaluation algorithm is preferred, which guarantees the correctness of the evaluation result of the distributed rule by collecting consistent local results from sites to form a global view ICDE Semantically Consistent Schedules for Efficient and Concurrent B-Tree Restructuring. Ragaa Ishak 1992 A concurrent B-tree algorithm can achieve more parallelism than a standard concurrency control method. 
The author presents a semantically based method for B-tree restructuring which allows efficient and concurrent traversals and fetches. The concurrent operations compare favorably with earlier solutions because they avoid wasted input/output (I/O). In addition, the concurrent B-tree algorithms considerably reduce the need to repeatedly traverse the tree in order to recover from the effect of in-progress restructuring. The method increases the performance of a high-volume database management system ICDE Imprecise and Uncertain Information in Databases: An Evidential Approach. Suk Kyoon Lee 1992 A novel approach for representing imprecise and uncertain data and evaluating queries in the framework of an extended relational database model based on the Dempster-Shafer theory of evidence is proposed. Because of the ability to combine evidences from different sources, the semantics of the update operation of imprecise or uncertain data is reconsidered. By including an undefined value in a domain, three different cases of a null value are presented: unknown, inapplicable, and unknown or inapplicable. In this model, two levels of uncertainty in the database are supported: one is for the attribute value level and the other is for the tuple level ICDE Using Coding to Support Data Resiliency in Distributed Systems. Pankaj Jalote,Gagan Agrawal 1992 A scheme for maintaining replicated files is suggested. The authors describe how the coding scheme suggested by M.O. Rabin (1987, 1989) can be used to store replicated data and how the voting algorithm and the quorum requirements change to manage this replication. It is shown that the disk storage space required to achieve a given availability is significantly lower than that for the conventional scheme with full file replication. Since coding is used, this scheme also provides a high degree of data security ICDE Mapping a Version Model to a Complex-Object Data Model. Wolfgang Käfer,Harald Schöning 1992 The authors present a version model for CAD purposes and its implementation on the basis of a complex-object database management system. The functionality of the model is illustrated with the help of a VLSI design example. In contrast to similar solutions based on the relational data model, this approach allows for a simple and efficient implementation of the version model, allowing for powerful retrieval operations. Sharing of data, which occurs necessarily among versions, is system controlled. This prohibits redundant storage of data. It is concluded that implementing a complex-object database system supporting versions is not more complicated than implementing a complex-object database system without version support ICDE Temporal Specialization. Christian S. Jensen,Richard T. Snodgrass 1992 The authors explore a variety of temporal relations with specialized relationships between transaction and valid time. An example is a retroactive temporal event relation, where the event must have occurred before it was stored, i.e., the valid time-stamp is restricted to be less than the transaction time-stamp. The authors discuss many useful restrictions, defining a large number of specialized types of temporal relations, and indicate some of their applications. A detailed taxonomy of specialized temporal relations is presented. This taxonomy may be used during database design to specify the particular time semantics of temporal relations ICDE I/O-Efficiency of Shortest Path Algorithms: An Analysis. 
Bin Jiang 1992 To establish the behavior of algorithms in a paging environment, the author analyzes the input/output (I/O) efficiency of several representative shortest path algorithms. These algorithms include single-source, multisource, and all pairs ones. The results are also applicable for other path problems such as longest paths, most reliable paths, and bill of materials. The author introduces the notation and a model of a paging environment. The I/O efficiencies of the selected single-source, all pairs, and multisource algorithms are analyzed and discussed ICDE Exploring Semantics in Aggregation Hierarchies for Object-Oriented Databases. Ling Liu 1992 An extended object-oriented model for exploring semantics in aggregation hierarchies is presented. The extension is mainly based on a general distinction between aggregation references and association references and a support for type inheritance in both specialization and aggregation abstractions. The author formally describes notions of aggregation reference and aggregation hierarchy and introduces the concept of aggregation inheritance as a type composition mechanism for sharing specifications among types. The similarities and differences between aggregation inheritance and subtype inheritance are analyzed. It is shown that a combination of these two types of inheritance provides a powerful mechanism for abstract implementation of behavior and for enhancing the extensibility of the object model ICDE Parallel GRACE Hash Join on Shared-Everything Multiprocessor: Implementation and Performance Evaluation on Symmetry S81. Masaru Kitsuregawa,Shin-ichiro Tsudaka,Miyuki Nakano 1992 The authors implemented a parallel hash join algorithm on a Symmetry S81 shared-everything multiprocessor environment and evaluated the performance. They evaluated the input/output (I/O) performance on a multiple-disk environment, and showed linear performance increase of up to eight disks. The performance of the implemented join operation was examined on each phase, and the effect of parallel processing by the multiprocessor and the multiple disks was clarified. It was concluded from the experimental result that on such a shared-everything multiprocessor system parallelism could be easily exploited for the construction of high-performance relational database systems ICDE Deleted Tuples are Useful when Updating through Universal Scheme Interfaces. Dominique Laurent,Viet Phan Luong,Nicolas Spyratos 1992 The authors present a novel approach to database updating through universal scheme interfaces. The main contribution of the approach is the elimination of non-determinism. That is, contrary to most other approaches, inserting or deleting a tuple can always be done without having to make any choice at all. Tuples that have been deleted from the database are explicitly stored and are used subsequently in order to invalidate certain derivations. Updates are performed in a monotonic manner, and updates satisfy the property of reversibility ICDE An Efficient Object-based Algorithm for Spatial Searching, Insertion and Deletion. Jui-Tine Lee,Geneva G. Belford 1992 The authors propose an object-based index structure for manipulating spatial objects with non-zero size. They introduce the main ideas of the proposed index structure. A detailed description of the algorithms is given for searching, insertion, and deletion in a database system with a high frequency of retrievals and a low frequency of insertions and deletions.
The algorithms are then described for retrievals, insertions, and deletions for a database system with a nearly equal frequency of retrievals, insertions and deletions ICDE Distance-Associated Join Indices for Spatial Range Search. Wei Lu,Jiawei Han 1992 A distance-associated join index structure is developed to speed up spatial queries, especially for spatial range queries. Three distance-associated join indexing mechanisms: basic, ring-structured, and hierarchical, are presented and studied. The analysis and performance study shows that distance-associated spatial join indices substantially improve the performance of spatial queries, and different structures are best suited for different applications ICDE On Semantic Query Optimization in Deductive Databases. Laks V. S. Lakshmanan,Rokia Missaoui 1992 On Semantic Query Optimization in Deductive Databases. ICDE Logical Database Design with Inclusion Dependencies. Tok Wang Ling,Cheng Hian Goh 1992 Classical data dependencies are oblivious to important constraints which may exist between sets of attributes occurring in different relation schemes. The authors study how inclusion dependencies can be used to model these constraints, leading to the design of better database schemes. A normal form called the inclusion normal form (IN-NF) is proposed. Unlike classical normal forms, the IN-NF characterizes a database scheme as a whole rather than the individual relation schemes. It is shown that a database scheme in IN-NF is always in improved third normal form, while the converse is not true. It is demonstrated that the classical relational design framework may be extended to facilitate the design of database schemes in IN-NF ICDE A Relation Merging Technique for Relational Databases. Victor M. Markowitz 1992 A merging technique for relational schemas consisting of relation-schemes, key dependencies, referential integrity constraints, and null constraints is presented. The author examines the conditions required for using this technique with relational database management systems that provide different mechanisms for maintaining null and referential integrity constraints. For relational schemas developed using an extended entity-relationship (EER) oriented design methodology, it is shown that a relation-scheme can be used for representing multiple object-sets not only for the standard binary many-to-one relationship-set structure, but for more complex structures as well ICDE Database Structure and Discovery Tools for Integrated Circuit Reliability Evaluation. Paola Mauri 1992 The reliability performance of integrated circuits is described by means of a large amount of quantitative and qualitative data that require computer tools for effective management. The author describes some design solutions in the implementation of these tools, in particular stressing the integration of different points of view to model reliability performance; the structure of the failure database that, following expert reasoning features, implicitly contains as cause-effect relationships the results of the failure analysis; and the procedures designed to discover regularities and relationships among stored data, thus helping engineers in reliability evaluation and failure analysis ICDE Processing Hierarchical Queries in Heterogeneous Environment. Weiyi Meng,Clement T. Yu,Won Kim 1992 Processing Hierarchical Queries in Heterogeneous Environment. ICDE An Extensible Object-Oriented Database Testbed. Magdi M. A. Morsi,Shamkant B. 
Navathe,Hyoung-Joo Kim 1992 The authors describe the object-oriented design and implementation of an extensible schema manager for object-oriented databases. The open class hierarchy approach has been adopted to achieve the extensibility of the implementation. In this approach, the system meta information is implemented as objects of system classes. A graphical interface for an object-oriented database schema environment, GOOSE, has been developed. GOOSE supports several advanced features which include schema evolution, schema versioning, and DAG (directed acyclic graph) rearrangement view of a class hierarchy. Schema evolution is the ability to make a variety of changes to a database schema without reorganization. Schema versioning is the ability to define multiple schema versions and to keep track of schema changes. A novel type of view for object-oriented databases, the DAG rearrangement view of a class hierarchy, is also supported. ICDE Database Recovery Using Redundant Disk Arrays. Antoine N. Mourad,W. Kent Fuchs,Daniel G. Saab 1992 Database Recovery Using Redundant Disk Arrays. ICDE Relational Databases with Exclusive Disjunctions. Adegbemiga Ola 1992 The author presents a mechanism for representing exclusive disjunctive information in database tables using various tuple types and a range for the count of the number of tuples in the unknown relation denoted by a table. The relational algebra operators are extended to take the new tables as operands. Query evaluation in the extended model is sound and complete for relational algebra expressions consisting of projection, difference, Cartesian product, or selection operators. Possible storage structures for storing the base tables and algorithms for inserting tuples into a table are described. ICDE Maintenance of Materialized Views of Sampling Queries. Frank Olken,Doron Rotem 1992 The authors discuss materialized views of random sampling queries of a relational database. They show how to maintain such views in the presence of insertions, deletions, and updates of the base relations. The basic idea is to reuse the maximal portion of the original sample when constructing the updated sample. The results are based on a synthesis of view update techniques and sampling algorithms. It is demonstrated that maintenance of materialized sample views may be substantially cheaper than resampling. ICDE Concurrent File Reorganization for Record Clustering: A Performance Study. Edward Omiecinski,Liehuey Lee,Peter Scheuermann 1992 "The authors present a performance analysis of a concurrent file reorganization algorithm. They examined the effect of buffer size, degree of reorganization, and write probability of transactions on system throughput. The problem of file reorganization considered involves altering the placement of records on pages of a secondary storage device. This reorganization must be done in-place. The approach is appropriate for a non-in-place reorganization. The motivation for such a physical change is to improve the database system's performance, by minimizing the number of page accesses made in answering a set of queries. It is shown through simulation that the algorithm, when run concurrently with user transactions, provides an acceptable level of overall database system performance" ICDE Processing Real-Time, Non-Aggregate Queries with Time-Constraints in CASE-DB. Gultekin Özsoyoglu,Kaizheng Du,Sujatha Guru Swamy,Wen-Chi Hou 1992 Processing Real-Time, Non-Aggregate Queries with Time-Constraints in CASE-DB.
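A note on the Olken and Rotem entry above (maintenance of materialized sample views): the idea of reusing the existing sample rather than resampling can be illustrated with a minimal Python sketch based on reservoir sampling. The class and method names below are hypothetical and the handling of deletions is deliberately simplified; this is not the authors' algorithm, only an illustration of the general approach.

    import random

    class MaterializedSampleView:
        """Illustrative sketch only: keep a fixed-size uniform random sample of a
        base relation up to date under insertions and deletions, reusing the
        existing sample instead of resampling from scratch."""

        def __init__(self, sample_size):
            self.sample_size = sample_size
            self.sample = []        # the materialized sample view
            self.seen = 0           # number of base-relation insertions observed

        def insert(self, row):
            # Classic reservoir sampling: after n insertions, each inserted row
            # is in the sample with probability sample_size / n.
            self.seen += 1
            if len(self.sample) < self.sample_size:
                self.sample.append(row)
            else:
                j = random.randrange(self.seen)
                if j < self.sample_size:
                    self.sample[j] = row

        def delete(self, row):
            # If the deleted base row is in the sample, drop it; a real
            # maintenance scheme would backfill from the base relation to
            # restore the target sample size and keep the sample uniform.
            if row in self.sample:
                self.sample.remove(row)

For example, feeding every base-relation insertion through insert() keeps the view a uniform sample without ever rescanning the base relation, which is the cost saving the abstract refers to.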
ICDE A Keying Method for a Nested Relational Database Management System. Z. Meral Özsoyoglu,Jian Wang 1992 A Keying Method for a Nested Relational Database Management System. ICDE Prefetching with Multiple Disks for External Mergesort: Simulation and Analysis. Vinay S. Pai,Peter J. Varman 1992 The authors present a simulation study of multiple disk systems to improve the input/output (I/O) performance of multiway merging. With the increase in the size of main memory in computer systems, multiple disks and aggressive prefetching can be used to significantly reduce I/O time. Two prefetching strategies, intra-run and inter-run, for external merging using multiple disks were studied. Their performance was evaluated, and simple analytical expressions are derived to explain their asymptotic behavior. The results indicate that a combination of the strategies can result in a significant reduction in I/O time. ICDE A Periodic Deadlock Detection and Resolution Algorithm with a New Graph Model for Sequential Transaction Processing. Young Chul Park,Peter Scheuermann,Sang Ho Lee 1992 The authors address the deadlock problem in sequential transaction processing where the strict two-phase locking and the multiple granularity locking protocol with five lock modes are used. The scheduling policy honors lock requests on a first-in-first-out basis except for lock conversions. As a basic tool, a directed graph model called the holder/waiter-transaction waited-by graph (H/W-TWBG) is introduced to capture the precise status of systems in terms of deadlock. The properties of H/W-TWBG are presented. Based on H/W-TWBG, the identification principles of the victim candidates are established in a deadlock cycle, and a periodic deadlock detection and resolution algorithm which has a reasonable time and storage complexity is presented. One important feature of the deadlock resolution scheme is that some deadlocks can be resolved without aborting any transaction. ICDE Parallel Algorithms for Executing Joins on Cube-Connected Multicomputers. Manuel A. Penaloza,Esen A. Ozkarahan 1992 Parallel Algorithms for Executing Joins on Cube-Connected Multicomputers. ICDE A Fault-Tolerant Algorithm for Replicated Data Management. Sampath Rangarajan,Sanjeev Setia,Satish K. Tripathi 1992 In this paper, we examine the tradeoff between message overhead and data availability that arises in the design of fault-tolerant algorithms for replicated data management in distributed systems. We propose a property called asymptotically high resiliency which is useful for evaluating the fault-tolerance of replica control algorithms and distributed mutual exclusion algorithms. We present a new algorithm for replica control that can be tailored (through a design parameter) to achieve the desired balance between low message overhead and high data availability. Further, we show that for a message overhead of O(√(N log N)), our algorithm can achieve asymptotically high resiliency. ICDE Object-Oriented Models for Heterogeneous Multidatabase Management Systems. Ming-Chien Shan 1992 Object-Oriented Models for Heterogeneous Multidatabase Management Systems. ICDE Probabilistic Diagnosis of Hot Spots. Kenneth Salem,Daniel Barbará,Richard J. Lipton 1992 Probabilistic Diagnosis of Hot Spots. ICDE MoBiLe Files and Efficient Processing of Path Queries on Scientific Data. Shashi Shekhar,Toneluh Andrew Yang 1992 Efforts in database design for observational scientific data concerned with path query specification and an access method design are discussed.
The goal is to understand the issues related to scientific data and computations. The authors propose a representation of path queries and an access method, MoBiLe files, to capture space-time continuity. A survey on spatial indexing methods is given. The model of scientific data and computation is described. An example of scientific databases is also given. Path queries are then specified. The MoBiLe mapping functions and MoBiLe file design are described. The experiments used to verify the access methods and the data model are outlined. The experimental results, observations, and analysis are presented. ICDE Object-Oriented Modeling and Design of Coupled Knowledge-base/Database Systems. Olivia R. Liu Sheng,Chih-Ping Wei 1992 The objective is to develop a structured object-oriented modeling and design methodology for coupled knowledge-base/database (KB/DB) systems by exploring the useful principles and features of object-oriented modeling and software development techniques. The methodology uses a synthesized object-oriented entity-relationship model for representing the knowledge and the embedded data semantics involved in coupled KB/DB systems. An associated design procedure is presented. This methodology improves on existing coupled KB/DB design methods because of its well-defined constructs that deal with various forms of knowledge involved in data processing, knowledge-based problem solving and object-oriented reasoning. ICDE Utilization of External Foreign Computation Services. Hans-Jörg Schek 1992 Utilization of External Foreign Computation Services. ICDE An Integrated Real-Time Locking Protocol. Sang Hyuk Son,Seog Park,Yi Lin 1992 The authors examine a priority-driven locking protocol called the integrated real-time locking protocol. They show that this protocol is free of deadlock and, in addition, that a high-priority transaction is not blocked by uncommitted lower-priority transactions. The protocol does not assume any knowledge about the data requirements or the execution time of each transaction. This makes the protocol widely applicable, since in many actual environments such information may not be readily available. Using a database prototyping environment, it was shown that the proposed protocol offers a performance improvement over the two-phase locking protocol. ICDE An Index Implementation Supporting Fast Recovery for the POSTGRES Storage System. Mark Sullivan,Michael A. Olson 1992 The authors present two algorithms for maintaining B-tree index consistency in a database management system which does not use write-ahead logging (WAL). One algorithm is similar to shadow paging, but improves performance by integrating shadow meta-data with index meta-data. The other algorithm uses a two-phase page reorganization scheme to reduce the space overhead caused by shadow paging. Although designed for the POSTGRES storage system, these algorithms should also be useful in a WAL-based storage system, as support for logical logging. Measurements and analysis of a prototype implementation suggest that the algorithms will have little impact on data manager performance. ICDE Thrashing in Two-Phase Locking Revisited. Alexander Thomasian 1992 Thrashing in Two-Phase Locking Revisited. ICDE Query Optimization for KBMSs: Temporal, Syntactic and Semantic Transformations.
Thodoros Topaloglou,Arantza Illarramendi,Licia Sbattella 1992 Query Optimization for KBMSs: Temporal, Syntactic and Semantic Transformations. ICDE Optimal Versioning of Objects. Vassilis J. Tsotras,B. Gopinath 1992 The purpose of versioning is to reconstruct any past state of an object class. The authors show that access to any past version is possible in almost constant time, while the space used is only linear in the number of changes occurring in the class evolution. As a result, versioning with fast reconstruction can be supported in an object-oriented environment without using excessive space requirements. It is also proved that the solution is optimal among all approaches that use the same space limitations. A crucial characteristic of the results is that they can be easily implemented on a storage facility that uses a magnetic disk and an optical disk. ICDE The Implementation and Evaluation of Integrity Maintenance Rules in an Object-Oriented Database. Susan Darling Urban,Anton P. Karadimce,Ravi B. Nannapaneni 1992 The authors describe an approach to the declarative representation of integrity constraints in an object-oriented database and the use of integrity maintenance rules for the active maintenance of constraints. A semantic data model is used to automatically generate class definitions and state-altering database operations with constraints represented as objects in the database. Integrity maintenance production rules are automatically generated from constraints and stored as extensions to class operations, hiding the details of constraint checking and rule triggering. High-level transactions call state-altering operations and invoke the integrity maintenance process at commit time. Integrity constraints are declaratively represented in the database system, with operations encapsulating rules about how to respond to constraint violations. An analysis of problems associated with cyclic and anomalous rule behavior is also presented. ICDE Prepare and Commit Certification for Decentralized Transaction Management in Rigorous Heterogeneous Multidatabases. Jari Veijalainen,Antoni Wolski 1992 Algorithms for scheduling of distributed transactions in a heterogeneous multidatabase, in the presence of failures, are presented. The algorithms of prepare certification and commit certification protect against serialization errors called global view distortions and local view distortions. View serializable overall histories are guaranteed in the presence of most typical failures. The assumptions are, among others, that the participating database systems produce rigorous histories and that no local transaction may update the data accessed by a global transaction that is in the prepared state. The main advantage of the method, as compared to other known solutions, is that it is totally decentralized. ICDE A Performance Comparison of the Rete and TREAT Algorithms for Testing Database Rule Conditions. Yu-Wang Wang,Eric N. Hanson 1992 The authors present the results of a simulation comparing the performance of the two most widely used production rule condition testing algorithms, Rete and TREAT, in the context of a database rule system. The results show that TREAT almost always outperforms Rete. TREAT requires less storage than Rete, and is less sensitive to optimization decisions than Rete. Based on these results, it is concluded that TREAT is the preferred algorithm for testing join conditions of database rules.
Since Rete does outperform TREAT in some cases, this study suggests a next step which would be to develop a hybrid version of Rete and TREAT with an optimizer that would decide which strategy to use based on the rule definition and statistics about the data and update patterns. ICDE Divergence Control for Epsilon-Serializability. Kun-Lung Wu,Philip S. Yu,Calton Pu 1992 The authors present divergence control methods for epsilon-serializability (ESR) in centralized databases. ESR alleviates the strictness of serializability (SR) in transaction processing by allowing for limited inconsistency. The bounded inconsistency is automatically maintained by divergence control (DC) methods in a way similar to the manner in which SR is maintained by concurrency control mechanisms, but DC for ESR allows more concurrency. Concrete representative instances of divergence-control methods are described based on two-phase locking, timestamp ordering, and optimistic approaches. The applicability of ESR is demonstrated by presenting the designs of DC methods using other well-known inconsistency specifications, such as absolute value, age, and total number of nonserializably read data items. ICDE A Uniform Model for Temporal Object-Oriented Databases. Gene T. J. Wuu,Umeshwar Dayal 1992 A temporal object-oriented model and query language that supports the modeling and manipulation of complex temporal or versioned objects is developed. The authors show that the approach not only provides a richer model than the relational model for capturing the semantics of complex temporal objects, but also requires no special constructs in the query language. Consequently, the retrieval of temporal and non-temporal information is uniformly expressed. By allowing variables and quantifiers to range over time, queries can be formulated that would require special operators in other languages. Temporal aggregation queries, which are not easily expressed in other models, are expressed using the same aggregation operators as for nontemporal data. ICDE Hot-Spot Based Composition Algorithm. Shu-Shang Wei,Yao-Nan Lien,Dik Lun Lee,Ten-Hwang Lai 1992 Hot-Spot Based Composition Algorithm. ICDE Dynamic Self-Configuring Methods for Graphical Presentation of ODBMS Objects. Randal V. Zoeller,Douglas K. Barry 1992 "The authors describe a preliminary implementation of the self-configuring methods for automatic representation of time-altered ITASCA entities (SMARTIE) system. SMARTIE defines a set of visual display methods for the ITASCA distributed object database management system. These dynamic methods provide a framework that assists the user in visually browsing instance objects in the ITASCA data space. The methods also support graphic in-place editing of existing instances as well as creation of new instances. The querying and presentation of instances requires no programming by the user or the designer, and support for new classes is provided simply by inheriting the methods. To coexist with ITASCA's dynamic schema modification, the SMARTIE methods automatically reconfigure themselves to reflect the current schema definition" ICDE Effect of System Dynamics on Coupling Architectures for Transaction Processing. Philip S. Yu,Asit Dan 1992 The authors present a comparison of the resilience of performance to system dynamics for three multinode architectures for transaction processing. They describe the different architectures considered. The issues of system dynamics are addressed. The performance model is outlined.
Three specific scenarios are considered: (1) a sudden load surge in one of the transaction classes, (2) varying transaction rates for all transaction classes, and (3) failure of a single node. It was found that the different architectures require different amounts of capacity to be reserved to cope with these dynamic situations. Quantitative comparisons of the three architectures are given. SIGMOD Conference Fast Search in Main Memory Databases. Anastasia Analyti,Sakti Pramanik 1992 The objective of this paper is to develop and analyze high performance hash based search methods for main memory databases. We define optimal search in main memory databases as the search that requires at most one key comparison to locate a record. Existing hashing techniques become impractical when they are adapted to yield optimal search in main memory databases because of their large directory size. Multi-directory hashing techniques can provide significantly improved directory utilization over single-directory hashing techniques. A multi-directory hashing scheme, called fast search multi-directory hashing, and its generalization, called controlled search multi-directory hashing, are presented. Both methods achieve linearly increasing expected directory size with the number of records. Their performance is compared to existing alternatives. SIGMOD Conference Behavior of Database Production Rules: Termination, Confluence, and Observable Determinism. Alexander Aiken,Jennifer Widom,Joseph M. Hellerstein 1992 Static analysis methods are given for determining whether arbitrary sets of database production rules are (1) guaranteed to terminate; (2) guaranteed to produce a unique final database state; (3) guaranteed to produce a unique stream of observable actions. When the analysis determines that one of these properties is not guaranteed, it isolates the rules responsible for the problem and determines criteria that, if satisfied, guarantee the property. The analysis methods are presented in the context of the Starburst Rule System; they will form the basis of an interactive development environment for Starburst rule programmers. SIGMOD Conference Using Delayed Commitment in Locking Protocols for Real-Time Databases. Divyakant Agrawal,Amr El Abbadi,Richard Jeffers 1992 In this paper, we propose locking protocols that are useful for real-time databases. Our approach is motivated by two main observations. First, locking protocols are widely accepted and used in most database systems. Second, in real-time databases it has been shown that the blocking behavior of transactions in locking protocols results in performance degradation. We use a new relationship between locks called ordered sharing to eliminate blocking that arises in the traditional locking protocols. Ordered sharing eliminates blocking of read and write operations but may result in delayed commitment. Since in real-time databases, timeliness and not response time is the crucial factor, our protocols exploit this delay to allow transactions to execute within the slacks of delayed transactions. We compare the performance of the proposed protocols with the two-phase locking protocol for real-time databases. Our experiments indicate that the proposed protocols significantly reduce the percentage of missed deadlines in the system for a variety of workloads. SIGMOD Conference The Term Retrieval Machine. Michael Ley 1992 The Term Retrieval Machine. SIGMOD Conference Using Multiversioning to Improve Performance Without Loss of Consistency.
Roger Bamford 1992 Using Multiversioning to Improve Performance Without Loss of Consistency. SIGMOD Conference The O2 Object-Oriented Database System. François Bancilhon 1992 The O2 Object-Oriented Database System. SIGMOD Conference An Efficient Scheme for Providing High Availability. Anupam Bhide,Ambuj Goyal,Hui-I Hsiao,Anant Jhingran 1992 "Replication at the partition level is a promising approach for increasing availability in a Shared Nothing architecture. We propose an algorithm for maintaining replicas with little overhead during normal failure-free processing. Our mechanism updates the secondary replica in an asynchronous manner: entire dirty pages are sent to the secondary at some time before they are discarded from the primary's buffer. A log server node (hardened against failures) maintains the log for each node. If a primary node fails, the secondary fetches the log from the log server, applies it to its replica, and brings itself to the primary's last transaction-consistent state. We study the performance of various policies for sending pages to the secondary and the corresponding trade-offs between recovery time and overhead during failure-free processing." SIGMOD Conference Implementation of General Constraints in SIM. Richard Bigelow 1992 Implementation of General Constraints in SIM. SIGMOD Conference The Performance of Three Database Storage Structures for Managing Large Objects. Alexandros Biliris 1992 This study analyzes the performance of the storage structures and algorithms employed in three experimental database storage systems – EXODUS, Starburst, and EOS – for managing large unstructured general-purpose objects. All three mechanisms are segment-based in that the large object is stored in a sequence of segments, each consisting of physically contiguous disk blocks. To analyze the algorithms we measured object creation time, sequential scan time, storage utilization in the presence of updates, and the I/O cost of random reads, inserts, and deletes. SIGMOD Conference ITASCA Distributed ODBMS. Douglas K. Barry 1992 ITASCA Distributed ODBMS. SIGMOD Conference Extending Ingres with Methods and Triggers. Fred Carter 1992 Extending Ingres with Methods and Triggers. SIGMOD Conference Conceptual Document Browsing and Retrieval in Kabiria. Augusto Celentano,Maria Grazia Fugini,Silvano Pozzi 1992 Conceptual Document Browsing and Retrieval in Kabiria. SIGMOD Conference Distribution, Parallelism, and Availability in NonStop SQL. Pedro Celis 1992 Distribution, Parallelism, and Availability in NonStop SQL. SIGMOD Conference The Design and Implementation of Persistent Transactions in an Object Database System. Hong-Tai Chou 1992 The Design and Implementation of Persistent Transactions in an Object Database System. SIGMOD Conference A General Framework for the Optimization of Object-Oriented Queries. Sophie Cluet,Claude Delobel 1992 The goal of this work is to integrate in a general framework the different query optimization techniques that have been proposed in the object-oriented context. As a first step, we focus essentially on the logical aspect of query optimization. In this paper, we propose a formalism (i) that unifies different rewriting formalisms, (ii) that allows easy and exhaustive factorization of duplicated subqueries, and (iii) that supports heuristics in order to reduce the optimization rewriting phase. SIGMOD Conference Scientific Data Management: Real-World Issues and Requirements. Paula J. Cowley 1992 Scientific Data Management: Real-World Issues and Requirements.
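A note on the Bhide, Goyal, Hsiao, and Jhingran entry above (asynchronous replica maintenance with a log server): the mechanism described in the abstract can be pictured with a short Python sketch. All class and method names below are hypothetical, concurrency and multi-node details are omitted, and this is only an illustration under those assumptions, not the paper's algorithm.

    class LogServer:
        """Hypothetical hardened node that keeps the primary's log."""
        def __init__(self):
            self.log = []                     # list of (lsn, page_id, new_value)
        def append(self, record):
            self.log.append(record)

    class Secondary:
        def __init__(self):
            self.pages = {}                   # page_id -> (value, lsn)
        def install_page(self, page_id, value, lsn):
            # Entire dirty pages arrive asynchronously from the primary.
            self.pages[page_id] = (value, lsn)
        def take_over(self, log_server):
            # On primary failure: replay log records newer than each page's LSN
            # to reach the primary's last transaction-consistent state.
            for lsn, page_id, new_value in log_server.log:
                _, page_lsn = self.pages.get(page_id, (None, -1))
                if lsn > page_lsn:
                    self.pages[page_id] = (new_value, lsn)

    class Primary:
        def __init__(self, log_server, secondary):
            self.log_server, self.secondary = log_server, secondary
            self.buffer = {}                  # page_id -> (value, lsn)
            self.next_lsn = 0
        def update(self, page_id, value):
            lsn, self.next_lsn = self.next_lsn, self.next_lsn + 1
            self.log_server.append((lsn, page_id, value))   # log record goes out first
            self.buffer[page_id] = (value, lsn)              # page stays dirty locally
        def evict(self, page_id):
            # Ship the whole dirty page to the secondary before discarding it,
            # so failure-free processing incurs little synchronous overhead.
            value, lsn = self.buffer.pop(page_id)
            self.secondary.install_page(page_id, value, lsn)

The point of the sketch is the division of labor the abstract describes: pages flow to the secondary lazily, the log server preserves everything needed to close the gap, and take_over() performs the catch-up only when the primary fails.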
SIGMOD Conference DOODLE: A Visual Language for Object-Oriented Databases. Isabel F. Cruz 1992 In this paper we introduce DOODLE, a new visual and declarative language for object-oriented databases. The main principle behind the language is that it is possible to display and query the database with arbitrary pictures. We allow the user to tailor the display of the data to suit the application at hand or her preferences. We want the user-defined visualizations to be stored in the database, and the language to express all kinds of visual manipulations. For extendibility reasons, the language is object-oriented. The semantics of the language is given by a well-known deductive query language for object-oriented databases. We hope that the formal basis of our language will contribute to the theoretical study of database visualizations and visual query languages, a subject that we believe is of great interest, but largely left unexplored. SIGMOD Conference Performance Analysis of Coherency Control Policies through Lock Retention. Asit Dan,Philip S. Yu 1992 Buffer coherency control can be achieved through retaining a lock (shared, exclusive, etc.) on each page in the buffer, even after the requesting transaction has committed. Depending upon the lock mode held for retention and the compatibility of lock modes specified, different retention policies can be devised. In addition to tracking the validity of the buffered data granules, additional capabilities can be provided such as deferred writes to support no-force policy on commit, (node) location identification of valid granules to support remote memory accesses, and shared/exclusive lock retention to reduce the number of global lock requests for concurrency control. However, these can have serious implications not only on the performance but also on the recovery complexity. In this paper, five different integrated coherency policies are considered. We classify these policies into three different categories according to their recovery requirements. A performance study based on analytic models is provided to understand the trade-offs on both maximum throughputs and response times of the policies with a similar level of recovery complexity and the performance gain achievable through increasing the level of recovery complexity. SIGMOD Conference Parallel Index Building in Informix OnLine 6.0. Wayne Davison 1992 Parallel Index Building in Informix OnLine 6.0. SIGMOD Conference A Concurrency Model for Transaction Management. Marc Descollonges 1992 A Concurrency Model for Transaction Management. SIGMOD Conference "Access to Data in NASA's Earth Observing System." Jeff Dozier 1992 "Access to Data in NASA's Earth Observing System." SIGMOD Conference Crash Recovery in Client-Server EXODUS. Michael J. Franklin,Michael J. Zwilling,C. K. Tan,Michael J. Carey,David J. DeWitt 1992 In this paper, we address the correctness and performance issues that arise when implementing logging and crash recovery in a page-server environment. The issues result from two characteristics of page-server systems: 1) the fact that data is modified and cached in client database buffers that are not accessible by the server, and 2) the performance and cost trade-offs that are inherent in a client-server environment. We describe a recovery system that we have implemented for the client-server version of the EXODUS storage manager. 
The implementation supports efficient buffer management policies, allows flexibility in the interaction between clients and the server, and reduces the server load by generating log records at clients. We also present a preliminary performance analysis of the implementation. SIGMOD Conference Query Optimization for Parallel Execution. Sumit Ganguly,Waqar Hasan,Ravi Krishnamurthy 1992 The decreasing cost of computing makes it economically viable to reduce the response time of decision support queries by using parallel execution to exploit inexpensive resources. This goal poses the following query optimization problem: Minimize response time subject to constraints on throughput, which we motivate as the dual of the traditional DBMS problem. We address this novel problem in the context of Select-Project-Join queries by extending the execution space, cost model and search algorithm that are widely used in commercial DBMSs. We incorporate the sources and deterrents of parallelism in the traditional execution space. We show that a cost model can predict response time while accounting for the new aspects due to parallelism. We observe that the response time optimization metric violates a fundamental assumption in the dynamic programming algorithm that is the linchpin in the optimizers of most commercial DBMSs. We extend dynamic programming and show how optimization metrics which correctly predict response time may be designed. SIGMOD Conference Database and Transaction Processing Benchmarks. Jim Gray 1992 Database and Transaction Processing Benchmarks. SIGMOD Conference Event Specification in an Active Object-Oriented Database. Narain H. Gehani,H. V. Jagadish,Oded Shmueli 1992 The concept of a trigger is central to any active database. Upon the occurrence of a trigger event, the trigger is “fired”, i.e., the trigger action is executed. We describe a model and a language for specifying basic and composite trigger events in the context of an object-oriented database. The specified events can be detected efficiently using finite automata. We integrate our model with O++, the database programming language for the Ode object database being developed at AT&T Bell Labs. We propose a new Event-Action model, which folds into the event specification the condition part of the well-known Event-Condition-Action model and avoids the multiple coupling modes between the event, condition, and action trigger components. SIGMOD Conference PRIMA - A Database System Supporting Dynamically Defined Composite Objects. Michael Gesmann,Andreas Grasnickel,Theo Härder,Christoph Hübel,Wolfgang Käfer,Bernhard Mitschang,Harald Schöning 1992 PRIMA - A Database System Supporting Dynamically Defined Composite Objects. SIGMOD Conference A Performance Analysis of Alternative Multi-Attribute Declustering Strategies. Shahram Ghandeharizadeh,David J. DeWitt,Waheed Qureshi 1992 "During the past decade, parallel database systems have gained increased popularity due to their high performance, scalability and availability characteristics. With the predicted future database sizes and the complexity of queries, the scalability of these systems to hundreds and thousands of processors is essential for satisfying the projected demand. Several studies have repeatedly demonstrated that both the performance and scalability of a parallel database system are contingent on the physical layout of data across the processors of the system.
If the data is not declustered properly, the execution of an operator might waste resources, reducing the overall processing capability of the system. With earlier, single attribute declustering strategies, such as those found in Tandem, Teradata, Gamma, and Bubba parallel database systems, a selection query including a range predicate on any attribute other than the partitioning attribute must be sent to all processors containing tuples of the relation. By directing a query with minimal resource requirements to processors that contain no relevant tuples, the system wastes CPU cycles, communication bandwidth, and I/O bandwidth, reducing its overall processing capability. As a solution, several multi-attribute declustering strategies have been proposed. However, the performance of these declustering techniques has not previously been compared to one another nor with a single attribute partitioning strategy. This paper compares the performance of the Multi-Attribute GrId deClustering (MAGIC) strategy and Bubba's Extended Range Declustering (BERD) strategy with one another and with the range partitioning strategy. Our results indicate that MAGIC outperforms both range and BERD in all experiments conducted in this study." SIGMOD Conference Sequential Sampling Procedures for Query Size Estimation. Peter J. Haas,Arun N. Swami 1992 We provide a procedure, based on random sampling, for estimation of the size of a query result. The procedure is sequential in that sampling terminates after a random number of steps according to a stopping rule that depends upon the observations obtained so far. Enough observations are obtained so that, with a pre-specified probability, the estimate differs from the true size of the query result by no more than a prespecified amount. Unlike previous sequential estimation procedures for queries, our procedure is asymptotically efficient and requires no ad hoc pilot sample or a priori assumptions about data characteristics. In addition to establishing the asymptotic properties of the estimation procedure, we provide techniques for reducing undercoverage at small sample sizes and show that the sampling cost of the procedure can be reduced through stratified sampling techniques. SIGMOD Conference Rule Condition Testing and Action Execution in Ariel. Eric N. Hanson 1992 "This paper describes testing of rule conditions and execution of rule actions in the Ariel active DBMS. The Ariel rule system is tightly coupled with query and update processing. Ariel rules can have conditions based on a mix of patterns, events, and transitions. For testing rule conditions, Ariel makes use of a discrimination network composed of a special data structure for testing single-relation selection conditions efficiently, and a modified version of the TREAT algorithm, called A-TREAT, for testing join conditions. The key modification to TREAT (which could also be used in the Rete algorithm) is the use of virtual α-memory nodes which save storage since they contain only the predicate associated with the memory node instead of copies of data matching the predicate. The rule-action executor in Ariel binds the data matching a rule's condition to the action of the rule at rule fire time, and executes the rule action using the query processor." SIGMOD Conference A High Performance Multiversion Concurrency Control Protocol of Object Databases. Craig Harris,Madhu Reddy,Carl Woolf 1992 A High Performance Multiversion Concurrency Control Protocol of Object Databases.
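A note on the Haas and Swami entry above (sequential sampling for query size estimation): the shape of a sequential procedure, sampling until a stopping rule derived from the observations so far is satisfied, can be shown in a few lines of Python. The function name, parameters, and the simple normal-approximation stopping rule below are illustrative assumptions; the paper's actual procedure and its asymptotic guarantees are more refined.

    import math, random

    def estimate_query_size(partitions, predicate, rel_err=0.1, z=1.96,
                            min_steps=30, max_steps=10000):
        """Illustrative sketch of sequential sampling for query-size estimation
        (not the authors' exact stopping rule): sample partitions with
        replacement until a normal-approximation confidence interval is
        within rel_err of the running estimate."""
        n = total = total_sq = 0
        while n < max_steps:
            part = random.choice(partitions)              # one sampling step
            y = sum(1 for row in part if predicate(row))  # matching rows in this partition
            n += 1
            total += y
            total_sq += y * y
            if n >= min_steps:
                mean = total / n
                var = max(total_sq / n - mean * mean, 0.0)
                if mean > 0 and z * math.sqrt(var / n) <= rel_err * mean:
                    break                                 # stopping rule satisfied
        return (total / n) * len(partitions)              # scale up to the full relation

For example, estimate_query_size(blocks, lambda r: r["age"] > 40) would stop early on selective predicates with low variance and keep sampling longer otherwise, which is the behavior sequential procedures are designed to give without a pilot sample.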
SIGMOD Conference A Qualitative Comparison Study of Data Structures for Large Line Segment Databases. Erik G. Hoel,Hanan Samet 1992 A qualitative comparative study is performed of the performance of three popular spatial indexing methods, the R-tree, the R+-tree, and the PMR quadtree, in the context of processing spatial queries in large line segment databases. The data is drawn from the TIGER/Line files used by the Bureau of the Census to deal with the road networks in the US. The goal is not to find the best data structure as this is not generally possible. Instead, their comparability is demonstrated and an indication is given as to when and why their performance differs. Tests are conducted with a number of large datasets and performance is tabulated in terms of the complexity of the disk activity in building them, their storage requirements, and the complexity of the disk activity for a number of tasks that include point and window queries, as well as finding the nearest line segment to a given point and an enclosing polygon. SIGMOD Conference Exploiting Inter-Operation Parallelism in XPRS. Wei Hong 1992 In this paper, we study the scheduling and optimization problems of parallel query processing using interoperation parallelism in a shared-memory environment and propose our solutions for XPRS. We first study the scheduling problem for a continuous sequence of independent tasks that are either from a bushy tree plan of a single query or from the plans of multiple queries, and present a clean and simple scheduling algorithm. Our scheduling algorithm achieves maximum resource utilizations by running an IO-bound task and a CPU-bound task in parallel with carefully calculated degrees of parallelism and maintains the maximum resource utilizations by dynamically adjusting the degrees of parallelism of running tasks whenever necessary. Real performance figures are shown to confirm the effectiveness of our scheduling algorithm. We also revisit the optimization problem of parallel execution plans of a single query and extend our previous results to consider inter-operation parallelism by introducing a new cost estimation method to the query optimizer based on our scheduling algorithm. SIGMOD Conference Analysis of Recovery in a Database System Using a Write-Ahead Log Protocol. Anant Jhingran,Pratap Khedkar 1992 In this paper we examine the recovery time in a database system using a Write-Ahead Log protocol, such as ARIES [9], under the assumption that the buffer replacement policy is strict LRU. In particular, analytical equations for log read time, data I/O, log application, and undo processing time are presented. Our initial model assumes a read/write ratio of one, and a uniform access pattern. This is later generalized to include different read/write ratios, as well as a “hot set” model (i.e., x% of the accesses go to y% of the data). We show that in the uniform access model, recovery is dominated by data I/O costs, but under extreme hot-set conditions, this may no longer be true. Furthermore, since we derive analytical equations, recovery can be analyzed for any set of parameter conditions not discussed here. SIGMOD Conference USD - A Database Management System for Scientific Research. Rowland R. Johnson,Mandy Goldner,Mitch Lee,Keith McKay,Robert Shectman,John Woodruff 1992 USD - A Database Management System for Scientific Research. SIGMOD Conference Realizing a Temporal Complex-Object Data Model.
Wolfgang Käfer,Harald Schöning 1992 Support for temporal data continues to be a requirement posed by many applications such as VLSI design and CAD, but also in conventional applications like banking and sales. Furthermore, the strong demand for complex-object support is known as an inherent fact in design applications, and also emerges for advanced "conventional" applications. Thus, new advanced database management systems should include both features, i.e., should support temporal complex-objects. In this paper, we present such a temporal complex-object data model. The central notion of our temporal complex-object data model is a time slice, representing one state of a complex object. We explain the mapping of time slices onto the complex objects supported by the MAD model (which we use for an example of a non-temporal complex-object data model) as well as the transformation process of operations on temporal complex-objects into MAD model operations. Thereby, the basic properties of the MAD model are a prerequisite for our approach. For example, time slices can only be directly stored, if non-disjunct (i.e., overlapping) complex objects are easily handled in the underlying complex-object data model. SIGMOD Conference Parallel R-trees. Ibrahim Kamel,Christos Faloutsos 1992 We consider the problem of exploiting parallelism to accelerate the performance of spatial access methods and specifically, R-trees [11]. Our goal is to design a server for spatial data, so as to maximize the throughput of range queries. This can be achieved by (a) maximizing parallelism for large range queries, and (b) by engaging as few disks as possible on point queries [22]. We propose a simple hardware architecture consisting of one processor with several disks attached to it. On this architecture, we propose to distribute the nodes of a traditional R-tree, with cross-disk pointers (“Multiplexed” R-tree). The R-tree code is identical to the one for a single-disk R-tree, with the only addition that we have to decide which disk a newly created R-tree node should be stored in. We propose and examine several criteria to choose a disk for a new node. The most successful one, termed “proximity index” or PI, estimates the similarity of the new node with the other R-tree nodes already on a disk, and chooses the disk with the lowest similarity. Experimental results show that our scheme consistently outperforms all the other heuristics for node-to-disk assignments, achieving up to 55% gains over the Round Robin one. Experiments also indicate that the multiplexed R-tree with PI heuristic gives better response time than the disk-striping (=“Super-node”) approach, and imposes lighter load on the I/O sub-system. The speed-up of our method is close to linear, increasing with the size of the queries. SIGMOD Conference High Performance and Availability Through Data Distribution. Jay Kasi 1992 High Performance and Availability Through Data Distribution. SIGMOD Conference Querying Object-Oriented Databases. Michael Kifer,Won Kim,Yehoshua Sagiv 1992 Querying Object-Oriented Databases. SIGMOD Conference Optimization of Object-Oriented Recursive Queries using Cost-Controlled Strategies. Rosana S. G. Lanzelotte,Patrick Valduriez,Mohamed Zaït 1992 Object-oriented data models are being extended with recursion to gain expressive power. This complicates the optimization problem which has to deal with recursive queries on complex objects.
Because unary operations invoking methods or path expressions on objects may be costly to execute, traditional heuristics for optimizing recursive queries are no longer valid. In this paper we propose a cost-based optimization method which handles object-oriented recursive queries. In particular, it is able to delay the decision of pushing selective operations through recursion until the effect of such a transformation can be measured by a cost model. The approach integrates rewriting and increases the optimization opportunities for recursive queries on objects while allowing for efficient optimization. SIGMOD Conference MLR: A Recovery Method for Multi-level Systems. David B. Lomet 1992 To achieve high concurrency in a database system has meant building a system that copes well with important special cases. Recent work on multi-level systems suggest a systematic path to high concurrency. A multi-level system using locks permits restrictive low level locks of a subtransaction to be replaced with less restrictive high level locks when sub-transactions commit, enhancing concurrency. This is possible because sub-transactions can be undone via high level compensation actions rather than by restoring a prior lower level state. We describe a recovery scheme, called Multi-Level Recovery (MLR) that logs this high level undo operation with the commit record for the subtransaction that it compensates, posting log records to only a single log. A variant of the method copes with nested transactions, and both nested and multi-level transactions can be treated in a unified fashion. SIGMOD Conference Access Method Concurrency with Recovery. David B. Lomet,Betty Salzberg 1992 Providing high concurrency in B+-trees has been studied extensively. But few efforts have been documented for combining concurrency methods with a recovery scheme that preserves well-formed trees across system crashes. We describe an approach for this that works for a class of index trees that is a generalization of the Blink-tree. A major feature of our method is that it works with a range of different recovery methods. It achieves this by decomposing structure changes in an index tree into a sequence of atomic actions, each one leaving the tree well-formed and each working on a separate level of the tree. All atomic actions on levels of the tree above the leaf level are independent of database transactions, and so are of short duration. SIGMOD Conference A Transformation-Based Approach to Optimizing Loops in Database Programming Languages. Daniel F. Lieuwen,David J. DeWitt 1992 Database programming languages like O2, E, and O++ include the ability to iterate through a set. Nested iterators can be used to express joins. This paper describes compile-time optimizations similar to relational transformations like join reordering for such programming constructs. This paper also shows how to use a standard transformation-based optimizer to optimize these joins. An optimizer built using the EXODUS Optimizer Generator [GRAE87] was added to the Bell Labs O++ [AGRA89] compiler. We used the resulting optimizing compiler to experimentally validate the ideas in this paper. The experiments show that this technique can significantly improve the performance of database programming languages. SIGMOD Conference H-trees: A Dynamic Associative Search Index for OODB. 
Chee Chin Low,Beng Chin Ooi,Hongjun Lu 1992 The support of the superclass-subclass concept in object-oriented databases (OODB) makes an instance of a subclass also an instance of its superclass. As a result, the access scope of a query against a class in general includes the access scope of all its subclasses, unless specified otherwise. To support the superclass-subclass relationship efficiently, the index must achieve two objectives. First, the index must support efficient retrieval of instances from a single class. Second, it must also support efficient retrieval of instances from classes in a hierarchy of classes. In this paper, we propose a new index called the H-tree that supports efficient retrieval of instances of a single class as well as retrieval of instances of a class and its subclasses. The unique feature of H-trees is that they capture the superclass-subclass relationships. A performance analysis is conducted and both experimental and analytical results indicate that the H-tree is an efficient indexing structure for OODB. SIGMOD Conference The Concurrency Control Problem in Multidatabases: Characteristics and Solutions. Sharad Mehrotra,Rajeev Rastogi,Yuri Breitbart,Henry F. Korth,Abraham Silberschatz 1992 A Multidatabase System (MDBS) is a collection of local database management systems, each of which may follow a different concurrency control protocol. This heterogeneity makes the task of ensuring global serializability in an MDBS environment difficult. In this paper, we reduce the problem of ensuring global serializability to the problem of ensuring serializability in a centralized database system. We identify characteristics of the concurrency control problem in an MDBS environment, and additional requirements on concurrency control schemes for ensuring global serializability. We then develop a range of concurrency control schemes that ensure global serializability in an MDBS environment, and at the same time meet the requirements. Finally, we study the tradeoffs between the complexities of the various schemes and the degree of concurrency provided by each of them. SIGMOD Conference The Sybase Open Server. Paul Melmom 1992 The Sybase Open Server. SIGMOD Conference DIRECT: A Query Facility for Multiple Databases. Ulla Merz,Roger King 1992 The subject of this research project is the architecture and design of a multidatabase query facility. These databases contain structured data, typical for business applications. Problems addressed are: presenting a uniform interface for retrieving data from multiple databases, providing autonomy for the component databases, and defining an architecture for semantic services. DIRECT is a query facility for heterogeneous databases. The databases and their definitions can differ in their data models, names, types, and encoded values. Instead of creating a global schema, descriptions of different databases are allowed to coexist. A multidatabase query language provides a uniform interface for retrieving data from different databases. DIRECT has been exercised with operational databases that are part of an automated business system. SIGMOD Conference ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging. C. Mohan,Frank E. Levine 1992 This paper provides a comprehensive treatment of index management in transaction systems. We present a method, called ARIES/IM (Algorithm for Recovery and Isolation Exploiting Semantics for Index Management), for concurrency control and recovery of B+-trees.
ARIES/IM guarantees serializability and uses write-ahead logging for recovery. It supports very high concurrency and good performance by (1) treating as the lock of a key the same lock as the one on the corresponding record data in a data page (e.g., at the record level), (2) not acquiring, in the interest of permitting very high concurrency, commit duration locks on index pages even during index structure modification operations (SMOs) like page splits and page deletions, and (3) allowing retrievals, inserts, and deletes to go on concurrently with SMOs. During restart recovery, any necessary redos of index changes are always performed in a page-oriented fashion (i.e., without traversing the index tree) and, during normal processing and restart recovery, whenever possible undos are performed in a page-oriented fashion. ARIES/IM permits different granularities of locking to be supported in a flexible manner. A subset of ARIES/IM has been implemented in the OS/2 Extended Edition Database Manager. Since the locking ideas of ARIES/IM have general applicability, some of them have also been implemented in SQL/DS and the VM Shared File System, even though those systems use the shadow-page technique for recovery. SIGMOD Conference Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates. C. Mohan,Inderpal Narang 1992 "As relational DBMSs become more and more popular and as organizations grow, the sizes of individual tables are increasing dramatically. Unfortunately, current DBMSs do not allow updates to be performed on a table while an index (e.g., a B+-tree) is being built for that table, thereby decreasing the systems' availability. This paper describes two algorithms in order to relax this restriction. Our emphasis has been to maximize concurrency, minimize overheads and cover all aspects of the problem. Builds of both unique and nonunique indexes are handled correctly. We also describe techniques for making the index-build operations restartable, without loss of all work, in case a system failure were to interrupt the completion of the creation of the index. In this connection, we also present algorithms for making a long sort of operation restartable. These include algorithms for the sort and merge phases of sorting." SIGMOD Conference Efficient and Flexible Methods for Transient Versioning of Records to Avoid Locking by Read-Only Transactions. C. Mohan,Hamid Pirahesh,Raymond A. Lorie 1992 We present efficient and flexible methods which permit read-only transactions that do not mind reading a possibly slightly old, but still consistent, version of the data base to execute without acquiring locks. This approach avoids the undesirable interferences between such queries and the typically shorter update transactions that cause unnecessary and costly delays. Indexed access by such queries is also supported, unlike by the earlier methods. Old versions of records are maintained only in a transient fashion. Our methods are characterized by their flexibility (number of versions maintained and the timing of version switches, supporting partial rollbacks, and different recovery and buffering methods) and their efficiency (logging, garbage collection, version selection, and incremental, record-level versioning). Distributed data base environments are also supported, including commit protocols with the read-only optimization. We also describe efficient methods for garbage collecting unneeded older versions. SIGMOD Conference Multi-vendor Interoperability Through SQL Access. 
Scott Newmann 1992 Multi-vendor Interoperability Through SQL Access. SIGMOD Conference Architectures for Object Data Management. Jack A. Orenstein 1992 Architectures for Object Data Management. SIGMOD Conference Query Processing in the ObjectStore Database System. Jack A. Orenstein,Sam Haradhvala,Benson Margulies,Don Sakahara 1992 ObjectStore is an object-oriented database system supporting persistence orthogonal to type, transaction management, and associative queries. Collections are provided as objects. The data model is non-1NF, as objects may have embedded collections. Queries are integrated with the host language in the form of query operators whose operands are a collection and a predicate. The predicate may itself contain a (nested) query operating on an embedded collection. Indexes on paths may be added and removed dynamically. Collections, being treated as objects, may be referred to indirectly, e.g., through a by-reference argument. For this reason and others, multiple execution strategies are generated, and a final selection is made just prior to query execution. Nested queries can result in interleaved execution and strategy selection. SIGMOD Conference Improving Fault Tolerance and Supporting Partial Writes in Structured Coterie Protocols for Replicated Objects. Michael Rabinovich,Edward D. Lazowska 1992 This paper presents a new technique for efficiently controlling replicas in distributed systems. Conventional structured coterie protocols are efficient but incur a penalty of reduced availability in exchange for the performance gain. Further, the performance advantage can only be fully realized when write operations always replace the old data item with the new value instead of updating a portion of the data item. Our new approach significantly improves availability while allowing partial write operations. After presenting our general approach, we apply it to an existing structured coterie protocol and analyze the availability of the resulting protocol. We also show that other classes of protocols can make use of our approach. SIGMOD Conference Performance Evaluation of Extended Storage Architectures for Transaction Processing. Erhard Rahm 1992 The use of non-volatile semiconductor memory within an extended storage hierarchy promises significant performance improvements for transaction processing. Although page-addressable semiconductor memories like extended memory, solid-state disks and disk caches are commercially available since several years, no detailed investigation of their use for transaction processing has been performed so far. We present a comprehensive simulation study that compares the performance of these storage types and of different usage forms. The following usage forms are considered: allocation of entire log and database files in non-volatile semiconductor memory, using a so-called write buffer to perform disk writes asynchronously, and caching of database pages at intermediate storage levels (in addition to main memory caching). Simulation results will be presented for the debit-credit workload frequently used in transaction processing benchmarks. SIGMOD Conference Extensible/Rule Based Query Rewrite Optimization in Starburst. Hamid Pirahesh,Joseph M. Hellerstein,Waqar Hasan 1992 This paper describes the Query Rewrite facility of the Starburst extensible database system, a novel phase of query optimization. 
We present a suite of rewrite rules used in Starburst to transform queries into equivalent queries for faster execution, and also describe the production rule engine which is used by Starburst to choose and execute these rules. Examples are provided demonstrating that these Query Rewrite transformations lead to query execution time improvements of orders of magnitude, suggesting that Query Rewrite in general—and these rewrite rules in particular—are an essential step in query optimization for modern database systems. SIGMOD Conference Evaluation of Remote Backup Algorithms for Transaction Processing Systems. Christos A. Polyzois,Hector Garcia-Molina 1992 A remote backup is a copy of a primary database maintained at a geographically separate location and is used to increase data availability. Remote backup systems are typically log-based and can be classified into 2-safe and 1-safe, depending on whether transactions commit at both sites simultaneously or they first commit at the primary and are later propagated to the backup. We have built an experimental database system on which we evaluated the performance of the epoch algorithm, a 1-safe algorithm we have developed, and compared it with the 2-safe approach under various conditions. We also report on the use of multiple log streams to propagate information from the primary to the backup. SIGMOD Conference Rdb/VMS Support for Multi-media Databases. T. K. Rengarajan 1992 Rdb/VMS Support for Multi-media Databases. SIGMOD Conference Administration, Availability, and Development Features of Teradata. Bill Robertson 1992 Administration, Availability, and Development Features of Teradata. SIGMOD Conference What Can We Do to Strengthen the Connection Between Theory and System Builders. Arnon Rosenthal 1992 What Can We Do to Strengthen the Connection Between Theory and System Builders. SIGMOD Conference Simple Rational Guidance for Chopping Up Transactions. Dennis Shasha,Eric Simon,Patrick Valduriez 1992 Chopping transactions into pieces is good for performance but may lead to non-serializable executions. Many researchers have reacted to this fact by either inventing new concurrency control mechanisms, weakening serializability, or both. We adopt a different approach. We assume a user who • has only the degree 2 and degree 3 consistency options offered by the vast majority of conventional database systems; and • knows the set of transactions that may run during a certain interval (users are likely to have such knowledge for online or real-time transactional applications). Given this information, our algorithm finds the finest partitioning of a set of transactions TranSet with the following property: if the partitioned transactions execute serializably, then TranSet executes serializably. This permits users to obtain more concurrency while preserving correctness. Besides obtaining more inter-transaction concurrency, chopping transactions in this way can enhance intra-transaction parallelism. The algorithm is inexpensive, running in O(n x (e + m)) time using a naive implementation, where n is the number of transactions, e is the number of edges in the conflict graph among the transactions, and m is the maximum number of accesses of any transaction. This makes it feasible to add as a tuning knob to practical systems. SIGMOD Conference Compensation-Based On-Line Query Processing. V. Srinivasan,Michael J.
Carey 1992 "It is well known that using conventional concurrency control techniques for obtaining serializable answers to long-running queries leads to an unacceptable drop in system performance. As a result, most current DBMSs execute such queries under a reduced degree of consistency, thus providing non-serializable answers. In this paper, we present a new and highly concurrent approach for processing large decision support queries in relational databases. In this new approach, called compensation-based query processing, concurrent updates to any data participating in a query are communicated to the query's on-line query processor, which then compensates for these updates so that the final answer reflects changes caused by the updates. Very high concurrency is achieved by locking data only briefly, while still delivering transaction-consistent answers to queries." SIGMOD Conference Continuous Queries over Append-Only Databases. Douglas B. Terry,David Goldberg,David A. Nichols,Brian M. Oki 1992 In a database to which data is continually added, users may wish to issue a permanent query and be notified whenever data matches the query. If such continuous queries examine only single records, this can be implemented by examining each record as it arrives. This is very efficient because only the incoming record needs to be scanned. This simple approach does not work for queries involving joins or time. The Tapestry system allows users to issue such queries over a database of mail and bulletin board messages. The user issues a static query, such as “show me all messages that have been replied to by Jones,” as though the database were fixed and unchanging. Tapestry converts the query into an incremental query that efficiently finds new matches to the original query as new messages are added to the database. This paper describes the techniques used in Tapestry, which do not depend on triggers and thus be implemented on any commercial database that supports SQL. Although Tapestry is designed for filtering mail and news messages, its techniques are applicable to any append-only database. SIGMOD Conference On the Performance of Object Clustering Techniques. Manolis M. Tsangaris,Jeffrey F. Naughton 1992 We investigate the performance of some of the best-known object clustering algorithms on four different workloads based upon the tektronix benchmark. For all four workloads, stochastic clustering gave the best performance for a variety of performance metrics. Since stochastic clustering is computationally expensive, it is interesting that for every workload there was at least one cheaper clustering algorithm that matched or almost matched stochastic clustering. Unfortunately, for each workload, the algorithm that approximated stochastic clustering was different. Our experiments also demonstrated that even when the workload and object graph are fixed, the choice of the clustering algorithm depends upon the goals of the system. For example, if the goal is to perform well on traversals of small portions of the database starting with a cold cache, the important metric is the per-traversal expansion factor, and a well-chosen placement tree will be nearly optimal; if the goal is to achieve a high steady-state performance with a reasonably large cache, the appropriate metric is the number of pages to which the clustering algorithm maps the active portion of the database. For this metric, the PRP clustering algorithm, which only uses access probabilities achieves nearly optimal performance. 
SIGMOD Conference Full Distribution in Objectivity/DB. Andrew E. Wade 1992 Full Distribution in Objectivity/DB. SIGMOD Conference Experience from a Real Life Query Optimizer. Yun Wang 1992 Experience from a Real Life Query Optimizer. VLDB An Interval Classifier for Database Mining Applications. Rakesh Agrawal,Sakti P. Ghosh,Tomasz Imielinski,Balakrishna R. Iyer,Arun N. Swami 1992 An Interval Classifier for Database Mining Applications. VLDB Using Flexible Transactions to Support Multi-System Telecommunication Applications. Mansoor Ansari,Linda Ness,Marek Rusinkiewicz,Amit P. Sheth 1992 Using Flexible Transactions to Support Multi-System Telecommunication Applications. VLDB Random Sampling from Pseudo-Ranked B+ Trees. Gennady Antoshenkov 1992 Random Sampling from Pseudo-Ranked B+ Trees. VLDB Resilient Logical Structures for Efficient Management of Replicated Data. Divyakant Agrawal,Amr El Abbadi 1992 Resilient Logical Structures for Efficient Management of Replicated Data. VLDB An Extended Relational Database Model for Uncertain and Imprecise Information. Suk Kyoon Lee 1992 An Extended Relational Database Model for Uncertain and Imprecise Information. VLDB Multiversion Query Locking. Paul M. Bober,Michael J. Carey 1992 Multiversion Query Locking. VLDB Data Management for Real-Time Systems. Alejandro P. Buchmann 1992 Data Management for Real-Time Systems. VLDB A Conceptual Model for Dynamic Clustering in Object Databases. Qing Li,John L. Smith 1992 A Conceptual Model for Dynamic Clustering in Object Databases. VLDB Production Rules in Parallel and Distributed Database Environments. Stefano Ceri,Jennifer Widom 1992 Production Rules in Parallel and Distributed Database Environments. VLDB The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource Managers Using Atomic Commitment. Yoav Raz 1992 The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous Environment of Multiple Autonomous Resource Managers Using Atomic Commitment. VLDB Using Segmented Right-Deep Trees for the Execution of Pipelined Hash Joins. Ming-Syan Chen,Ming-Ling Lo,Philip S. Yu,Honesty C. Young 1992 Using Segmented Right-Deep Trees for the Execution of Pipelined Hash Joins. VLDB Dynamic Data Distribution (D3) in a Shared-Nothing Multiprocessor Data Store. Donald D. Chamberlin,Frank B. Schmuck 1992 Dynamic Data Distribution (D3) in a Shared-Nothing Multiprocessor Data Store. VLDB Extensible Buffer Management of Indexes. Chee Yong Chan,Beng Chin Ooi,Hongjun Lu 1992 Extensible Buffer Management of Indexes. VLDB A Temporal Evolutionary Object-Oriented Data Model and Its Query Language for Medical Image Management. Wesley W. Chu,Ion Tim Ieong,Ricky K. Taira,Claudine M. Breant 1992 A Temporal Evolutionary Object-Oriented Data Model and Its Query Language for Medical Image Management. VLDB Query Optimization in a Heterogeneous DBMS. Weimin Du,Ravi Krishnamurthy,Ming-Chien Shan 1992 Query Optimization in a Heterogeneous DBMS. VLDB A Uniform Approach to Processing Temporal Queries. Umeshwar Dayal,Gene T. J. Wuu 1992 A Uniform Approach to Processing Temporal Queries. VLDB Practical Skew Handling in Parallel Joins. David J. DeWitt,Jeffrey F. Naughton,Donovan A. Schneider,S. Seshadri 1992 Practical Skew Handling in Parallel Joins. VLDB Performance and Scalability of Client-Server Database Architectures. Alex Delis,Nick Roussopoulos 1992 Performance and Scalability of Client-Server Database Architectures. 
VLDB On B-Tree Indices for Skewed Distributions. Christos Faloutsos,H. V. Jagadish 1992 On B-Tree Indices for Skewed Distributions. VLDB Composite Event Specification in Active Databases: Model & Implementation. Narain H. Gehani,H. V. Jagadish,Oded Shmueli 1992 Composite Event Specification in Active Databases: Model & Implementation. VLDB Global Memory Management in Client-Server Database Architectures. Michael J. Franklin,Michael J. Carey,Miron Livny 1992 Global Memory Management in Client-Server Database Architectures. VLDB Incomplete Information in Relational Temporal Databases. Shashi K. Gadia,Sunil S. Nair,Yiu-Cheong Poon 1992 Incomplete Information in Relational Temporal Databases. VLDB Locking and Latching in a Memory-Resident Database System. Vibby Gottemukkala,Tobin J. Lehman 1992 Locking and Latching in a Memory-Resident Database System. VLDB Knowledge Discovery in Databases: An Attribute-Oriented Approach. Jiawei Han,Yandong Cai,Nick Cercone 1992 Knowledge Discovery in Databases: An Attribute-Oriented Approach. VLDB Experiences With an Object Manager for a Process-Centered Environment. Dennis Heimbigner 1992 Experiences With an Object Manager for a Process-Centered Environment. VLDB Querying in Highly Mobile Distributed Environments. Tomasz Imielinski,B. R. Badrinath 1992 Querying in Highly Mobile Distributed Environments. VLDB Parametric Query Optimization. Yannis E. Ioannidis,Raymond T. Ng,Kyuseok Shim,Timos K. Sellis 1992 In most database systems, the values of many important run-time parameters of the system, the data, or the query are unknown at query optimization time. Parametric query optimization attempts to identify at compile time several execution plans, each one of which is optimal for a subset of all possible values of the run-time parameters. The goal is that at run time, when the actual parameter values are known, the appropriate plan should be identifiable with essentially no overhead. We present a general formulation of this problem and study it primarily for the buffer size parameter. We adopt randomized algorithms as the main approach to this style of optimization and enhance them with a sideways information passing feature that increases their effectiveness in the new task. Experimental results of these enhanced algorithms show that they optimize queries for large numbers of buffer sizes in the same time needed by their conventional versions for a single buffer size, without much sacrifice in the output quality and with essentially zero run-time overhead. VLDB Integrity Maintenance in Object-Oriented Databases. H. V. Jagadish,Xiaolei Qian 1992 Integrity Maintenance in Object-Oriented Databases. VLDB Proclamation-Based Model for Cooperating Transactions. H. V. Jagadish,Oded Shmueli 1992 Proclamation-Based Model for Cooperating Transactions. VLDB Optimizing Boolean Expressions in Object-Bases. Alfons Kemper,Guido Moerkotte,Michael Steinbrunn 1992 Optimizing Boolean Expressions in Object-Bases. VLDB Updates in a Rule-Based Language for Objects. Michael Kramer,Georg Lausen,Gunter Saake 1992 Updates in a Rule-Based Language for Objects. VLDB High Throughput Escrow Algorithms for Replicated Databases. Narayanan Krishnakumar,Arthur J. Bernstein 1992 High Throughput Escrow Algorithms for Replicated Databases. VLDB Temporal Query Processing and Optimization in Multiprocessor Database Machines. T. Y. Cliff Leung,Richard R. Muntz 1992 Temporal Query Processing and Optimization in Multiprocessor Database Machines. 
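The parametric query optimization entry by Ioannidis et al. above describes choosing, at compile time, one plan per subset of possible run-time parameter values, so that the actual plan can be picked at run time with essentially no overhead. A minimal sketch of that idea follows; the three plans and their cost formulas are made-up assumptions, and the paper itself uses randomized search rather than this exhaustive comparison.

```python
# A rough sketch of parametric optimization over a single run-time parameter
# (buffer size): at compile time, record the cheapest plan for each candidate
# value; at run time, select a plan by lookup. Plans and costs are hypothetical.
from bisect import bisect_right

PLANS = {  # hypothetical cost of each join plan as a function of buffer pages
    "nested_loop": lambda buf: 2000.0,
    "sort_merge":  lambda buf: 400.0 + 6000.0 / buf,
    "hash_join":   lambda buf: 100.0 + 20000.0 / buf,
}

def optimize_parametrically(buffer_sizes):
    """Compile time: one (buffer size, cheapest plan) entry per candidate value."""
    return [
        (buf, min(PLANS, key=lambda name: PLANS[name](buf)))
        for buf in sorted(buffer_sizes)
    ]

def choose_plan(table, actual_buffer):
    """Run time: pick the plan optimized for the largest candidate <= actual value."""
    sizes = [buf for buf, _ in table]
    i = max(bisect_right(sizes, actual_buffer) - 1, 0)
    return table[i][1]

plan_table = optimize_parametrically([1, 8, 64, 512])
print(plan_table)                                   # nested_loop, sort_merge, hash_join, hash_join
print(choose_plan(plan_table, actual_buffer=100))   # hash_join under this cost model
```

The run-time step is a table lookup, which is what lets the actual parameter value be bound late with negligible overhead.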
VLDB CMD: A Multidimensional Declustering Method for Parallel Data Systems. Jianzhong Li,Jaideep Srivastava,Doron Rotem 1992 CMD: A Multidimensional Declustering Method for Parallel Data Systems. VLDB Activity Model: A Declarative Approach for Capturing Communication Behavior in Object-Oriented Databases. Ling Liu,Robert Meersman 1992 Activity Model: A Declarative Approach for Capturing Communication Behavior in Object-Oriented Databases. VLDB Performance Evaluation of an Adaptive and Robust Load Control Method for the Avoidance of Data-Contention Thrashing. Axel Mönkeberg,Gerhard Weikum 1992 Performance Evaluation of an Adaptive and Robust Load Control Method for the Avoidance of Data-Contention Thrashing. VLDB Database Technology for Reliable Systems: Issues, Impact, and Approaches (Panel). Matthew Morgenstern 1992 Database Technology for Reliable Systems: Issues, Impact, and Approaches (Panel). VLDB Improved Unnesting Algorithms for Join Aggregate SQL Queries. M. Muralikrishna 1992 Improved Unnesting Algorithms for Join Aggregate SQL Queries. VLDB Software Repositories. John Mylopoulos,Thomas Rose 1992 Software Repositories. VLDB "Geographic Information Systems, A Challenge for the 90's (Panel)." Ekow J. Otoo,Ron Lake,Wo-Shun Luk,T. H. Merrett,Hanan Samet 1992 "Geographic Information Systems, A Challenge for the 90's (Panel)." VLDB SVP: A Model Capturing Sets, Lists, Streams, and Parallelism. Douglas Stott Parker Jr.,Eric Simon,Patrick Valduriez 1992 SVP: A Model Capturing Sets, Lists, Streams, and Parallelism. VLDB An Information-Retrieval Approach for Image Databases. Fausto Rabitti,Pasquale Savino 1992 An Information-Retrieval Approach for Image Databases. VLDB CORAL - Control, Relations and Logic. Raghu Ramakrishnan,Divesh Srivastava,S. Sudarshan 1992 CORAL - Control, Relations and Logic. VLDB A Multi-Resolution Relational Data Model. Robert L. Read,Donald S. Fussell,Abraham Silberschatz 1992 The use of data at different levels of information content is essential to the performance of multimedia, scientific, and other large databases because it can significantly decrease I/O and communication costs. The performance advantages of such a multi-resolution scheme can only be fully exploited by a data model that supports the convenient retrieval of data at different levels of information content. In this paper we extend the relational data model to support multi-resolution data retrieval. In particular, we introduce a new partial set construct, called the sandbag, that can support multi-resolution for the types of data used in a wide variety of next-generation database applications, as well as traditional applications. We extend the relational algebra operators to analogous operators on sandbags. The resulting extension of the relational algebra is sound and forms a foundation for future database management systems that support these types of next-generation applications. VLDB Supporting Lists in a Data Model (A Timely Approach). Joel E. Richardson 1992 Supporting Lists in a Data Model (A Timely Approach). VLDB Multiview: A Methodology for Supporting Multiple Views in Object-Oriented Databases. Elke A. Rundensteiner 1992 Multiview: A Methodology for Supporting Multiple Views in Object-Oriented Databases. VLDB Multidatabase Applications: Semantic and System Issues. Marek Rusinkiewicz,Amit P. Sheth 1992 Multidatabase Applications: Semantic and System Issues. VLDB Principles of Transaction-Based On-Line Reorganization. 
Betty Salzberg,Allyn Dimock 1992 Principles of Transaction-Based On-Line Reorganization. VLDB Spatial Databases. Hanan Samet 1992 Spatial Databases. VLDB Database Tuning. Dennis Shasha,Steve Rozen 1992 Database Tuning. VLDB Implementing High Level Active Rules on Top of a Relational DBMS. Eric Simon,Jerry Kiernan,Christophe de Maindreville 1992 Implementing High Level Active Rules on Top of a Relational DBMS. VLDB Entity Modeling in the MLS Relational Model. Kenneth Smith,Marianne Winslett 1992 Entity Modeling in the MLS Relational Model. VLDB Database Management in the Year 2000: Projections and Star Gazing (Panel). Paul G. Sorenson,Felipe Cariño,Jnan R. Dash,Patricia G. Selinger 1992 Database Management in the Year 2000: Projections and Star Gazing (Panel). VLDB A Method for Change Computation in Deductive Databases. Toni Urpí,Antoni Olivé 1992 A Method for Change Computation in Deductive Databases. VLDB Object-Oriented Database Systems. Patrick Valduriez 1992 Object-Oriented Database Systems. VLDB A Performance Study of Alternative Object Faulting and Pointer Swizzling Strategies. Seth J. White,David J. DeWitt 1992 A Performance Study of Alternative Object Faulting and Pointer Swizzling Strategies. VLDB Parallelism in a Main-Memory DBMS: The Performance of PRISMA/DB. Annita N. Wilschut,Jan Flokstra,Peter M. G. Apers 1992 Parallelism in a Main-Memory DBMS: The Performance of PRISMA/DB. VLDB An Efficient Indexing Technique for Full Text Databases. Justin Zobel,Alistair Moffat,Ron Sacks-Davis 1992 An Efficient Indexing Technique for Full Text Databases. SIGMOD Record Building User Interfaces for Database Applications: The O2 Experience. Patrick Borras,Jean-Claude Mamou,Didier Plateau,Bruno Poyet,Didier Tallot 1992 Building User Interfaces for Database Applications: The O2 Experience. SIGMOD Record The Relational Model contra Entity Relationship? H. W. Buff 1992 The Relational Model contra Entity Relationship? SIGMOD Record An Annotated Bibliography on Object-Orientation and Deduction. Stefan Conrad,Martin Gogolla 1992 This note tries to briefly survey research activities and results on the integration of object-oriented concepts and deductive database languages. SIGMOD Record Visualizing Queries and Querying Visualizations. Mariano P. Consens,Isabel F. Cruz,Alberto O. Mendelzon 1992 Visualizing Queries and Querying Visualizations. SIGMOD Record Advanced Capabilities of the Outer Join. Michael M. David 1992 This paper demonstrates that the modeling of complex data structures can be performed easily and naturally in SQL using the direct outer join operation as defined in the proposed ISO-ANSI SQL2 standard. This paper goes on to demonstrate four advanced capabilities that can be implemented by SQL vendors utilizing the data modeling ability of the outer join. These capabilities are: powerful optimization techniques that can dynamically shorten the access path length; intelligent join view updates that utilize the semantics in the data structure being modeled; direct disparate heterogeneous database access that is transparent and efficient; and automatic conversion of multi-table structures into nested relations allowing for more powerful SQL operations. SIGMOD Record Supporting Display Generation for Complex Database Objects. Belinda B. Flynn,David Maier 1992 Supporting Display Generation for Complex Database Objects. SIGMOD Record Locking Protocols for Concurrency Control in Real-time Database Systems. 
Sheung-lun Hung,Kam-yiu Lam 1992 Concurrency control in real-time database systems is complicated by the requirement to maintain database consistency while at the same time minimizing the number of transactions missing their deadlines. The scheduling of data in Basic Two Phase Locking (2PL) completely ignores the urgency of a transaction, and thus the effectiveness of the adopted real-time resource scheduling protocol is greatly reduced. In restart-based locking protocols (R2PL), the same priorities are used for both data and resource scheduling, so fewer transactions should miss their deadlines. However, restart-based protocols suffer from the intrinsic weakness of high restart overhead incurred to ensure the atomicity of transactions. In this paper, based on these weaknesses, a hybrid concurrency control protocol (H2PL) is proposed. A performance study indicates that it performs well under different degrees of deadline constraint and workload as compared with other real-time locking protocols. SIGMOD Record Advanced User Interfaces for Database Systems, Letter from the Special Issue Editor. Yannis E. Ioannidis 1992 Advanced User Interfaces for Database Systems, Letter from the Special Issue Editor. SIGMOD Record Graphical User Interfaces for the Management of Scientific Experiments and Data. Yannis E. Ioannidis,Miron Livny,Eben M. Haber 1992 It is often stated that the three most important factors that determine the success or failure of a database system are performance, performance, performance! The experience of the last twenty years with relational systems has shown that at least one of these three references to performance implies that of end-users when interacting with the system to access data, i.e., user productivity. Although declarative query languages like SQL and QUEL represent major improvements over procedural programming languages like COBOL, the overall consensus is that they are too complex for many users. The need for more intuitive and easier to learn and use interfaces to database systems is always current. SIGMOD Record A Glossary of Temporal Database Concepts. Christian S. Jensen,James Clifford,Shashi K. Gadia,Arie Segev,Richard T. Snodgrass 1992 This glossary contains concepts specific to temporal databases that are well-defined, well understood, and widely used. In addition to defining and naming the concepts, the glossary also explains the decisions made. It lists competing alternatives and discusses the pros and cons of these. It also includes evaluation criteria for the naming of concepts. This paper is a structured presentation of the results of e-mail discussions initiated during the preparation of the first book on temporal databases, Temporal Databases: Theory, Design, and Implementation, published by Benjamin Cummings, to appear January 1993. Independently of the book, an initiative aimed at designing a consensus Temporal SQL is under way. The paper is a contribution towards establishing common terminology, an initial subtask of this initiative. SIGMOD Record Formal Syntax and Semantics of a Reconstructed Relational Database System. Dan Jonsson 1992 Formal Syntax and Semantics of a Reconstructed Relational Database System. SIGMOD Record "Chair's Message." Won Kim 1992 "Chair's Message." SIGMOD Record Moscow ACM SIGMOD Chapter Established. Leonid A. Kalinichenko 1992 Moscow ACM SIGMOD Chapter Established. SIGMOD Record A Complex Benchmark for Logic Programming and Deductive Databases, or Who Can Beat the N-Queens? 
Werner Kießling 1992 The N-queens problem with its long history and inherent complexity is a challenging benchmark target. We present our solution and performance results, hoping that this will stimulate a sort of benchmark competition for tough problems. SIGMOD Record "Chair's Message." Won Kim 1992 "Chair's Message." SIGMOD Record Building Data Representations with FaceKit. Roger King,Michael Novak 1992 Building Data Representations with FaceKit. SIGMOD Record A Supplement to Sampling-Based Methods for Query Size Estimation in a Database System. Yibei Ling,Wei Sun 1992 Sampling-based methods for estimating relation sizes after relational operators such as selections, joins and projections have been intensively studied in recent years. Methods of this type can achieve high estimation accuracy and efficiency. Since the dominating overhead involved in a sampling-based method is the sampling cost, different variants of sampling methods are proposed so as to minimize the sampling percentage (thus reducing the sampling cost) while maintaining the estimation accuracy in terms of the confidence level and relative error (to be precisely defined later in Section 2). In order to determine the minimal sampling percentage, the overall characteristics of the data such as the mean and variance are needed. Currently, the representative sampling-based methods in the literature are based on the assumption that overall characteristics of data are unavailable, and thus a significant amount of effort is dedicated to estimating these characteristics so as to approach the optimal (minimal) sampling percentage. The estimation of these characteristics incurs cost and suffers from estimation error. In this short essay, we point out that the exact values of these characteristics of data can be kept track of in a database system at a negligible overhead. As a result, the minimal sampling percentage while ensuring the specified relative error and confidence level can be precisely determined. SIGMOD Record Semantic Optimization: What are Disjunctive Residues Useful for? Wolfgang L. J. Kowarschick 1992 Residues have been proved to be a very important means for doing semantic optimization. In this paper we will discuss a new kind of residues—the disjunctive residues. It will be shown that they are very useful for performing subformula elimination, if, in addition, a powerful reduction algorithm is available. SIGMOD Record The Gist of GIUKU: Graphical Interactive Intelligent Utilities for Knowledgeable Users of Data Base Systems. Michel Kuntz 1992 Synoptic description of GIUKU: its rationale, its main functionalities, its novel features, a comparison to related work, and a discussion of its current status — a fully implemented prototype available for use. SIGMOD Record A Review of Recent Work on Multi-attribute Access Methods. David B. Lomet 1992 Most database systems provide database designers with single attribute indexing capability via some form of B+tree. Multi-attribute search structures are rare, and are mostly found in systems specialized to some more narrow application area, e.g. geographic databases. The reason is that no multi-attribute search structure has been demonstrated, with high confidence. Multi-attribute search is an active area of research. This paper reviews the state of this field and some of the difficult problems, and surveys some recent notable papers. SIGMOD Record On Global Multidatabase Query Optimization. 
Hongjun Lu,Beng Chin Ooi,Cheng Hian Goh 1992 On Global Multidatabase Query Optimization. SIGMOD Record Functional Completeness in Object-Oriented Databases. Priti Mishra,Margaret H. Eich 1992 A definition of completeness in the context of Object Oriented Databases (OODBs) is proposed in this paper. It takes into account the existence of various categories of functions in OODBs, each of which must be complete in itself. The functionality of an OODB can be divided into sets of related functions. For example, functions needed to perform all schema evolution operations or all version management operations belong in two distinct sets. Further, each set of functions must include all functions needed to perform all operations defined for that set. Thus, for an OODB to be functionally complete, it must support a certain number of sets (or categories) of functions and each such set must be complete in itself. The purpose of this paper is not to give a precise definition of the categories of functions but rather to define a framework within which such categories should be examined. This paper contains a working definition of functional completeness. We would welcome any feedback on our proposal. SIGMOD Record Annotating Answers with Their Properties. Amihai Motro 1992 When responding to queries, humans often volunteer additional information about their answers. Among other things, they may qualify the answer as to its reliability, and they may provide some abstract characterization of the answer. This paper describes a user interface to relational databases that similarly annotates its answers with their properties. The process assumes that various assertions about properties of the data have been stored in the database (meta-information). These assertions are then used to infer properties of each answer provided by the system (meta-answers). Meta-answers are offered to users along with each answer issued, and help them to assess the value and meaning of the information that they receive. SIGMOD Record Database Research at IPSI. Erich J. Neuhold,Volker Turau 1992 At the Integrated Publication and Information Systems Institute (IPSI) of the GMD (Gesellschaft für Mathematik und Datenverarbeitung) database research is focused towards distributed database management systems to support the integration of heterogeneous multi-media information bases needed in an integrated publishing environment. The objective is to investigate advanced object-oriented and active data modelling concepts together with the principles of distributed data stores and data management. The unifying basis for the database research is the object-oriented data model VML developed in the project VODAK over the last four years. The data model is based on recursively defined meta classes, classes and instance hierarchies paired with a strict separation of structural and operational definitions in a polymorphic type system. This report describes the highlights of the work on the VODAK database system, the research efforts in heterogeneous database integration and the research plans of the multimedia database project as well as applications of our system in different environments such as hypertext and office automation. SIGMOD Record Database Research at the Queensland University of Technology. Mike P. Papazoglou,M. McLoughlin,E. Lindsay,Sylvia Willie 1992 Database Research at the Queensland University of Technology. SIGMOD Record An Overview of GOOD. 
Jan Paredaens,Jan Van den Bussche,Marc Andries,Marc Gemis,Marc Gyssens,Inge Thyssens,Dirk Van Gucht,Vijay M. Sarathy,Lawrence V. Saxton 1992 GOOD is an acronym, standing for Graph-Oriented Object Database. GOOD is being developed as a joint research effort of Indiana University and the University of Antwerp. The main thrust behind the project is to indicate general concepts that are fundamental to any graph-oriented database user-interface. GOOD does not restrict its attention to well-considered topics such as ad-hoc query facilities, but wants to cover the full spectrum of database manipulations. The idea of graph-pattern matching as a uniform object manipulation primitive offers a uniform framework in which this can be accomplished. SIGMOD Record Bibliography on Database Security. Günther Pernul,Gottfried Luef 1992 Bibliography on Database Security. SIGMOD Record A Process-Oriented Scientific Database Model. J. Michael Pratt,Maxine S. Cohen 1992 A database model is proposed for organizing data that describes natural processes studied experimentally. Adapting concepts from object-oriented and temporal databases, this process-oriented scientific database model (POSDBM) identifies two data object types (independent and dependent variables) and two types of relationships (becomes-a and affects-a) between data objects. Successive versions of dependent variable objects are associated by the becomes-a relationship, while independent and dependent variable objects are associated by the affects-a relationship. Thus, a process can be viewed as a sequence of states (versions) of a dependent variable object whose attributes are affected over time by independent variable objects. SIGMOD Record Summary of Database Research Activities at The University of Massachusetts, Amherst. Krithi Ramamritham,J. Eliot B. Moss,John A. Stankovic,David W. Stemple,W. Bruce Croft,Donald F. Towsley 1992 At the University of Massachusetts, we have been conducting research in the following database related areas: theoretical support for database system development, database programming languages, flexible concurrency control and transaction management, real-time databases, and information retrieval. The following is a summary of our research in each area. SIGMOD Record SQL/SE - A Query Language Extension for Databases Supporting Schema Evolution. John F. Roddick 1992 The incorporation of a knowledge of time within database systems allows for temporally related information to be modelled more naturally and consistently. Adding this support to the metadatabase further enhances its semantic capability and allows elaborate interrogation of data. This paper presents SQL/SE, an SQL extension capable of handling schema evolution in relational database systems. SIGMOD Record Schema Evolution in Database Systems - An Annotated Bibliography. John F. Roddick 1992 Schema Evolution is the ability of a database system to respond to changes in the real world by allowing the schema to evolve. In many systems this property also implies a retaining of past states of the schema. This latter property is necessary if data recorded during the lifetime of one version of the schema is not to be made obsolete as the schema changes. This annotated bibliography investigates current published research with respect to the handling of changing schemas in database systems. SIGMOD Record A Retrospective on Database Application Development Frameworks. Lawrence A. 
Rowe 1992 Four application framework models developed by the author for database application development systems are described. The key feature of these systems is to provide a model for the definition of high-level objects that represent interface abstractions that can be used to build an application. Structuring application code around interface objects reduces the conceptual distance between the executing program and its specification. At the same time, good programming practices must be supported (e.g., code modularity, reusable components, and information hiding). SIGMOD Record Database Research at CITRI. Ron Sacks-Davis,Kotagiri Ramamohanarao 1992 Database Research at CITRI. SIGMOD Record Conferences on Databases / Calls for Papers. Fèlix Saltor 1992 Conferences on Databases / Calls for Papers. SIGMOD Record Conferences on Databases / Calls for Papers. Fèlix Saltor 1992 Conferences on Databases / Calls for Papers. SIGMOD Record Conferences on Databases / Calls for Papers. Fèlix Saltor 1992 Conferences on Databases / Calls for Papers. SIGMOD Record "Editor's Notes." Arie Segev 1992 "Editor's Notes." SIGMOD Record "Editor's Notes." Arie Segev 1992 "Editor's Notes." SIGMOD Record "Editor's Notes." Arie Segev 1992 "Editor's Notes." SIGMOD Record Pictures from SIGMOD Conference 1991. Frederick N. Springsteel 1992 Pictures from SIGMOD Conference 1991. SIGMOD Record An Overview of Three Commercial Object-Oriented Database Management Systems: ONTOS, ObjectStore, and O2. Valery Soloviev 1992 We present an analysis of three current object-oriented DBMS products: ONTOS, ObjectStore, and O2, as described by their available documentation. The most attractive feature of ONTOS and ObjectStore is their use of C++ as a user interface - a widespread object-oriented language. They also provide persistent data implementation, transaction and recovery mechanisms, and modern application development tool sets following the recommendations of [Atkinson et al. 89]. O2 was chosen for a well-developed data type system and end-user interface, and for its reputation from the literature. SIGMOD Record SIGMOD Innovations Award. 1992 SIGMOD Innovations Award. SIGMOD Record Current Status of R&D in Trusted Database Management Systems. Bhavani M. Thuraisingham 1992 Current Status of R&D in Trusted Database Management Systems. SIGMOD Record Database Research Activities at The University of Vienna. A. Min Tjoa,G. Vinek 1992 Database Research Activities at The University of Vienna. SIGMOD Record Current Research on Real-Time Databases. Özgür Ulusoy 1992 Current Research on Real-Time Databases. SIGMOD Record A Denotational Semantics for the Starburst Production Rule Language. Jennifer Widom 1992 Researchers often complain that the behavior of database production rules is difficult to reason about and understand, due in part to the lack of formal declarative semantics. It has even been claimed that database production rule languages inherently cannot be given declarative semantics, in contrast to, e.g., deductive database rule languages. In this short paper we dispute this claim by giving a denotational semantics for the Starburst database production rule language. SIGMOD Record "EUG'91 Meeting Notes." Peter R. Wilson 1992 The first annual meeting of the EXPRESS Users Group was held in Houston, Texas on 17-18 October 1991. There was a good international audience at the conference, two thirds of whom were from North America with the remainder from several European countries and Japan. 
SIGMOD Record Opportunities from the US Department of Defense and NSF. Marianne Winslett 1992 Opportunities from the US Department of Defense and NSF. SIGMOD Record Opportunities in the US from NSF, DARPA, and NASA. Marianne Winslett 1992 In this issue, we begin with general information about the High Performance Computing and Communications (HPCC) Program at NSF, followed by a more focused look at HPCC work in the database area, and a close-up of the new scientific databases initiative. A NASA program for intelligent systems, DARPA programs in 3-D visualization and multiple knowledge sources, news of recent events at DARPA, and two new small business solicitations round out the funding news for this issue. SIGMOD Record The Winds of Change? Marianne Winslett 1992 In this issue, we bring you coverage of recent events in the US that may have an impact on database funding, especially at NSF. We also announce two upcoming NSF Small Business Innovative Research Conferences and requests for proposals from the Air Force, the Army, and the folks at the Strategic Defense Initiative. A new database/knowledge-base BAA and other bits of news from DARPA round out the funding news for this issue. SIGMOD Record Database Research at La Trobe University. John Zeleznikow 1992 Database Research at La Trobe University. ICDE IsaLog: A declarative language for complex objects with hierarchies. Paolo Atzeni,Luca Cabibbo,Giansalvatore Mecca 1993 The IsaLog model and language are presented. The model has complex objects with classes, relations, and isa hierarchies. The language is strongly typed and declarative. The main issue is the definition of the semantics of the language, given in three different ways that are shown to be equivalent: a model-theoretic semantics, a reduction to logic programming with function symbols, and a fixpoint semantics. Each of the semantics presents new aspects with respect to existing proposals because of the interaction of oid-invention with general isa hierarchies. The solutions are based on explicit Skolem functors, which provide a powerful tool for manipulating object-identifiers ICDE A Query Model for Object-Oriented Databases. Reda Alhajj,M. Erol Arkun 1993 A formal object-oriented query model is described in terms of an object algebra. Both the structure and the behavior of objects are handled. An operand and the output from a query in the object algebra are defined to have a pair of sets, i.e. a set of objects and a set of message expressions, where a message expression is a valid sequence of messages. The closure property is therefore maintained in a natural way. In addition, it is proven that the output from a query has the characteristics of a class; hence, the inheritance relationship between the operand and the output from a query is derived ICDE Integrating Functional and Data Modeling in a Computer Integrated Manufacturing System. Nabil R. Adam,Aryya Gangopadhyay 1993 A structured methodology for linking data modeling with functional modeling in a computer integrated manufacturing system is presented. The target application, the functional and data models, and a method for developing the data model starting from the functional model are described. This approach ensures that the data model is complete and non-redundant with respect to the functional model. A scheme that enables various functions in the functional model to be linked with the data elements of the data model is also presented. 
Such a linkage makes it possible to determine the impact of a change in the functional model on the data model and vice versa ICDE Dynamic Query Optimization in Rdb/VMS. Gennady Antoshenkov 1993 Addresses the key theoretical and practical issues of dynamic query optimization and reviews the underlying reasoning that cements the basic concepts of dynamic query optimization. The optimization mechanics are described, in concert with explanations of why and how certain arrangements contribute to a given optimization goal. Compared to traditional approaches, dynamic query optimization offers a much more realistic view of cost distribution modeling. It is a competition-based architecture that is capable of resolving the major limitations of static optimization, viz. the problems of data skew, cost function instability, and host-variable sensitivity ICDE The O++ Database Programming Language: Implementation and Experience. Rakesh Agrawal,Shaul Dar,Narain H. Gehani 1993 Ode, a database system and environment based on the object paradigm, is discussed. Ode is defined, queried and manipulated using the database programming language O++, which is based on C++. The O++ compiler translates O++ programs into C++ programs that contain calls to the Ode object manager. The implementation of O++, the Ode object manager, and the translation of the database facilities in O++ are described. The problems encountered in the implementation and their resolutions are reviewed ICDE An Access Structure for Generalized Transitive Closure Queries. Rakesh Agrawal,Jerry Kiernan 1993 An access structure that accelerates the processing of generalized transitive closure queries is presented. For an acyclic graph, the access structure consists of source-destination pairs arranged in a topologically sorted order. For a cyclic graph, entries in the structure are approximately topologically sorted. The authors present a breadth-first algorithm for creating such a structure, show how it can be used to process queries, and describe incremental techniques for maintaining it ICDE Adaptive Block Rearrangement. Sedat Akyürek,Kenneth Salem 1993 An adaptive technique for reducing disk seek times is described. The technique copies frequently referenced blocks from their original locations to reserved space near the middle of the disk. Reference frequencies need not be known in advance. Instead, they are estimated by monitoring the stream of arriving requests. Trace-driven simulations show that seek times can be cut substantially by copying only a small number of blocks using this technique. The technique has been implemented by modifying a UNIX device driver. No modifications are required to the file system that uses the driver. ICDE The Gold Mailer. Daniel Barbará,Chris Clifton,Fred Douglis,Hector Garcia-Molina,Stephen Johnson,Ben Kao,Sharad Mehrotra,Jens Tellefsen,Rosemary Walsh 1993 "The Gold Mailer, a system that provides users with an integrated way to send and receive messages using different media, efficiently store and retrieve these messages, and access a variety of sources of other useful information, is described. The mailer solves the problems of information overload, organization of messages and multiple interfaces. By providing good storage and retrieval facilities, it can be used as a powerful information processing engine covering a range of useful office information. 
The Gold Mailer's query language, indexing engine, file organization, data structures, and support of mail message data and multimedia documents are discussed" ICDE A New Algorithm for Computing Joins with Grid Files. Ludger Becker,Klaus Hinrichs,Ulrich Finke 1993 The BR2-directory representation, a directory structure for grid files, and a join algorithm for the evaluation of general n-ary joins on grid files are presented. It is shown that the CPU cost of the join algorithm is successfully reduced by introducing an inner join. A comparison with the hash join algorithm and a join algorithm on k-d trees for equijoins is based on a cost model developed for query processing with grid files. The join algorithm outperforms the hash join, a specialized join method for equijoins, and the join algorithm on k-d trees for equijoins ICDE Title, Message from the General Chairs, Message from the Program Chairs, Committees, Reviewers, Author Index. 1993 Title, Message from the General Chairs, Message from the Program Chairs, Committees, Reviewers, Author Index. ICDE Comparison of Approximations of Complex Objects Used for Approximation-based Query Processing in Spatial Database Systems. Thomas Brinkhoff,Hans-Peter Kriegel,Ralf Schneider 1993 The minimum bounding box, the convex hull, the minimum bounding four- and five-corner, rotated boxes, and the minimum bounding ellipses and circles, all convex conservative approximations for handling complex spatial objects in spatial access methods, are discussed. Results indicate that, depending on the complexity of the objects and the type of queries, the five-corner, ellipse, and rotated bounding box approximations clearly outperform the bounding box. It is the reduced number of false hits that yields a considerable improvement in total query time when using the proposed approximations ICDE A Simple Analysis of the LRU Buffer Policy and Its Relationship to Buffer Warm-Up Transient. Anupam Bhide,Asit Dan,Daniel M. Dias 1993 A simple analysis for the transient buffer hit probability for a system starting with an empty buffer is presented. The independent reference model (IRM) is used for buffer accesses. It is shown that the expected buffer hit probability when the buffer becomes full is virtually identical to the steady state buffer hit probability when the replacement policy is least recently used (LRU). The method is generalized to estimate the transient behavior of the LRU policy starting with a non-empty buffer. It is shown that this method can be used to estimate the effect of a load surge on the buffer hit probability. It is also shown that after a short load surge, it can take much longer than the surge duration for the buffer hit probability to return to its steady state value ICDE A Language Multidatabase System Communication Protocol. Omran A. Bukhres,eva Kühn,Franz Puntigam 1993 Rapid growth in the area of multidatabase systems (MDBSs), which involve both the access of global data and distributed transaction processing, has created a need for programming languages that provide communication reliability and powerful synchronization. The requirements of MDBSs are reviewed, and the Vienna Parallel Logic programming language is presented. The ways to realize an MDBS communication protocol in this language are discussed. The Vienna Parallel Logic language incorporates a concurrent logic language, as well as features of both distributed operating systems and database management systems. 
These features combine to support the communication and synchronization required by distributed transaction processing. The Vienna Parallel Logic language is suitable for use as a general-purpose distributed programming and coordination language ICDE A Bottom-up Query Evaluation Method for Stratified Databases. Yangjun Chen 1993 A labeling algorithm for stratified databases is presented. The algorithm, which is performed prior to the magic-set algorithm, can be used to distinguish the context for constructing magic sets. It is shown that the culprit cycles cause the destratification of a database. Based on this analysis, three subprocedures are developed to remove the different kinds of culprit cycles. The negnumber procedure numbers the different occurrences of a negative literal in a rule. The dynlabel procedure gives each negative body literal a dynamic subscript when it appears in a recursive rule. The label procedure labels each body literal p when there exists a sequence of paths connecting it to a negative body literal ¬q in the same rule, or a sequence of paths with at least one path being negative connecting it to a positive body literal q in the same rule and there is an arc of the form N→r in the sideways information-passing strategy (SIPS) such that q∈N and p=r ICDE Using Active Database Techniques For Real Time Engineering Applications. Aloysius Cornelio,Shamkant B. Navathe 1993 An active database model for representing engineering design, simulation and monitoring applications is described. The physical aspects of these applications are modeled by structural objects. The functions of the application are modeled by functional objects. The interaction between structures and functions is modeled by interaction objects. Events relate structures and functions such that any state change will cause the functional model to compute the new consistent state of the application. Ways to model event correlation, event recall, and ways to define schemas for continuous systems are presented. A parallel computing architecture and a set of guidelines on task distribution that make the model applicable to real-time applications are discussed ICDE Database Technology and Standards: Are we Getting Anywhere? (Panel Abstract). Peter Dadam,Shamkant B. Navathe 1993 Database Technology and Standards: Are we Getting Anywhere? (Panel Abstract). ICDE Database Access Characterization for Buffer Hit Prediction. Asit Dan,Philip S. Yu,Jen-Yao Chung 1993 Presents a database access characterization method that first distinguishes three types of access pattern from a trace: locality within a transaction, random accesses by transactions, and sequential accesses by long queries. The authors describe a concise way to characterize the access skew across the randomly accessed pages by assuming that the large number of data pages may be logically grouped into a small number of partitions, such that the frequency of accessing each page within a partition can be treated as equal. They present an extensive validation of the buffer hit predictions, both for single-node as well as multiple-node systems, based on access characterization using production database traces. This approach can be applied to predict the buffer hit probability of a composite workload from those of its component files ICDE Polyglot: Extensions to Relational Databases for Sharable Types and Functions in a Multi-Language Environment. Linda G. DeMichiel,Donald D. Chamberlin,Bruce G. 
Lindsay,Rakesh Agrawal,Manish Arya 1993 Polyglot is an extensible relational database type system that supports inheritance, encapsulation, and dynamic method dispatch. It allows use from multiple application languages and permits objects to retain their behavior as they cross the boundary between database and application program. The authors describe the design of Polyglot, extensions to the structured query language (SQL) to support the use of Polyglot types and methods, and the implementation of Polyglot in the Starburst relational database system ICDE Valid-time Indeterminacy. Curtis E. Dyreson,Richard T. Snodgrass 1993 Valid-time Indeterminacy. ICDE The Design, Implementation, and Evaluation of an Object-Based Sharing Mechanism for Federated Database Systems. Doug Fang,Shahram Ghandeharizadeh,Dennis McLeod,Antonio Si 1993 An approach and mechanism to support the sharing of objects are described, an experimental implementation is presented, and the performance of the system is analyzed and evaluated. The mechanism is based on a core set of constructs that characterize object-based database systems. The approach provides a basis for controlled sharing in a heterogeneous database environment, using a kernel object-base model as an intercomponent exchange forum. A major goal is to make the importation of nonlocal information as transparent to a component as possible ICDE Voltaire: A Database Programming Language with a Single Execution Model for Evaluating Queries, Constraints and Functions. Sunit K. Gala,Shamkant B. Navathe,Manuel E. Bermudez 1993 Voltaire: A Database Programming Language with a Single Execution Model for Evaluating Queries, Constraints and Functions. ICDE Deriving Integrity Maintaining Triggers from Transition Graphs. Michael Gertz,Udo W. Lipeck 1993 Methods for deriving constraint maintaining triggers from dynamic integrity constraints represented by transition graphs are presented. The methods reduce integrity monitoring to checking changing static conditions according to life cycle situations. Thus, triggers have to be generated from these graphs, which depend not only on the operations that have occurred in a transaction, but also on the situations that have been reached by the objects mentioned in the constraints. The techniques presented work for dynamic constraints and their corresponding transition graphs as well as for simple static constraints. Only passive reactions (rollbacks) to constraint violations are provided by the trigger patterns, but the systematic generation of such patterns should help the database designer in identifying possible active reactions for repairing constraint violations ICDE Audio/Video Databases: An Object-Oriented Approach. Simon J. Gibbs,Christian Breiteneder,Dennis Tsichritzis 1993 The notion of an audio/video (AV) database is introduced. An AV database is a collection of AV values such as digital audio and video data and AV activities such as interconnectable components used to process AV values. Two abstraction mechanisms, temporal composition and flow composition, allow the aggregation of AV values and AV activities respectively. An object-oriented framework, incorporating an AV data model and prescribing AV database/application interaction, is described ICDE A Descriptive Semantic Formalism for Medicine. Carole A. Goble,Andrzej J. Glowinski,W. A. Nowlan,Alan L. 
Rector 1993 "It is argued that current clinical information systems incorporate oversimplistic, prescriptive data models that are not faithful to clinicians' observations. A non-prescriptive descriptive semantic formalism, Structured Meta Knowledge (SMK), which unifies a terminological knowledge base with controlled assertional capabilities with the medical record and supports the semantic control necessitated by such an approach, is proposed. The three-layer model of categories, individuals, and occurrences described is more appropriate to medical applications than the two layers of classes and instances. The application of SMK in predictive data entry is considered" ICDE Definition and Application of Metaclasses in an Object-Oriented Database Model. Jutta Göers,Andreas Heuer 1993 The metalevel concepts for an object-oriented database model are presented. Usually, systems that include a metaclass concept are either only implicitly supporting the management of meta information, which restricts the application of this concept, or explicitly giving unrestricted access to manipulate metaclasses, which results in inconsistent states of the system. To make the explicit support more system-controlled, the metascheme is partitioned into two parts: the system view and the application view. Possible scheme level operations, the representation of methods and their implementation as a special kind of meta information, and a simple extension to the query algebra of the underlying object-oriented database system that allow users to query objects and metaobjects within a single algebra expression are described ICDE The Volcano Optimizer Generator: Extensibility and Efficient Search. Goetz Graefe,William J. McKenna 1993 The Volcano project, which provides efficient, extensible tools for query and request processing, particularly for object-oriented and scientific database systems, is reviewed. In particular, one of its tools, the optimizer generator, is discussed. The data model, logical algebra, physical algebra, and optimization rules are translated by the optimizer generator into optimizer source code. It is shown that, compared with the EXODUS optimizer generator prototype, the search engine of the Volcano optimizer generator is more extensible and powerful. It provides effective support for non-trivial cost models and for physical properties such as sorting order. At the same time, it is much more efficient, as it combines dynamic programming with goal-directed searching and branch-and-bound pruning. Compared with other rule-based optimization systems, it provides complete data model independence and more natural extensibility ICDE Efficient Computation of Spatial Joins. Oliver Günther 1993 Spatial joins are join operations that involve spatial data types and operators. Due to basic properties of spatial data, many conventional join strategies suffer serious performance penalties or are not applicable at all. The join strategies known from conventional databases that can be applied to spatial joins and the ways in which some of these techniques can be modified to be more efficient in the context of spatial data are discussed. A class of tree structures, called generalization trees, that can be applied efficiently to compute spatial joins in a hierarchical manner are described. The performances of the most promising strategies are analytically modeled and compared ICDE Normalization of Linear Recursions in Deductive Databases. 
Jiawei Han,Kangsheng Zeng,Tong Lu 1993 A graph-matrix expansion-based compilation technique that transforms complex linear recursions into highly regular linear normal forms (LNFs) is introduced. A variable connection graph-matrix, the V-matrix, is constructed to simulate the expansions of a linear recursion and discover its expansion regularity. Based on the expansion regularity, a linear recursion can be normalized into an LNF. The normalization of linear recursions not only captures the bindings that are difficult to be captured otherwise but also facilitates the development of powerful query analysis and evaluation techniques for complex linear recursions in deductive databases ICDE Data fragmentation for parallel transitive closure strategies. Maurice A. W. Houtsma,Peter M. G. Apers,Gideon L. V. Schipper 1993 Addresses the problem of fragmenting a relation to make the parallel computation of the transitive closure efficient, based on the disconnection set approach. To better understand this design problem, the authors focus on transportation networks. These are characterized by loosely interconnected clusters of nodes with a high internal connectivity rate. Three requirements that have to be fulfilled by a fragmentation are formulated, and three different fragmentation strategies are presented, each emphasizing one of these requirements. Some test results are presented to show the performance of the various fragmentation strategies ICDE Efficient Evaluation of Traversal Recursive Queries Using Connectivity Index. Kien A. Hua,Jeffrey X. W. Su,Chau M. Hua 1993 Introduces the connectivity index, an access structure for the efficient evaluation of traversal-recursive queries. Unlike conventional bottom-up evaluation techniques that require the creation of temporary files and scanning of the relations many times in computing the relational operators, the new access strategy requires only a single-pass scan of the index file. The proposed scheme is illustrated using examples. Algorithms for the maintenance of the index structure are presented ICDE A Competitive Dynamic Data Replication Algorithm. Yixiu Huang,Ouri Wolfson 1993 A distributed algorithm for dynamic data replication of an object in a distributed system is presented. The algorithm changes the number of replicas and their location in the distributed system to optimize the amount of communication. The algorithm dynamically adapts the replication scheme of an object to the pattern of read-write requests in the distributed system. It is shown that the cost of the algorithm is within a constant factor of the lower bound ICDE Performance Characteristics of Epsilon Serializability with Hierarchical Inconsistency Bounds. Mohan Kamath,Krithi Ramamritham 1993 Epsilon serializability (ESR) is a weaker form of correctness designed to provide more concurrency than classic serializability (SR) by allowing, for example, query transactions to view inconsistent data in a controlled fashion, i.e., limiting the inconsistency within the specified bounds. In the previous literature on ESR, inconsistency bounds have been specified with respect to transactions or with respect to objects. In this paper, we introduce the notion of hierarchical inconsistency bounds that allows inconsistency to be specified at different granularities. The motivation for this comes from the way data is usually organized in hierarchical groups, based on some common features and interrelationships. 
Bounds on transactions are specified at the top of the hierarchy, while bounds on the objects are specified at the bottom and on groups in between. We also discuss mechanisms needed to control the inconsistency so that it lies within the specified bounds. While executing a transaction, the system checks for possible violation of inconsistency bounds bottom up, starting with the object level and ending with the transaction level. Thus far, to our knowledge, no work has been done to determine the quantitative performance improvement resulting from ESR. Hence in this paper we report on an evaluation of the performance improvement due to ESR incorporating hierarchical inconsistency bounds. The tests were performed on a prototype transaction processing system that uses timestamp-based concurrency control. For simplicity, our implementation uses a two-level hierarchy for inconsistency specification: the transaction level and the object level. We present the results of our performance tests and discuss how the behavior of the system is influenced by the transaction and object level inconsistency bounds. We make two important observations from the tests. First, the thrashing point shifts to a higher multiprogramming level when transaction inconsistency bounds are increased. Further, for a particular multiprogramming level and a particular transaction inconsistency bound, the throughput does not increase with increasing object inconsistency bounds but peaks at some intermediate value. ICDE A Framework for Declarative Updates and Constraint Maintenance in Object-Oriented Databases. Anton P. Karadimce,Susan Darling Urban 1993 A framework for supporting ad-hoc declarative update requests in an object-oriented database (OODB) while maintaining database consistency and atomicity of update requests is described. The framework is based on the emulation of classic update methods in an OODB by a controlled, active, and user-transparent interaction between a predefined set of elementary updates and a set of integrity methods designed to maintain database consistency upon violations of integrity constraints. Given an object-oriented data model and a declarative query language, this framework is extended by isolating declaratively stated integrity constraints as a separate concept, developing a high-level update language on top of the query language, and developing active integrity methods from the integrity constraints. The advantage of this approach is that users can freely pose declarative ad-hoc updates without jeopardizing database consistency ICDE Unification of Temporal Data Models. Christian S. Jensen,Michael D. Soo,Richard T. Snodgrass 1993 A conceptual temporal data model that captures the time-dependent semantics of data while permitting multiple data models at the representation level is described. A conceptual notion of a bitemporal relation in which tuples are stamped with sets of two-dimensional chronons in transaction-time/valid-time space is defined. A tuple-timestamped first normal form representation is introduced to show how the conceptual bitemporal data model is related, by means of snapshot equivalence, with representational models. Querying within the two-level framework is discussed.
An algebra is defined at the conceptual level and mapped to the sample representational model in such a way that new operators compute equivalent results for different representations of the same conceptual bitemporal relation ICDE Adaptable Pointer Swizzling Strategies in Object Bases. Alfons Kemper,Donald Kossmann 1993 Four different approaches to optimizing the access to main-memory-resident persistent objects, techniques commonly referred to as pointer swizzling, are classified and evaluated. To speed up the access along inter-object references, the persistent pointers are transformed (swizzled) into main memory pointers (addresses). The pointer swizzling techniques allow the displacement of objects from the buffer before the end of an application, and the authors contrast them with the performance of an object manager using no pointer swizzling. The results of the quantitative evaluation prove that there is no one superior strategy for all application profiles. An adaptable system that uses the full range of pointer swizzling strategies is presented ICDE Post-crash Log Processing for Fuzzy Checkpointing Main Memory Databases. Xi Li,Margaret H. Eich 1993 The impact of updating policy and access pattern on the performance of post-crash log processing with a fuzzy checkpointing main memory database (MMDB) is discussed. The problem of restoring the database to a consistent state and several algorithms for post-crash log processing under the various updating alternatives are reviewed. Using an analytic model, the checkpoint behavior and post-crash log processing performance of these algorithms are examined. Analytic results show that deferred updating always takes less time to process the log after a crash ICDE Deterministic Semantics of Set-Oriented Update Sequences. Christian Laasch,Marc H. Scholl 1993 An iterator is proposed that allows sequences of update operations to be applied in a set-oriented way with deterministic semantics. Because the mechanism is independent of a particular model, it can be used in the relational model as well as in object-oriented ones. Thus, the deterministic semantics of embedded structured query language (SQL) cursors and of triggers that are applied after (set-oriented) SQL updates can be checked. The iterator can be used to apply object-oriented methods, which are usually update sequences defined on a single object, to sets in a deterministic way ICDE Efficient Support of Historical Queries for Multiple Lines of Evolution. Gad M. Landau,Jeanette P. Schmidt,Vassilis J. Tsotras 1993 A general framework for solving multiple-line history queries is presented. The authors address two important historical queries in this environment: the vertical query and the horizontal query. The vertical query enables a design team to find what its design was at a past instant on its own path of evolution, while the horizontal query provides a design team with the designs of relevant teams at concurrent times in the past ICDE Updating Intensional Predicates in Deductive Databases. Dominique Laurent,Viet Phan Luong,Nicolas Spyratos 1993 A method for updating deductive databases that allows the insertion or deletion of a fact over intensional predicates in a deterministic manner is presented. It is shown that, contrary to most other approaches, inserting or deleting facts over intensional predicates can always be accomplished without having to make choices.
The approach relies on well-founded semantics and on the following two basic principles: deleted facts are explicitly stored in the database, and the inserted and deleted facts may concern any predicate, not just extensional predicates ICDE Representation and Querying in Temporal Databases: the Power of Temporal Constraints. Manolis Koubarakis 1993 A temporal database model capable of representing absolute, relative, imprecise, and infinite temporal data is proposed. The model is based on temporal tables, i.e., relation-like representations that can contain variables constrained by the formulas of a temporal theory. An algebraic query language for temporal tables is defined, and some problems related to query answering are discussed ICDE Workload Balance and Page Access Scheduling For Parallel Joins In Shared-Nothing Systems. Chiang Lee,Zue-An Chang 1993 A methodology to resolve balancing and scheduling issues for parallel join execution in a shared-nothing multiprocessor environment is presented. In the past, research on parallel join methods focused on the design of algorithms for partitioning relations and distributing data buckets as evenly as possible to the processors. Once data are uniformly distributed to the processors, it is assumed that all processors will complete their tasks at about the same time. The authors stress that this is true only if no further information, such as a page-level join index, is available. Otherwise, the join execution can be further optimized and the workload in the processors may still be unbalanced. The authors study these problems in a shared-nothing environment ICDE Entity Identification in Database Integration. Ee-Peng Lim,Jaideep Srivastava,Satya Prabhakar,James Richardson 1993 The objective of entity identification is to determine the correspondence between object instances from more than one database. Entity identification at the instance level, assuming that schema level heterogeneity has been resolved a priori, is examined. Soundness and completeness are defined as the desired properties of any entity identification technique. To achieve soundness, a set of identity and distinctness rules is established for entities in the integrated world. The use of an extended key, which is the union of keys, and possibly other attributes, from the relations to be matched, and its corresponding identity rule are proposed to determine the equivalence between tuples from relations which may not share any common key. Instance level functional dependencies (ILFD), a form of semantic constraint information about the real-world entities, are used to derive the missing extended key attribute values of a tuple ICDE Automating Fine Concurrency Control in Object-Oriented Databases. Carmelo Malta,José Martinez 1993 Four major problems that complicate read and write accesses to instances in object-oriented databases are discussed. The four problems are: difficulty in providing ad hoc commutativity relations, locking overhead, lock escalation, and pseudo-conflicts. It is shown that all of these problems can be solved by providing a simple form of commutativity. This kind of commutativity is syntactically extracted from the source code of the methods at compile-time. Then, an efficient linear algorithm calculates the transitive access vectors. Finally, transitive access vectors are translated into classical access modes in order not to incur a performance penalty at run-time.
Related work on access vectors and field locking is reviewed ICDE Object Queries over Relational Databases: Language, Implementation, and Applications. Victor M. Markowitz,Arie Shoshani 1993 A query language called the Concise Object Query Language (COQL) is described. COQL is unique in its conciseness, in its support of inheritance, and in the capabilities it provides for defining application-specific structures. The COQL-to-SQL translation, its implementation on top of a commercial relational DBMS, and the ways in which COQL can be used for constructing application-specific views for scientific applications are discussed. The typical three-level architecture approach for supporting data management applications and previous work on the translation of extended entity-relationship schemas into relational database management system schemas are reviewed ICDE Feature-Based Retrieval of Similar Shapes. Rajiv Mehrotra,James E. Gary 1993 A shape similarity-based retrieval method for image databases that supports a variety of queries is proposed. It is flexible with respect to the choice of feature and definition of similarity and is implementable using existing multidimensional point access methods. A prototype system that handles the problems of distortion and occlusion is described. Experiments with one specific point access method (PAM) are presented ICDE Batch Scheduling in Parallel Database Systems. Manish Mehta,Valery Soloviev,David J. DeWitt 1993 Many techniques for query scheduling in a parallel database system schedule a single query at a time. The scheduling of queries for parallel database systems by dividing the workload into batches is investigated. Scheduling algorithms that exploit the common operations within the queries in a batch are proposed. The performance of the proposed algorithms is studied using a simple analytical model and a detailed simulation model. It is shown that batch scheduling can provide significant savings compared to single-query scheduling for a variety of system and workload parameters ICDE Construction of a Relational Front-end for Object-Oriented Database Systems. Weiyi Meng,Clement T. Yu,Won Kim,Gaoming Wang,Tracy Pham,Son Dao 1993 Proposes a solution for the construction of a relational front-end for object-oriented database systems (OODBs). Rules are provided to transform the structural part of an OODB scheme to an equivalent relational scheme to provide relational users with a relational view of the OODB scheme. A mechanism based on a relational predicate graph and an OODB predicate graph is provided to translate relational queries to OODB queries to allow relational users access to data stored in an OODB database system ICDE SQL/XNF - Processing Composite Objects as Abstractions over Relational Data. Bernhard Mitschang,Hamid Pirahesh,Peter Pistor,Bruce G. Lindsay,Norbert Südkamp 1993 SQL/XNF - Processing Composite Objects as Abstractions over Relational Data. ICDE Towards More Flexible Schema Management in Object Bases. Guido Moerkotte,Andreas Zachmann 1993 An approach to database schema management is presented that allows easy tailoring of schema management, high-level specification of schema consistency, and development of advanced tools supporting the user during schema evolution. The application of this approach to the development of a simple schema manager for the core of the GOM database programming language is described.
The flexibility afforded both developers and users by the approach is also discussed ICDE ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators. C. Mohan 1993 The algorithm for recovery and isolation exploiting semantics for linear hashing with separators (ARIES/LHS), which controls concurrent operations on storage structures by different users, is presented. The algorithm uses fine-granularity locking, guarantees serializability, and prevents rolling back transactions from getting involved in deadlocks ICDE Algorithms for the Management of Remote Backup Data Bases for Disaster Recovery. C. Mohan,Kent Treiber,Ron Obermarck 1993 A method for managing a remote backup database to provide protection from disasters that destroy the primary database is presented. The method is general enough to accommodate the ARIES-type recovery and concurrency control methods as well as the methods used by other systems such as DB2, DL/I and IMS Fast Path. It provides high performance by exploiting parallelism and by reducing inputs and outputs using different means, like log analysis and choosing a different buffer management policy from the primary one. Techniques are proposed for checkpointing the state of the backup system so that recovery can be performed quickly in case the backup system fails, and for allowing new transaction activity to begin even as the backup is taking over after a primary failure. Some performance measurements taken from a prototype are also presented ICDE The Correctness of Concurrency Control for Multiversion Database Systems with Limited Number of Versions. Tadeusz Morzy 1993 The concurrency control problem for multiversion database systems (MVDBSs) with system-imposed upper bounds on the total number of data item versions stored in the database is considered. Concurrency control theory for MVDBSs is reviewed. The inadequacy of this theory for analyzing concurrency control algorithms for k-version database systems (KVDBSs) is demonstrated. A formal concurrency control theory for KVDBS is presented. It is developed in terms of KV schedules. The relationships among mono-, multi-, and KV schedules are summarized ICDE Semantic Concurrency Control in Object-Oriented Database Systems. Peter Muth,Thomas C. Rakow,Gerhard Weikum,Peter Brössler,Christof Hasse 1993 A locking protocol for object-oriented database systems (OODBSs) is presented. The protocol can exploit the semantics of methods invoked on encapsulated objects. Compared to conventional page-oriented or record-oriented concurrency control protocols, the proposed protocol greatly improves the possible concurrency because commutative method executions on the same object are not considered a conflict. An OODBS application example is presented. The principle of open-nested transactions is reviewed. It is shown that, using the locking protocol in an open-nested transaction, the locks of a subtransaction are released when the subtransaction completes, and only a semantic lock is held further by the parent of the subtransaction ICDE Sampling from Spatial Databases. Frank Olken,Doron Rotem 1993 Techniques for obtaining random point samples from spatial databases are described. Random points are sought from a continuous domain that satisfy a spatial predicate which is represented in the database as a collection of polygons. Several applications of spatial sampling are described.
Sampling problems are characterized in terms of two key parameters: coverage (selectivity) and expected stabbing number (overlap). Two fundamental approaches to sampling with spatial predicates, depending on whether one samples first or evaluates the predicate first, are discussed. The approaches are described in the context of both quadtrees and R-trees, detailing the sample-first, A/R-tree, and partial area tree algorithms. A sequential algorithm, the one-pass spatial reservoir algorithm, is also described ICDE An Object-Oriented View Onto Public, Heterogeneous Text Databases. Andreas Paepcke 1993 Even though companies maintain highly-structured traditional business data in relational databases, large amounts of information are available in semi-structured text sources such as indexed online newspapers, patent information, literature citations, or business profiles. This information is offered by commercial providers who maintain complete control over access language, schemas and update capabilities. One way to unify access to all of this material is to make it look like a collection of objects in an object-oriented database. Such a view has been prototyped on an information service that provides some 400 full-text, bibliographic and numeric databases. The authors explain how the illusion of object-orientedness is put together and how it is maintained in queries. They also show how the object-oriented approach is used to handle some classes of schema heterogeneity ICDE JazzMatch: Fine-Grained Parallel Matching for Large Rule Sets. Marco Richeldi,Jack Tan 1993 JazzMatch, a parallel matching algorithm explicitly designed for secondary memory-based production systems, is presented. JazzMatch is a state-saving algorithm that performs incremental match. It exploits extremely fine-grained parallelism and optimizes the storage state by permitting the sharing of common conditions in the rules. JazzMatch is supported by a message-passing parallel architecture. A cost and performance analysis of JazzMatch is provided, and the results are compared with those for other schemes in the current literature ICDE SLEVE: Semantic Locking for EVEnt synchronisation. Andrea H. Skarra 1993 SLEVE: Semantic Locking for EVEnt synchronisation. ICDE Path Computation Algorithms for Advanced Traveller Information System (ATIS). Shashi Shekhar,Ashim Kohli,Mark Coyle 1993 "Three path-planning algorithms for single-pair path computation are evaluated. These algorithms are the iterative breadth-first search, Dijkstra's single-source path-planning algorithm, and the A* single-path planning algorithm. The performance of the algorithms is evaluated on graphs representing the roadmap of Minneapolis. In order to get an insight into their relative performance, synthetic grid maps are used as a benchmark computation. The effects of two parameters, namely path length and edge-cost-distribution, on the performance of the algorithms are examined. The effects of implementation decisions on the performance of the A* algorithm are discussed. The main hypothesis is that estimator functions can improve the average-case performance of single-pair path computation when the length of the path is small compared to the diameter of the graph. This hypothesis is examined using experimental studies and analytical cost modeling" ICDE Two-Phase Commit Optimizations and Tradeoffs in the Commercial Environment. George Samaras,Kathryn Britton,Andrew Citron,C.
Mohan 1993 Eleven two-phase commit (2PC) protocol variations that optimize towards the normal case are described and compared with a baseline 2PC protocol. Environments in which they are most effective are discussed. The variations are compared and contrasted in terms of number of message flows, number of log writes (both forced and non-forced), probability of heuristic damage, how damage is reported, and other tradeoffs ICDE Algebraic Foundation and Optimization for Object Based Query Languages. Vijay M. Sarathy,Lawrence V. Saxton,Dirk Van Gucht 1993 The Tarski algebra, an algebraic foundation for object-based query languages, is presented. While maintaining physical data independence, the Tarski algebra is shown to be both simple and powerful enough to express all reasonable queries. It is shown how queries expressed in a graph-oriented query language (based on the functional data model) can be translated into the Tarski algebra. The graphical representation of queries in combination with the Tarski algebra is shown to be a convenient mechanism for effective query optimization ICDE Transaction Support in a Log-Structured File System. Margo I. Seltzer 1993 The design and implementation of a transaction manager embedded in a log-structured file system are described. Measurements indicate that transaction support on a log-structured file system offers a 10% performance improvement over transaction support on a conventional read-optimized file system. When the transaction manager is embedded in the log-structured file system, the resulting performance is comparable to that of a more traditional, user-level system. The performance results also indicate that embedding transactions in the file system need not impact the performance of nontransaction applications ICDE Can OODB Technology Solve CAD Design Data Management Problems? (Panel Abstract). Anoop Singhal 1993 Can OODB Technology Solve CAD Design Data Management Problems? (Panel Abstract). ICDE Recursive Functions in Iris. Philippe De Smedt,Stefano Ceri,Marie-Anne Neimat,Ming-Chien Shan,Rafi Ahmed 1993 A complete and efficient implementation of linear, one-sided recursive queries in Iris, an object-oriented database management system, is described. It is shown that recursion can be easily and efficiently added to a large class of existing database management systems. A B-tree type access path called the B++ tree that has been implemented to support the computation of recursive functions in Iris is also described. The performance of B++ trees is reviewed ICDE A Truncating Hash Algorithm for Processing Band-Join Queries. Valery Soloviev 1993 The truncating-hash band join algorithm for evaluating band joins is described. This algorithm is based on the idea of truncating join attribute values in order to execute band joins in a way similar to hash join algorithms for equijoins. Unlike previously proposed algorithms for band joins, it does not sort either of the input relations during its execution. A comparison between the truncating-hash band join algorithm and previous algorithms for band joins using an analytical model is presented. The model is also used to evaluate the band join algorithms for a parallel implementation on a shared-nothing multiprocessor system. The results show that the truncating-hash band join algorithm outperforms the other band join algorithms because of a significantly lower CPU cost ICDE Model-Based Design Bases for Task-Oriented Systems.
Christian Stary 1993 The nature of task-oriented application design and the rationale in building interactive design support tools are discussed. A model-based approach for the representation of design knowledge is introduced. The approach comprises an end-user task model, a problem domain data model, and an interaction domain model. These models are mapped onto an object hierarchy, allowing the construction of a design knowledge base with reusable objects ICDE Are We Polishing a Round Ball? (Panel Abstract). Michael Stonebraker 1993 Are We Polishing a Round Ball? (Panel Abstract). ICDE Large Object Support in POSTGRES. Michael Stonebraker,Michael A. Olson 1993 This paper presents four implementations for support of large objects in POSTGRES. The four implementations offer varying levels of functionality, and the support for user-defined storage managers available in POSTGRES is also detailed. The performance of all four large object implementations on two different storage devices is presented. ICDE Execution of Extended Multidatabase SQL. L. Suardi,Marek Rusinkiewicz,Witold Litwin 1993 The multidatabase structured query language (MSQL) is an extension of the SQL query language that provides new functions for nonprocedural manipulation of data in different and mutually non-integrated relational databases. The problems introduced by these new functions are discussed, and the semantics of multiple updates, global commitment, and rollback are analyzed. New language constructs are developed to allow declarative specification of multidatabase transactions. The design and implementation of an environment for the execution of extended MSQL queries in a heterogeneous multidatabase environment are also discussed ICDE A Polynomial Time Algorithm for Optimizing Join Queries. Arun N. Swami,Balakrishna R. Iyer 1993 The dynamic programming algorithm for query optimization has exponential complexity. An alternative polynomial time algorithm, the IK-KBZ algorithm, is severely limited in the queries it can optimize. Other algorithms have been proposed, including the greedy algorithm, iterative improvement, and simulated annealing. The AB algorithm, which combines randomization and neighborhood search with the IK-KBZ algorithm, is presented. The AB algorithm is much more generally applicable than IK-KBZ, has polynomial time and space complexity, and produces near-optimal plans in the space of outer linear join trees. On average, it does better than the other algorithms that, unlike dynamic programming, do not perform an exhaustive search ICDE Continuous Backup Systems Utilizing Flash Memory. Hiroki Takakura,Yahiko Kambayashi 1993 A continuous-backup mechanism for main-memory databases that transmits data to archive storage during utilization of main memory without any software assistance is described. It is suggested that flash memory-based storage can improve the efficiency of conventional disk systems since it can realize faster read and write operations. As sequential access is performed by a series of direct accesses, the overhead caused by scheduling to utilize sequential access is not required. One serious drawback of flash memory is the limit on the number of rewrite operations. Mechanisms that have a five-year lifetime have been developed using existing technology. Results of a performance evaluation of the backup system are presented ICDE An Evaluation of Physical Disk I/Os for Complex Object Processing. Wouter B. Teeuw,Christian Rich,Marc H. Scholl,Henk M.
Blanken 1993 In order to obtain the performance required for nonstandard database environments, a hierarchical complex object model with object references is used as a storage structure for complex objects. Several storage models for these complex objects, as well as a benchmark to evaluate their performance, are described. A cost model for analytical performance evaluation is developed, and the analytical results are validated by means of measurements on the DASDBS complex object storage system. The results show which storage structures for complex objects are the most efficient under which circumstances ICDE The Efficient Computation of Strong Partial Transitive-Closures. Ismail H. Toroslu,Ghassan Z. Qadah 1993 "The development of efficient algorithms to process the different forms of transitive-closure queries within the context of large database systems has attracted a large volume of research efforts. The authors present a new algorithm that is suitable for processing one of these forms, the strong partially instantiated query, in which one of the query's arguments is instantiated to a set of constants. The processing of this algorithm yields a set of tuples that draw their values from both of the query's instantiated and uninstantiated arguments. This algorithm avoids the redundant computations and the high storage costs found in a number of similar algorithms" ICDE On Modularity for Conceptual Data Models and the Consequences for Subtyping, Inheritance & Overriding. Olga De Troyer,René Janssen 1993 On Modularity for Conceptual Data Models and the Consequences for Subtyping, Inheritance & Overriding. ICDE Computation, Information, Communication, Imagination (Abstract). Dennis Tsichritzis 1993 Computation, Information, Communication, Imagination (Abstract). ICDE Parallel Database Systems: the case for shared-something. Patrick Valduriez 1993 Parallel database systems are becoming the primary application of multiprocessor computers. The reason for this is that they can provide high-performance and high-availability database support at a much lower price than do equivalent mainframe computers. The traditional shared-memory, shared-disk, and shared-nothing architectures of parallel database systems are compared, based on the following dimensions: simplicity, cost, performance, availability and extensibility. Based on these comparisons, the case is made for the shared-something architecture, which can provide a better trade-off between the various objectives ICDE Data Quality Requirements Analysis and Modeling. Richard Y. Wang,Henry B. Kon,Stuart E. Madnick 1993 A set of premises, terms, and definitions for data quality management is established, and a step-by-step methodology for defining and documenting data quality parameters important to users is developed. These quality parameters are used to determine quality indicators about the data manufacturing process, such as data source, creation time, and collection method, that are tagged to data items. Given such tags, and the ability to query over them, users can filter out data having undesirable characteristics. The methodology provides a concrete approach to data quality requirements collection and documentation. It demonstrates that data quality can be an integral part of the database design process. A perspective on the migration towards quality management of data in a database environment is given ICDE Dynamic Finite Versioning: An Effective Versioning Approach to Concurrent Transaction and Query Processing. Kun-Lung Wu,Philip S.
Yu,Ming-Syan Chen 1993 Dynamic finite versioning (DFV) schemes that effectively support concurrent processing of transactions and queries are presented. Without acquiring locks, queries read from a small, fixed number of dynamically derived, transaction-consistent, possibly slightly obsolete, logical snapshots of the database. On the other hand, transactions access the most up-to-date data in the database without data contention from queries. Intermediate versions created between snapshots are automatically discarded. Dirty pages updated by active transactions are allowed to be written back into the database before commitment and, at the same time, consistent logical snapshots can be advanced automatically without quiescing the ongoing transactions or queries ICDE On Updates and Inconsistency Repairing in Knowledge Bases. Beat Wüthrich 1993 A technique to compute a solution for a given update and a given knowledge base is presented. The salient features of this approach are: the problem is tackled in a general way, and a technique for repairing inconsistency is provided; the user can interact with the system, actively influence the solution to be generated, and is not forced to generate all possible minimal solutions from which the user finally draws the most preferred one; solutions are obtained by set-oriented fact processing rather than by single fact accesses to the old knowledge base; and, in contrast to the other proposed techniques, the consistency of the old knowledge base is exploited when generating a solution SIGMOD Conference Database System Issues in Nomadic Computing. Rafael Alonso,Henry F. Korth 1993 Mobile computers and wireless networks are emerging technologies that will soon be available to a wide variety of computer users. Unlike earlier generations of laptop computers, the new generation of mobile computers can be an integrated part of a distributed computing environment, one in which users change physical location frequently. The result is a new computing paradigm, nomadic computing. This paradigm will affect the design of much of our current systems software, including that of database systems. This paper discusses in some detail the impact of nomadic computing on a number of traditional database system concepts. In particular, we point out how the reliance on short-lived batteries changes the cost assumptions underlying query processing. In these systems, power consumption competes with resource utilization in the definition of cost metrics. We also discuss how the likelihood of temporary disconnection forces consideration of alternative transaction processing protocols. The limited screen space of mobile computers along with the advent of pen-based computing provides new opportunities and new constraints on database interfaces and languages. Lastly, we believe that the movement of computers and data among networks potentially belonging to distinct, autonomous organizations creates serious security problems. SIGMOD Conference A New Perspective on Rule Support for Object-Oriented Databases. Eman Anwar,L. Maugis,Sharma Chakravarthy 1993 This paper proposes a new approach for supporting reactive capability in an object-oriented database. We introduce an event interface, which extends the conventional object semantics to include the role of an event generator. This interface provides a basis for the specification of events spanning sets of objects, possibly from different classes, and detection of primitive and complex events.
This approach clearly separates event detection from rules. New rules can be added and can use existing objects, enabling objects to react to their own changes as well as to the changes of other objects. We use a runtime subscription mechanism between rules and objects to selectively monitor particular objects dynamically. This elegantly supports class-level as well as instance-level rules. Both events and rules are treated as first-class objects. SIGMOD Conference SIMS: Retrieving and Integrating Information From Multiple Sources. Yigal Arens,Craig A. Knoblock 1993 SIMS: Retrieving and Integrating Information From Multiple Sources. SIGMOD Conference Methods and Rules. Serge Abiteboul,Georg Lausen,Heinz Uphoff,Emmanuel Waller 1993 We show how classical datalog semantics can be used directly and very simply to provide semantics to a syntactic extension of datalog with methods, classes, inheritance, overloading and late binding. Several approaches to resolution are considered, implemented in the model, and formally compared. They range from resolution in C++ style to original kinds of resolution suggested by the declarative nature of the language. We show connections to view specification and a further extension allowing runtime derivation of the class hierarchy. SIGMOD Conference Mining Association Rules between Sets of Items in Large Databases. Rakesh Agrawal,Tomasz Imielinski,Arun N. Swami 1993 We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm. SIGMOD Conference Using the Co-existence Approach to Achieve Combined Functionality of Object-Oriented and Relational Systems. R. Ananthanarayanan,Vibby Gottemukkala,Wolfgang Käfer,Tobin J. Lehman,Hamid Pirahesh 1993 Once considered a novelty, object oriented systems have now entered the mainstream. Their impressive performance and rich type systems have created a demand for object oriented features in other areas, such as relational database systems. We believe the current efforts to combine object oriented and relational features into a single hybrid system will fall short of the mark, whereas our approach, the co-existence approach, has the distinction of requiring far less work, but at the same time promising both the desired functionality and performance. We describe the attributes of our co-existing systems, an object oriented system (C++) and a relational system (Starburst), and show how this combination supports the desired features of both object-oriented and relational systems. SIGMOD Conference "The ""SUPER"" Project." Martin Andersson,Annamaria Auddino,Yann Dupont,Edi Fontana,M. Gentile,Stefano Spaccapietra 1993 "The ""SUPER"" Project." SIGMOD Conference Experiences Building the Open OODB Query Optimizer. José A. Blakeley,William J. McKenna,Goetz Graefe 1993 "This paper reports our experiences building the query optimizer for TI's Open OODB system.
To the best of our knowledge, it is the first working object query optimizer to be based on a complete extensible optimization framework including logical algebra, execution algorithms, property enforcers, logical transformation rules, implementation rules, and selectivity and cost estimation. Our algebra incorporates a new materialize operator with its corresponding logical transformation and implementation rules that enable the optimization of path expressions. Initial experiments on queries obtained from the object query optimization literature demonstrate that our optimizer is able to derive plans that are as efficient as, and often substantially more efficient than, the plans generated by other query optimization strategies. These experiments demonstrate that our initial choices for populating each part of our optimization framework are reasonable. Our experience also shows that having a complete optimization framework is crucial for two reasons. First, it allows the optimizer to discover plans that cannot be revealed by exploring only the alternatives provided by the logical algebra and its transformations. Second, it helps and forces the database system designer to consider all parts of the framework and to maintain a good balance of choices when incorporating a new logical operator, execution algorithm, transformation rule, or implementation rule. The Open OODB query optimizer was constructed using the Volcano Optimizer Generator, demonstrating that this second-generation optimizer generator enables rapid development of efficient and effective query optimizers for non-standard data models and systems." SIGMOD Conference On the Power of Algebras with Recursion. Catriel Beeri,Tova Milo 1993 We consider the relationship between the deductive and the functional/algebraic query language paradigms. Previous works considered this subject for a non-recursive algebra, or an algebra with a fixed point operation, and the corresponding class of deductive queries is that defined by stratified programs. We consider here algebraic languages extended by general recursive definitions. We also consider languages that allow non-restricted use of negation. It turns out that recursion and negation in the algebraic paradigm need to be studied together. The semantics used for the comparison is the valid semantics, although other well-known declarative semantics can also be used to derive similar results. We show that the class of queries expressed by general deduction with negation can be captured using algebra with recursive definitions. SIGMOD Conference Loading Data into Description Reasoners. Alexander Borgida,Ronald J. Brachman 1993 Knowledge-base management systems (KBMS) based on description logics are being used in a variety of situations where access is needed to large amounts of data stored in existing relational databases. We present the architecture and algorithms of a system that converts most of the inferences made by the KBMS into a collection of SQL queries, thereby relying on the optimization facilities of existing DBMS to gain efficiency, while maintaining an object-centered view of the world with a substantive semantics and significantly different reasoning facilities than those provided by Relational DBMS and their deductive extensions. We address a number of optimization issues that arise in the translation process due to the fact that SQL queries with different syntax (but identical semantics) are not treated uniformly by current database management systems. 
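The preceding entry on loading data into description reasoners describes pushing terminological inferences down into SQL so that the relational DBMS, rather than the knowledge-base system, does the bulk of the work. As a rough illustration only, and not the authors' implementation, the following minimal Python sketch compiles one description-logic-style number restriction into a single SQL query and runs it against a toy SQLite schema; every table, column, and function name here is invented for the example.

import sqlite3

def compile_at_least(base_table, key, role_table, role_fk, min_card):
    # Concept: instances of base_table with at least min_card fillers of the
    # role stored in role_table (a simple number restriction), as one query.
    return (
        f"SELECT b.{key} FROM {base_table} b "
        f"JOIN {role_table} r ON r.{role_fk} = b.{key} "
        f"GROUP BY b.{key} HAVING COUNT(*) >= {min_card}"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase(id INTEGER PRIMARY KEY, buyer INTEGER REFERENCES person(id));
    INSERT INTO person VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO purchase VALUES (10, 1), (11, 1), (12, 2);
""")

# Hypothetical defined concept 'FrequentBuyer': a person with at least two purchases.
sql = compile_at_least("person", "id", "purchase", "buyer", 2)
print(conn.execute(sql).fetchall())  # expected output: [(1,)]

The point of the sketch is the shape of the translation: one defined concept becomes one join/group/having query, so membership tests are answered by the DBMS optimizer instead of by the reasoner iterating over individual objects.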
SIGMOD Conference Index Support for Rule Activation. David A. Brant,Daniel P. Miranker 1993 Integrated rule and database systems are quickly moving from the research laboratory into commercial systems. However, the current generation of prototypes are designed to work with small rule sets involving limited inferencing. The problem of supporting large complex rule programs within database management systems still presents significant challenges. The basis for many of these challenges is providing support for rule activation. Rule activation is defined as the process of determining which rules are satisfied and what data satisfies them. In this paper we present performance results for the DATEX database rule system and its novel indexing technique for supporting rule activation. Our approach assumes that both the rule program and the database must be optimized synergistically. However, as an experimental result we have determined that DATEX requires very few changes to a standard DBMS environment, and we argue that these changes are reasonable for the problems being solved. Based on the performance of DATEX we believe we have demonstrated a satisfactory solution to the rule activation problem for complex rule programs operating within a database system. SIGMOD Conference Efficient Processing of Spatial Joins Using R-Trees. Thomas Brinkhoff,Hans-Peter Kriegel,Bernhard Seeger 1993 Spatial joins are one of the most important operations for combining spatial objects of several relations. The efficient processing of a spatial join is extremely important since its execution time is superlinear in the number of spatial objects of the participating relations, and this number of objects may be very high. In this paper, we present a first detailed study of spatial join processing using R-trees, particularly R*-trees. R-trees are very suitable for supporting spatial queries and the R*-tree is one of the most efficient members of the R-tree family. Starting from a straightforward approach, we present several techniques for improving its execution time with respect to both, CPU- and I/O-time. Eventually, we end up with an algorithm whose total execution time is improved over the first approach by an order of magnitude. Using a buffer of reasonable size, I/O-time is almost optimal, i.e. it almost corresponds to the time for reading each required page of the relations exactly once. The performance of the various approaches is investigated in an experimental performance comparison where several large data sets from real applications are used. SIGMOD Conference InterBase: A Multidatabase Prototype System. Omran A. Bukhres,Jiansan Chen,Ahmed K. Elmagarmid,Xianging Liu,James G. Mullen 1993 InterBase: A Multidatabase Prototype System. SIGMOD Conference An InterBase System at BNR. Omran A. Bukhres,Jiansan Chen,Rob Pezzoli 1993 An InterBase System at BNR. SIGMOD Conference The LOGRES prototype. Filippo Cacace,Stefano Ceri,Stefano Crespi-Reghizzi,Piero Fraternali,Stefano Paraboschi,Letizia Tanca 1993 Logres is a new-generation database system integrating features from deductive and object-oriented databases [1, 2, 3, 4, 5]. The data model of Logres supports structural and semantic complexity through a rich collection of concepts from object-oriented models. The rule language allows for the manipulation of complex objects, the generation of new objects, and the definition of passive and active constraints. 
The application of sets of rules to database states is controlled by means of qualifiers, which dictate the side effects of rules; qualifiers are the unique procedural feature of Logres, otherwise a fully declarative language. SIGMOD Conference The oo7 Benchmark. Michael J. Carey,David J. DeWitt,Jeffrey F. Naughton 1993 The oo7 Benchmark. SIGMOD Conference Tapes Hold Data, Too: Challenges of Tuples on Tertiary Store. Michael J. Carey,Laura M. Haas,Miron Livny 1993 Tapes Hold Data, Too: Challenges of Tuples on Tertiary Store. SIGMOD Conference The Design and Implementation of CoBase. Wesley W. Chu,M. A. Merzbacher,L. Berkovich 1993 CoBase, a cooperative database, is a new type of distributed database that integrates knowledge base technology with database systems to provide cooperative (approximate and conceptual) query answering. Based on the database schema and application characteristics, data are organized into conceptual (type abstraction) hierarchies. The higher levels of the hierarchy provide a more abstract data representation than the lower levels. Generalization (moving up in the hierarchy), specialization (moving down in the hierarchy) and association (moving between hierarchies) are the three key operations in deriving cooperative query answers. Relaxation in CoBase can also be specified explicitly in the query by the user or calling program through cooperative operators. We have extended SQL to CSQL by adding cooperative primitives. We describe the CoBase software implementation, including an inter-module data protocol that provides a uniform module interface. This modular approach provides flexibility in adding new relaxation modules and simplifies software maintenance. CoBase uses LOOM as its knowledge representation and inference system and supports relational databases (e.g., Oracle and Sybase). We have demonstrated the feasibility and functionality of CoBase on top of a Transportation Database. The CoBase methodology has also been adopted in the multi-media medical distributed database project at UCLA, which provides approximate query answers to medical queries. SIGMOD Conference Role of Interoperability in Business Application Development. David Cohen,Gary Larson,Larry Berke 1993 Role of Interoperability in Business Application Development. SIGMOD Conference Replicated Data in a Distributed Environment. Malcolm Colton 1993 Replication Server, a forthcoming product that dynamically maintains subsets of data in a distributed environment, providing several transaction models to maintain loose consistency, is contrasted with existing products which provide location-transparent reads and high-consistency coordinated commits. Replication Server makes it possible to build systems that are much more robust in the face of system component failures. By moving transactions rather than data, and by locating data at the point of processing, it maximizes the use of network bandwidth, enabling the deployment of robust, high-performance applications at a lower cost than with traditional approaches SIGMOD Conference Hy+: A Hygraph-based Query and Visualization System. Mariano P. Consens,Alberto O. Mendelzon 1993 Hy+: A Hygraph-based Query and Visualization System. SIGMOD Conference Practical Prefetching via Data Compression. Kenneth M. Curewitz,P. Krishnan,Jeffrey Scott Vitter 1993 "An important issue that affects response time performance in current OODB and hypertext systems is the I/O involved in moving objects from slow memory to cache.
A promising way to tackle this problem is to use prefetching, in which we predict the user's next page requests and get those pages into cache in the background. Current databases perform limited prefetching using techniques derived from older virtual memory systems. A novel idea of using data compression techniques for prefetching was recently advocated in [KrV, ViK], in which prefetchers based on the Lempel-Ziv data compressor (the UNIX compress command) were shown theoretically to be optimal in the limit. In this paper we analyze the practical aspects of using data compression techniques for prefetching. We adapt three well-known data compressors to get three simple, deterministic, and universal prefetchers. We simulate our prefetchers on sequences of page accesses derived from the OO1 and OO7 benchmarks and from CAD applications, and demonstrate significant reductions in fault-rate. We examine the important issues of cache replacement, size of the data structure used by the prefetcher, and problems arising from bursts of “fast” page requests (that leave virtually no time between adjacent requests for prefetching and book keeping). We conclude that prediction for prefetching based on data compression techniques holds great promise." SIGMOD Conference Third Generation TP Monitors: A Database Challenge. Umeshwar Dayal,Hector Garcia-Molina,Meichun Hsu,Ben Kao,Ming-Chien Shan 1993 "In a 1976 book, “Algorithms + Data Structures = Programs” [15], Niklaus Wirth defined programs to be algorithms and data structures. Of course, by now we know that man does not live from programs alone, and that there is a second fundamental computer science equation: “Programs + Databases = Information Systems.” Database researchers have traditionally focused on the database component of the equation, providing shared and persistent repositories for the data that programs need and produce. As a matter of fact, a lot of us have worked hard to hide or ignore the programs component. For instance, non-procedural languages like SQL and relational algebra have been the holy grail of the database field, letting us describe the data in the way we want without need to write messy programs. The magic wand of transactions makes programs that execute concurrently with our non-procedural statements suddenly disappear: these other programs appear as atomic actions that are either executed before we started looking at our data, or will be executed after we are all done with our work. The wonders of fault tolerance and automatic recovery guarantee that we never have to concern ourselves with our statements failing or being interrupted. The data we need will always be there for us, and our statements will always run to completion. Unfortunately, the real programs that operate on databases are in many cases more complex than the classical ones like “withdraw 100 dollars from my account” or “find me all my blue eyed great-grandfathers.” For one, programs may be much longer, requiring many database interactions. Furthermore, programs need to interact with other concurrent programs, getting results from and to them. They may also need to be aware of their environment, perhaps monitoring the execution of another program, or taking corrective action when some system components fail. Of course, this is not to say that transactions and non-procedural query languages have not been great contributions. In many cases, they are all that is needed to program one's application. 
But beyond that there are many cases when one must deal with multiple concurrent applications. Indeed, a critical problem facing complex enterprises is the automation of complex business processes. Enterprises today are drowning in an ocean of data, with a few isolated islands of automation consisting of heterogeneous databases and legacy application programs, each of which automates some point function (e.g., order entry, inventory, accounting, billing) within the enterprise. As the enterprise attempts to automate its business processes, these isolated islands have to be bridged: complex information systems must be developed that need to span many of these databases and application programs. Traditional database systems do not provide the supporting environment for this. Our programming languages colleagues have been working on the programs component of our fundamental equation, but the database component has traditionally been ignored or hidden. There has been a lot of recent interest on languages that support persistent objects, but often the goal is to make the database that holds the objects look as little as possible like a database. That is, the persistent objects are to be handled just as if they were volatile objects, even though they are not. Also, the programming languages researchers have borrowed the notions of transactions and serializable schedules to hide as much as possible concurrent execution and failures of programs. Finally, traditional programming languages (there are exceptions[4, 13]) have focused on “programming in the small,” as opposed to “programming in the large.” The goal of the former is to program single applications or to solve single problems, as opposed to programming an entire enterprise and all of its interacting applications. Researchers from both camps have recently been addressing both components of the “Programs + Databases” equation. For example, database researchers have been adding triggers and procedures to database objects[2], resulting in so called active databases. These are important steps in the right direction (other related steps are listed below), but still do not address the full programming in the large problem. In our opinion, the only software providers that have tackled both components of the “Programs + Databases” equation, and have a proven track record with real applications, are the Transaction Processing Monitor (TPM) builders[9]." SIGMOD Conference Design and Implementation of the Glue-Nail Database System. Marcia A. Derr,Shinichi Morishita,Geoffrey Phipps 1993 We describe the design and implementation of the Glue-Nail database system. The Nail language is a purely declarative query language; Glue is a procedural language used for non-query activities. The two languages combined are sufficient to write a complete application. Nail and Glue code both compile into the target language IGlue. The Nail compiler uses variants of the magic sets algorithm, and supports well-founded models. Static optimization is performed by the Glue compiler using techniques that include peephole methods and data flow analysis. The IGlue code is executed by the IGlue interpreter, which features a run-time adaptive optimizer. The three optimizers each deal with separate optimization domains, and experiments indicate that an effective synergism is achieved. The Glue-Nail system is largely complete and has been tested using a suite of representative applications. SIGMOD Conference "What's Special about Spatial? 
Database Requirements for Vehicle Navigation in Geographic Space (Extended Abstract)." Max J. Egenhofer 1993 "What's Special about Spatial? Database Requirements for Vehicle Navigation in Geographic Space (Extended Abstract)." SIGMOD Conference Open OODB: A Modular Object-Oriented DBMS. Steve Ford,José A. Blakeley,Thomas J. Bannon 1993 The Open OODB project, part of the DARPA Persistent Object Base (POB) Program, is an effort to build an open, extensible object-oriented database management system (OODB) in which database functionality can be tailored for particular applications within an incrementally improvable framework. The system is designed to serve both as a platform for research and as a testbed that can meet the needs of demanding, next generation database applications. The Open OODB project goals are to describe the design space of OODBs; build an architectural framework that enables configuring independently useful modules to form an OODB; verify the suitability of this open approach by implementing an OODB to these specifications; and identify opportunities for building consensus that can lead to much-needed OODB standards. The motivating factors in this approach were that our previous experience in object-oriented databases had convinced us that different applications have differing requirements, and that a monolithic system is unlikely to meet the needs of many demanding kinds of applications. SIGMOD Conference GREO: A Commercial Database Processor Based on a Pipelined Hardware Sorter. Shinya Fushimi,Masaru Kitsuregawa 1993 GREO: A Commercial Database Processor Based on a Pipelined Hardware Sorter. SIGMOD Conference GOOD: A Graph-Oriented Object Database System. Marc Gemis,Jan Paredaens,Inge Thyssens,Jan Van den Bussche 1993 GOOD: A Graph-Oriented Object Database System. SIGMOD Conference Maintaining Views Incrementally. Ashish Gupta,Inderpal Singh Mumick,V. S. Subrahmanian 1993 We present incremental evaluation algorithms to compute changes to materialized views in relational and deductive database systems, in response to changes (insertions, deletions, and updates) to the relations. The view definitions can be in SQL or Datalog, and may use UNION, negation, aggregation (e.g. SUM, MIN), linear recursion, and general recursion. We first present a counting algorithm that tracks the number of alternative derivations (counts) for each derived tuple in a view. The algorithm works with both set and duplicate semantics. We present the algorithm for nonrecursive views (with negation and aggregation), and show that the count for a tuple can be computed at little or no cost above the cost of deriving the tuple. The algorithm is optimal in that it computes exactly those view tuples that are inserted or deleted. Note that we store only the number of derivations, not the derivations themselves. We then present the Delete and Rederive algorithm, DRed, for incremental maintenance of recursive views (negation and aggregation are permitted). The algorithm works by first deleting a superset of the tuples that need to be deleted, and then rederiving some of them. The algorithm can also be used when the view definition is itself altered. SIGMOD Conference Local Verification of Global Integrity Constraints in Distributed Databases. Ashish Gupta,Jennifer Widom 1993 We present an optimization for integrity constraint verification in distributed databases. The optimization allows a global constraint, i.e.
a constraint spanning multiple databases, to be verified by accessing data at a single database, eliminating the cost of accessing remote data. The optimization is based on an algorithm that takes as input a global constraint and data to be inserted into a local database. The algorithm produces a local condition such that if the local data satisfies this condition then, based on the previous satisfaction of the global constraint, the global constraint is still satisfied. If the local data does not satisfy the condition, then a conventional global verification procedure is required. SIGMOD Conference Second-Order Signature: A Tool for Specifying Data Models, Query Processing, and Optimization. Ralf Hartmut Güting 1993 Second-Order Signature: A Tool for Specifying Data Models, Query Processing, and Optimization. SIGMOD Conference Papyrus GIS Demonstration. Waqar Hasan,Michael L. Heytens,Curtis P. Kolovson,Marie-Anne Neimat,Spyros Potamianos,Donovan A. Schneider 1993 The goal of the Papyrus project [3] is to provide tools and services to enable the integration and parallelization of specialized data managers so that data-intensive applications can be constructed easily and efficiently. In our terminology, a data manager (DM) is a set of specialized methods that manage persistent data. A collection of functions defines the interface to a DM and provides the only means of accessing its persistent data. SIGMOD Conference Predicate Migration: Optimizing Queries with Expensive Predicates. Joseph M. Hellerstein,Michael Stonebraker 1993 "The traditional focus of relational query optimization schemes has been on the choice of join methods and join orders. Restrictions have typically been handled in query optimizers by “predicate pushdown” rules, which apply restrictions in some random order before as many joins as possible. These rules work under the assumption that restriction is essentially a zero-time operation. However, today's extensible and object-oriented database systems allow users to define time-consuming functions, which may be used in a query's restriction and join predicates. Furthermore, SQL has long supported subquery predicates, which may be arbitrarily time-consuming to check. Thus restrictions should not be considered zero-time operations, and the model of query optimization must be enhanced. In this paper we develop a theory for moving expensive predicates in a query plan so that the total cost of the plan — including the costs of both joins and restrictions — is minimal. We present an algorithm to implement the theory, as well as results of our implementation in POSTGRES. Our experience with the newly enhanced POSTGRES query optimizer demonstrates that correctly optimizing queries with expensive predicates often produces plans that are orders of magnitude faster than plans generated by a traditional query optimizer. The additional complexity of considering expensive predicates during optimization is found to be manageably small." SIGMOD Conference Real-Time Transaction Scheduling: A Cost Conscious Approach. D. Hong,Theodore Johnson,Sharma Chakravarthy 1993 Real-time databases are an important component of embedded real-time systems. In a real-time database context, transactions must not only maintain the consistency constraints of the database but must also satisfy the timing constraints specified for each transaction. 
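The predicate-migration entry above (Hellerstein and Stonebraker) is built around the idea of ordering expensive restrictions relative to joins by their cost and selectivity. As a rough illustration only, the sketch below orders restriction predicates by a standard rank metric from the predicate-ordering literature, (selectivity - 1) divided by per-tuple cost; the predicate names and figures are hypothetical and this is not the paper's optimizer.

```python
# Illustrative sketch: order expensive restriction predicates by rank.
# Names, selectivities, and costs are assumed for the example.

def rank(selectivity, cost_per_tuple):
    # Predicates that filter a lot for little cost get the most negative
    # rank and should be applied earliest in the plan.
    return (selectivity - 1.0) / cost_per_tuple

predicates = [
    {"name": "cheap_filter",       "selectivity": 0.50, "cost_per_tuple": 0.01},
    {"name": "expensive_udf",      "selectivity": 0.10, "cost_per_tuple": 50.0},
    {"name": "subquery_predicate", "selectivity": 0.90, "cost_per_tuple": 200.0},
]

for p in sorted(predicates, key=lambda p: rank(p["selectivity"], p["cost_per_tuple"])):
    print(p["name"], round(rank(p["selectivity"], p["cost_per_tuple"]), 4))
```

On these made-up numbers the cheap filter runs first and the subquery predicate is deferred until after the joins have shrunk the input, which is the behavior the paper's theory formalizes.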
Although several approaches have been proposed to integrate real-time scheduling and database concurrency control methods, none of them take into account the dynamic cost of scheduling a transaction. In this paper, we propose a new cost conscious real-time transaction scheduling algorithm which considers dynamic costs associated with a transaction. Our dynamic priority assignment algorithm adapts to changes in the system load without causing excessive numbers of transaction restarts. Our simulations show its superiority over the EDF-HP algorithm. SIGMOD Conference Comparing Rebuild Algorithms for Mirrored and RAID5 Disk Arrays. Robert Y. Hou,Yale N. Patt 1993 Several disk array architectures have been proposed to provide high throughput for transaction processing applications. When a single disk in a redundant array fails, the array continues to operate, albeit in a degraded mode with a corresponding reduction in performance. In addition, the lost data must be rebuilt to a spare disk in a timely manner to reduce the probability of permanent data loss. Several researchers have proposed and examined algorithms for rebuilding the failed disk in a disk array with parity. We examine the use of these algorithms to rebuild a mirrored disk array and compare the rebuild time and performance of the RAID5 and mirrored arrays. Redirection of Reads provides comparable average response times and better rebuild times than Piggybacking for a mirrored array, whereas these two algorithms perform similarly for a RAID5 array. In our experiments comparing the two architectures, a mirrored array has more disks than a RAID5 array and can sustain 150% more I/Os per second during the rebuild process. Even if the size of the RAID5 array is increased to match the mirrored array, the mirrored array reduces response times by up to 60% and rebuild times by up to 45%. SIGMOD Conference Evaluation of Signature Files as Set Access Facilities in OODBs. Yoshiharu Ishikawa,Hiroyuki Kitagawa,Nobuo Ohbo 1993 Object-oriented database systems (OODBs) need efficient support for manipulation of complex objects. In particular, support of queries involving evaluations of set predicates is often required in handling complex objects. In this paper, we propose a scheme to apply signature file techniques, which were originally invented for text retrieval, to the support of set value accesses, and quantitatively evaluate their potential capabilities. Two signature file organizations, the sequential signature file and the bit-sliced signature file, are considered and their performance is compared with that of the nested index for queries involving the set inclusion operator (⊆). We develop a detailed cost model and present analytical results clarifying their retrieval, storage, and update costs. Our analysis shows that the bit-sliced signature file is a very promising set access facility in OODBs. SIGMOD Conference Issues in Multimedia Databases (Panel). H. V. Jagadish 1993 Issues in Multimedia Databases (Panel). SIGMOD Conference Concurrency Control and Recovery of Multidatabase Work Flows in Telecommunication Applications. W. Woody Jin,Marek Rusinkiewicz,Linda Ness,Amit P. Sheth 1993 In a research and technology application project at Bellcore, we used multidatabase transactions to model multisystem work flows of telecommunication applications. During the project a prototype scheduler for executing multi-database transactions was developed. 
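The signature-file entry above (Ishikawa, Kitagawa, and Ohbo) concerns superimposed-coding signatures used as a filter for set-inclusion queries. A minimal sketch of that filtering idea follows; the signature width, hash scheme, and sample sets are assumptions for illustration, and a real signature file keeps the signatures sequentially or bit-sliced on disk rather than in memory.

```python
# Illustrative sketch of signature-based filtering for "query set ⊆ stored set".
SIG_BITS = 64
BITS_PER_ELEMENT = 3

def element_signature(x):
    # Superimposed coding: each element sets a few pseudo-random bit positions.
    sig = 0
    for i in range(BITS_PER_ELEMENT):
        sig |= 1 << (hash((x, i)) % SIG_BITS)
    return sig

def set_signature(s):
    sig = 0
    for x in s:
        sig |= element_signature(x)
    return sig

def may_contain(stored_sig, query_sig):
    # Candidate test: every bit set by the query must also be set in the stored set.
    return stored_sig & query_sig == query_sig

stored_sets = [{"db", "ai", "os"}, {"db", "ir"}, {"graphics"}]
signatures = [set_signature(s) for s in stored_sets]
query = {"db", "ir"}
qsig = set_signature(query)
candidates = [s for s, sig in zip(stored_sets, signatures)
              if may_contain(sig, qsig) and query <= s]  # exact check removes false drops
print(candidates)
```

Because superimposed signatures can produce false drops, the exact subset check on the surviving candidates is what makes the filter safe; the paper's cost model weighs this filtering benefit against storage and update costs.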
Two of the issues addressed in this project were concurrent execution of multi-database transactions and their failure recovery. This paper discusses our use of properties of the application and the telecommunication systems to develop simple and efficient solutions to the concurrency control and recovery problems. SIGMOD Conference Lazy Updates for Distributed Search Structure. Theodore Johnson,Padmashree Krishna 1993 Very large database systems require distributed storage, which means that they need distributed search structures for fast and efficient access to the data. In this paper, we present an approach to maintaining distributed data structures that uses lazy updates, which take advantage of the semantics of the search structure operations to allow for scalable and low-overhead replication. Lazy updates can be used to design distributed search structures that support very high levels of concurrency. The alternatives to lazy update algorithms (eager updates) use synchronization to ensure consistency, while lazy update algorithms avoid blocking. Since lazy updates avoid the use of synchronization, they are much easier to implement than eager update algorithms. We demonstrate the application of lazy updates to the dB-tree, which is a distributed B+ tree that replicates its interior nodes for highly parallel access. We develop a correctness theory for lazy updates so that our algorithms can be applied to other distributed search structures. SIGMOD Conference Performance Evaluation of Ephemeral Logging. John S. Keen,William J. Dally 1993 Ephemeral logging (EL) is a new technique for managing a log of database activity on disk. It does not require periodic checkpoints and does not abort lengthy transactions as frequently as traditional firewall logging for the same amount of disk space. Therefore, it is well suited for highly concurrent databases and applications which have a wide distribution of transaction lifetimes. This paper briefly explains EL and then analyzes its performance. Simulation studies indicate that it can offer significant savings in disk space, at the expense of slightly higher bandwidth for logging and more main memory. The reduced size of the log implies much faster recovery after a crash as well as cost savings. EL is the method of choice in some but not all situations. We assess the limitations of our current knowledge about EL and suggest promising directions for further research. SIGMOD Conference Persistence Software: Bridging Object-Oriented Programming and Relational Databases. Arthur M. Keller,Richard Jensen,Shailesh Agrawal 1993 Building object-oriented applications which access relational data introduces a number of technical issues for developers who are making the transition to C++. We describe these issues and discuss how we have addressed them in Persistence, an application development tool that uses an automatic code generator to merge C++ applications with relational data. We use client-side caching to provide the application program with efficient access to the data. SIGMOD Conference Open DECdtm: Constraint Based Transaction Management. Johannes Klein,Francis Upton IV 1993 Open DECdtm offers portable transaction management services layered on OSF DCE which support the application (TX), resource manager (XA), and transactional DCE RPC (TxRPC) interfaces specified by X/Open. Open DECdtm also provides interoperability with OSI Transaction Processing (OSI TP) and OpenVMS systems using the DECdtm OpenVMS protocol. 
Protocols executed by Open DECdtm are specified by constraints. This simplifies the development of transactional gateways between different data transfer protocols and transaction models. SIGMOD Conference Atomic Incremental Garbage Collection and Recovery for a Large Stable Heap. Elliot K. Kolodner,William E. Weihl 1993 A stable heap is storage that is managed automatically using garbage collection, manipulated using atomic transactions, and accessed using a uniform storage model. These features enhance reliability and simplify programming by preventing errors due to explicit deallocation, by masking failures and concurrency using transactions, and by eliminating the distinction between accessing temporary storage and permanent storage. Stable heap management is useful for programming languages for reliable distributed computing, programming languages with persistent storage, and object-oriented database systems. Many applications that could benefit from a stable heap (e.g., computer-aided design, computer-aided software engineering, and office information systems) require large amounts of storage, timely responses for transactions, and high availability. We present garbage collection and recovery algorithms for a stable heap implementation that meet these goals and are appropriate for stock hardware. The collector is incremental: it does not attempt to collect the whole heap at once. The collector is also atomic: it is coordinated with the recovery system to prevent problems when it moves and modifies objects. The time for recovery is independent of heap size, even if a failure occurs during garbage collection. SIGMOD Conference NAUDA - A Cooperative, Natural Language Interface to Relational Databases. Detlef Küpper,M. Strobel,Dietmar Rösner 1993 The NAUDA1 System is a cooperative natural (German) language database interface for relational databases. The project is carried out at FAW2 - funded by the state of Baden-Württemberg and IBM Germany. This paper describes the extension of a natural language interface to relational databases with respect to its cooperative behavior. We argue that cooperative support of users is especially important for a complex domain such as environmental protection. In order to enrich traditional database reports our system provides dialog-oriented features such as over-answering (providing more information than explicitly requested) and handling of presupposition failure, as well as presentation-oriented features such as natural language responses and geographical maps. Additional information by the system includes (meta-) information about the domain. SIGMOD Conference On Optimal Processor Allocation to Support Pipelined Hash Joins. Ming-Ling Lo,Ming-Syan Chen,Chinya V. Ravishankar,Philip S. Yu 1993 In this paper, we develop algorithms to achieve optimal processor allocation for pipelined hash joins in a multiprocessor-based database system. A pipeline of hash joins is composed of several stages, each of which is associated with one join operation. The whole pipeline is executed in two phases: (1) the table-building phase, and (2) the tuple-probing phase. We focus on the problem of allocating processors to the stages of a pipeline to minimize the query execution time. We formulate the processor allocation problem as a two-phase mini-max optimization problem, and develop three optimal allocation schemes under three different constraints. 
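As a rough illustration of the two-phase mini-max formulation in the processor-allocation entry above, the sketch below greedily assigns processors so that the slowest stage of one phase of a hash-join pipeline is made as fast as possible. The greedy heuristic and the work figures are assumptions for illustration, not the paper's optimal allocation schemes.

```python
import heapq

def allocate(work, processors):
    """Greedy mini-max allocation: repeatedly give the next processor to the
    stage whose estimated time (work / processors assigned) is currently largest."""
    alloc = [1] * len(work)  # every pipeline stage needs at least one processor
    heap = [(-w, i) for i, w in enumerate(work)]
    heapq.heapify(heap)
    for _ in range(processors - len(work)):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-work[i] / alloc[i], i))
    return alloc, max(w / a for w, a in zip(work, alloc))

# Hypothetical per-stage work for the table-building phase of a three-join pipeline.
build_work = [120.0, 60.0, 30.0]
print(allocate(build_work, processors=8))
```

The returned maximum per-stage time is the quantity a mini-max formulation tries to minimize, since the slowest stage bounds the completion time of the whole phase.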
The effectiveness of our problem formulation and solution is verified through a detailed tuple-by-tuple simulation of pipelined hash joins. Our solution scheme is general and applicable to any optimal resource allocation problem formulated as a two-phase mini-max problem. SIGMOD Conference A Modeling Study of the TPC-C Benchmark. Scott T. Leutenegger,Daniel M. Dias 1993 The TPC-C benchmark is a new benchmark approved by the TPC council intended for comparing database platforms running a medium complexity transaction processing workload. Some key aspects in which this new benchmark differs from the TPC-A benchmark are in having several transaction types, some of which are more complex than that in TPC-A, and in having data access skew. In this paper we present results from a modelling study of the TPC-C benchmark for both single node and distributed database management systems. We simulate the TPC-C workload to determine expected buffer miss rates assuming an LRU buffer management policy. These miss rates are then used as inputs to a throughput model. From these models we show the following: (i) We quantify the data access skew as specified in the benchmark and show what fraction of the accesses go to what fraction of the data. (ii) We quantify the resulting buffer hit ratios for each relation as a function of buffer size. (iii) We show that close to linear scale-up (about 3% from the ideal) can be achieved in a distributed system, assuming replication of a read-only table. (iv) We examine the effect of packing hot tuples into pages and show that significant price/performance benefit can be thus achieved. (v) Finally, by coupling the buffer simulations with the throughput model, we examine typical disk/memory configurations that maximize the overall price/performance. SIGMOD Conference Algorithms for Loading Parallel Grid Files. Jianzhong Li,Doron Rotem,Jaideep Srivastava 1993 The paper describes three fast loading algorithms for grid files on a parallel shared nothing architecture. The algorithms use dynamic programming and sampling to effectively partition the data file among the processors to achieve maximum parallelism in answering range queries. Each processor then constructs in parallel its own portion of the grid file. Analytical results and simulations are given for the three algorithms. SIGMOD Conference Information Organization Using Rufus. Allen Luniewski,Peter M. Schwarz,Kurt A. Shoens,Jim Stamos,John Thomas 1993 "Computer system users today are inundated with a flood of semi-structured information, such as documents, electronic mail, programs, and images. Today, this information is typically stored in filesystems that provide limited support for organizing, searching, and operating upon this data, all operations that are vital to the ability of users to effectively use this data. Database systems provide good function for organizing, searching, managing and writing applications on structured data. Current database systems are inappropriate for semi-structured information because moving the data into the database breaks all existing applications that use the data. The Rufus system attacks the problems of semi-structured information by using database function to help users manage semi-structured information without requiring that the user's information reside in the database." SIGMOD Conference LH* - Linear Hashing for Distributed Files. Witold Litwin,Marie-Anne Neimat,Donovan A. Schneider 1993 LH* generalizes Linear Hashing to parallel or distributed RAM and disk files. 
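For readers who want the base scheme that LH* generalizes, the following sketch shows the classic Linear Hashing addressing rule: a file at level i with split pointer n addresses a key with hash function h_i and re-addresses it with h_(i+1) if the bucket it lands in has already been split. This is a textbook illustration with assumed parameter names, not the LH* client/server protocol itself.

```python
def lh_address(key, level, split_pointer, base_buckets=1):
    """Classic Linear Hashing addressing: h_i(key) = hash(key) mod (base * 2**i).
    Buckets below the split pointer have already been split, so they are
    re-addressed with h_(i+1)."""
    h = lambda i: hash(key) % (base_buckets * (2 ** i))
    bucket = h(level)
    if bucket < split_pointer:
        bucket = h(level + 1)
    return bucket

# A client holding a file image (level=3, split_pointer=2) would direct the
# insert or lookup for this key to the server owning the returned bucket.
print(lh_address("customer:42", level=3, split_pointer=2))
```

In LH*, clients may hold an out-of-date image of the level and split pointer, which is why the abstract quotes a small worst-case number of extra messages for forwarding and image adjustment.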
An LH* file can be created from objects provided by any number of distributed and autonomous clients. It can grow gracefully, one bucket at a time, to virtually any number of servers. The number of messages per insertion is one in general, and three in the worst case. The number of messages per retrieval is two in general, and four in the worst case. The load factor can be about constant, 65-95%, depending on the file parameters. The file can also support parallel operations. An LH* file can be much faster than a single site disk file, and/or can hold a much larger number of objects. It can be more efficient than any file with a centralized directory, or a static parallel or distributed hash file. SIGMOD Conference Enhancing Inter-Operability and Data Sharing In Medical Information Systems. Dhamir N. Mannai,Khaled M. Bugrara 1993 "Clinical care generates an immense amount of patient data that has been archived and manipulated by computer-based information systems. Such computer-based medical record systems improved the accessibility of clinical information and made several studies of such information possible. Unfortunately, the care provider's task of retrieving, integrating, and interpreting only those portions of the patient's record that are relevant to a specific clinical problem is actually becoming increasingly difficult. This difficulty can be attributed primarily to the large variety of minimum data sets, the heterogeneous formats used to store the data, the heterogeneous data access methods and procedures, the varying granularity of access to data, the different rigid views of the data, and the lack of inter-operability among the information repositories of such data sets. Recognizing the aforementioned issues, we are engaged in a project to build a multi-database environment tailored for the interoperability of medical information systems. The main building blocks of such a system are a multi-disciplinary minimum data set and a catalogue for the support of interoperability and customization functions. In this paper, we report on the design approach used and describe the general architecture of the system." SIGMOD Conference A Logical Semantics for Object-Oriented Databases. José Meseguer,Xiaolei Qian 1993 Although the mathematical foundations of relational databases are very well established, the state of affairs for object-oriented databases is much less satisfactory. We propose a semantic foundation for object-oriented databases based on a simple logic of change called rewriting logic, and a language called MaudeLog that is based on that logic. Some key advantages of our approach include its logical nature, its simplicity without any need for higher-order features, the fact that dynamic aspects are directly addressed, the rigorous integration of user-definable algebraic data types within the framework, the existence of initial models, and the integration of query, update, and programming aspects within a single declarative language. SIGMOD Conference The COMFORT Prototype: A Step Toward Automated Database Performance Tuning. Axel Mönkeberg,Peter Zabback,Christof Hasse,Gerhard Weikum 1993 The COMFORT Prototype: A Step Toward Automated Database Performance Tuning. SIGMOD Conference "IBM's Relational DBMS Products: Features and Technologies." C. Mohan 1993 This paper very briefly summarizes the features and technologies implemented in the IBM relational DBMS products. 
The topics covered include record and index management, concurrency control and recovery methods, commit protocols, query optimization and execution techniques, high availability and support for parallelism and distributed data. Some indications of likely future product directions are also given. SIGMOD Conference An Efficient and Flexible Method for Archiving a Data Base. C. Mohan,Inderpal Narang 1993 "We describe an efficient method for supporting incremental and full archiving of data bases (e.g., individual files). Customers archive their data bases quite frequently to minimize the duration of data outage. Because of the growing sizes of data bases and the ever-increasing need for high availability of data, the efficiency of the archive copy utility is very important. The method presented here minimizes interference with concurrent transactions by not acquiring any locks on the data being copied. It significantly reduces disk I/Os by not keeping on data pages any extra tracking information in connection with archiving. These features make the archive copy operation more efficient in terms of resource consumption compared to other methods. The method is also flexible in that it optionally supports direct copying of data from disks, bypassing the DBMS's buffer pool. This reduces buffer pool pollution and processing overheads, and allows the utility to take advantage of device geometries for efficiently retrieving data. We also describe extensions to the method to accommodate the multisystem shared disks transaction environment. The method gracefully tolerates system failures during the archive copy operation." SIGMOD Conference What to Teach about Databases (Panel). Amihai Motro 1993 What to Teach about Databases (Panel). SIGMOD Conference Interoperability Using APPC. Debajyoti Mukhopadhyay 1993 The complex and competitive business world of today needs to access data for various operations from different systems placed at different geographical locations. In order to fulfil this need, one should have a reliable distributed computing environment. Various components of that environment may be supplied by different vendors. This means that the computing environment not only needs to be distributed but also requires interoperability. Today, interoperability is no longer just an idea but a reality. There is also a growing need to support interactions among various systems in a dialog mode. Integrating distributed systems can only help to achieve the goal of developing a reliable distributed computing environment. In this paper, a conceptual framework for an architecture is described in conjunction with the Advanced Program-to-Program Communications LU 6.2 protocol to handle that challenge. This architecture discusses the required contracting services for integrating distributed systems. This contracting service has three major components: the contract interaction services, the contract support services, and the communications infrastructure services. SIGMOD Conference VODAK Open Nested Transactions - Visualizing Database Internals. Peter Muth,Thomas C. Rakow 1993 VODAK Open Nested Transactions - Visualizing Database Internals. SIGMOD Conference Issues and Approaches for Migration/Cohabitation between Legacy and new Systems. Rodolphe Nassif,Don Mitchusson 1993 Corporate Subject Data Bases (CSDB) are being introduced to reduce data redundancy, maintain the integrity of the data, provide a uniform data access interface, and have data readily available to make business decisions. 
During the transition phase, there is a need to maintain both the Legacy Systems (LS) and the CSDB, and to synchronize between them. Choosing the right granularity for migration of data and functionality is essential to the success of the migration strategy. Technologies being used to support the transition to CSDB include relational systems supporting stored procedures, remote procedures, expert systems, the object-oriented approach, reengineering tools, and data transition tools. For our Customer CSDB to be deployed in 1993, cleanup of data occurs during initial load of the CSDB. Nightly updates are needed during the transition phase to account for operations executed through LS. There is a lack of an integrated set of tools to help in the transition phase. SIGMOD Conference The LRU-K Page Replacement Algorithm For Database Disk Buffering. "Elizabeth J. O'Neil,Patrick E. O'Neil,Gerhard Weikum" 1993 This paper introduces a new approach to database disk buffering, called the LRU-K method. The basic idea of LRU-K is to keep track of the times of the last K references to popular database pages, using this information to statistically estimate the interarrival times of references on a page-by-page basis. Although the LRU-K approach performs optimal statistical inference under relatively standard assumptions, it is fairly simple and incurs little bookkeeping overhead. As we demonstrate with simulation experiments, the LRU-K algorithm surpasses conventional buffering algorithms in discriminating between frequently and infrequently referenced pages. In fact, LRU-K can approach the behavior of buffering algorithms in which page sets with known access frequencies are manually assigned to different buffer pools of specifically tuned sizes. Unlike such customized buffering algorithms, however, the LRU-K method is self-tuning, and does not rely on external hints about workload characteristics. Furthermore, the LRU-K algorithm adapts in real time to changing patterns of access. SIGMOD Conference Database Challenges in Global Information Systems. Joann J. Ordille,Barton P. Miller 1993 Database Challenges in Global Information Systems. SIGMOD Conference Doubly Distorted Mirrors. Cyril U. Orji,Jon A. Solworth 1993 Traditional mirrored disk systems provide high reliability by multiplexing disks. Performance is improved with parallel reads and shorter read seeks. However, writes must be performed by both disks, limiting performance. Doubly distorted mirrors increase the number of physical writes per logical write from 2 to 3, but perform logical writes more efficiently. This reduces the cost of a random logical write to 1/3 of the cost of a read. Moreover, much of the write cost can be absorbed in the rotational latency of the reads, performing, under certain conditions, all the writes for free. Doubly distorted mirrors achieve a 135% performance improvement over traditional mirrors in the TP1 benchmark. Although these techniques require a disk cache for writes, the cache need not be safe nor is recovery time impacted very much. SIGMOD Conference Partially Preemptive Hash Joins. HweeHwa Pang,Michael J. Carey,Miron Livny 1993 With the advent of real-time and goal-oriented database systems, priority scheduling is likely to be an important feature in future database management systems. A consequence of priority scheduling is that a transaction may lose its buffers to higher-priority transactions, and may be given additional memory when transactions leave the system. 
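The LRU-K entry above lends itself to a compact illustration: keep the times of the last K references to each page and evict the resident page whose K-th most recent reference is oldest, treating pages with fewer than K references as the oldest. The sketch below is a minimal rendering of that bookkeeping under assumed names; it omits the statistical estimation and tuning discussed in the paper.

```python
import time
from collections import defaultdict

class LRUK:
    def __init__(self, k=2, capacity=100):
        self.k, self.capacity = k, capacity
        self.history = defaultdict(list)   # page -> times of its most recent references
        self.resident = set()

    def reference(self, page):
        hist = self.history[page]
        hist.append(time.monotonic())
        del hist[:-self.k]                 # keep only the last K reference times
        if page not in self.resident:
            if len(self.resident) >= self.capacity:
                self.evict()
            self.resident.add(page)

    def evict(self):
        # Victim: resident page whose K-th most recent reference is oldest;
        # pages with fewer than K references sort as oldest of all.
        def kth_recent(p):
            h = self.history[p]
            return h[-self.k] if len(h) >= self.k else float("-inf")
        victim = min(self.resident, key=kth_recent)
        self.resident.remove(victim)
```

With K = 1 this degenerates to ordinary LRU; larger K is what lets the policy distinguish a page referenced twice in quick succession from one that is referenced steadily.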
Due to their heavy reliance on main memory, hash joins are especially vulnerable to fluctuations in memory availability. Previous studies have proposed modifications to the hash join algorithm to cope with these fluctuations, but the proposed algorithms have not been extensively evaluated or compared with each other. This paper contains a performance study of these algorithms. In addition, we introduce a family of memory-adaptive hash join algorithms that turns out to offer even better solutions to the memory fluctuation problem that hash joins experience. SIGMOD Conference The V3 Video Server - Managing Analog and Digital Video Clips. Thomas C. Rakow,Peter Muth 1993 The V3 Video Server is a demonstration showing a multimedia application developed on top of the VODAK database management system. VODAK is a prototype of an object-oriented and distributed database management system (DBMS) developed at GMD-IPSI. The V3 Video Server allows a user to interactively store, retrieve, manipulate, and present analog and short digital video clips. A video clip consists of a sequence of pictures and corresponding sound. Several attributes like author, title, and a set of keywords are annotated. The highlights of the demonstration are as follows. (1) It is shown that an object-oriented database management system is very useful for the development of multimedia applications. (2) The video server gives valuable hints for the development of an object-oriented database management system toward a multimedia database management system. SIGMOD Conference The CORAL Deductive Database System. Raghu Ramakrishnan,William G. Roth,Praveen Seshadri,Divesh Srivastava,S. Sudarshan 1993 The CORAL Deductive Database System. SIGMOD Conference Implementation of the CORAL Deductive Database System. Raghu Ramakrishnan,Divesh Srivastava,S. Sudarshan,Praveen Seshadri 1993 CORAL is a deductive database system that supports a rich declarative language, provides a wide range of evaluation methods, and allows a combination of declarative and imperative programming. The data can be persistent on disk or can reside in main memory. We describe the architecture and implementation of CORAL. There were two important goals in the design of the CORAL architecture: (1) to integrate the different evaluation strategies in a reasonable fashion, and (2) to allow users to influence the optimization techniques used so as to exploit the full power of the CORAL implementation. A CORAL declarative program can be organized as a collection of interacting modules and this modular structure is the key to satisfying both these goals. The high level module interface allows modules with different evaluation techniques to interact in a transparent fashion. Further, users can optionally tailor the execution of a program by selecting from among a wide range of control choices at the level of each module. CORAL also has an interface with C++, and users can program in a combination of declarative CORAL and C++ extended with CORAL primitives. A high degree of extensibility is provided by allowing C++ programmers to use the class structure of C++ to enhance the CORAL implementation. SIGMOD Conference Instrumental Complex of Parallel Software System Development and Operating Environment Support for Distributed Processing within Multitransputer Systems, TRANSSOFT. Boris E. Polyachenko,Filipp I. 
Andon 1993 Instrumental Complex of Parallel Software System Development and Operating Environment Support for Distributed Processing within Multitransputer Systems, TRANSSOFT. SIGMOD Conference The INterset Concept for Multidatabase System Integration in the Pharmaceutical Industry. Tony Schaller 1993 "The industry trends facing information systems in the 1990s involve a matrix of complex requirements. The migration from centralized mainframe computing to desktop personal computing has created opportunities and challenges in moving access to and control of information closer to the end-user. Within this new wave of computing, there has been a drive to move the access control for information to the desktop of the user. Graphical user interface standards were introduced to improve ease of use and to provide a consistent look and feel to desktop applications. This provided some assistance in shielding users from the complexity of the various systems; however, it did not address the difficulties presented in accessing heterogeneous database systems on various platforms." SIGMOD Conference Pegasus Architecture and Design Principles. Ming-Chien Shan 1993 Pegasus Architecture and Design Principles. SIGMOD Conference Using Shared Virtual Memory for Parallel Join Processing. Ambuj Shatdal,Jeffrey F. Naughton 1993 In this paper, we show that shared virtual memory, in a shared-nothing multiprocessor, facilitates the design and implementation of parallel join processing algorithms that perform significantly better in the presence of skew than previously proposed parallel join processing algorithms. We propose two variants of an algorithm for parallel join processing using shared virtual memory, and perform a detailed simulation to investigate their performance. The algorithm is unique in that it employs both the shared virtual memory paradigm and the message-passing paradigm used by current shared-nothing parallel database systems. The implementation of the algorithm requires few modifications to existing shared-nothing parallel database systems. SIGMOD Conference Architecture of the Encina Distributed Transaction Processing Family. Marek Sherman 1993 This paper discusses how the Encina® family of distributed transaction processing software can be used to build reliable, distributed applications. We start with the toolkit components of Encina and how they are used for implementing ACID properties. We then consider how the toolkit can be applied in building higher level components in a DCE environment. We conclude with a discussion of the Encina Monitor, which provides a framework for organizing a collection of machines and servers. SIGMOD Conference Multidatabase Interdependencies in Industry. Amit P. Sheth,George Karabatis 1993 "In this paper we address the problem of consistency among interrelated data. In industrial environments, lack of consistent data creates difficulties in interoperation between systems and often requires manual interventions to restart operations that fail due to inconsistent data. We report the results of a study to understand the applicability, adequacy, and advantages of a framework we had proposed earlier to specify interdatabase dependencies in multidatabase environments. We studied several existing Bellcore systems and identified examples of interdependent data. 
The examples demonstrate that the framework allows precise and detailed specification of complex interdependencies that lead to efficient strategies to enforce the consistency requirements among the corporate data managed in multiple databases. We believe that our specification framework can help in the maintenance of data that meet a business's consistency needs, reduce time-consuming and costly manual operations, and provide data of better quality to end users." SIGMOD Conference An Instant and Accurate Estimation Method for Joins and Selection in a Retrieval-Intensive Environment. Wei Sun,Yibei Ling,Naphtali Rishe,Yi Deng 1993 An Instant and Accurate Estimation Method for Joins and Selection in a Retrieval-Intensive Environment. SIGMOD Conference DDB: An Object Oriented Design Data Manager for VLSI CAD. Anoop Singhal,Robert M. Arlein,Chi-Yuan Lo 1993 In this paper we present an object-oriented data model for VLSI/CAD data. A design data manager (DDB) based on such a model has been implemented under the UNIX/C++ environment. It has been used by a set of diverse VLSI/CAD applications in our organization. Benchmarks have shown it to perform better than commercial object-oriented database systems. In conjunction with the ease of data access, the data manager helped improve software productivity and supported a modular program architecture for our CAD system. SIGMOD Conference OSAM*KBMS: An Object-Oriented Knowledge Base Management System for Supporting Advanced Applications. Stanley Y. W. Su,Herman Lam,Srinivasa Eddula,Javier Arroyo,Neeta Prasad,Ronghao Zhuang 1993 OSAM*KBMS: An Object-Oriented Knowledge Base Management System for Supporting Advanced Applications. SIGMOD Conference The International Directory Network and Connected Data Information Systems for Research in the Earth and Space Sciences. James R. Thieman 1993 Many researchers are becoming aware of the International Directory Network (IDN), an interconnected federation of international directories to Earth and space science data. These directories may become distributed nodes of a single, virtual master data directory of the future. Not as many are aware, however, of the many Earth-and-space-science-relevant information systems which can be accessed automatically from the directories. After determining potentially useful data sets in various disciplines through IDN directories, it is becoming increasingly possible to get detailed information about the correlative possibilities of these data sets through the connected guide/catalog and inventory systems. Such capabilities as data set browse, subsetting, analysis, etc. are available now and will be improving in the future. SIGMOD Conference Caching and Database Scaling in Distributed Shared-Nothing Information Retrieval Systems. Anthony Tomasic,Hector Garcia-Molina 1993 A common class of existing information retrieval systems provides access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this paper this database is studied by using a trace-driven simulation. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. 
One way assumes an “optimal” configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size. SIGMOD Conference The Miro DBMS. Michael Stonebraker 1993 This short paper explains the key object-relational (OR) DBMS technology used by the Miro DBMS. SIGMOD Conference The Sequoia 2000 Benchmark. Michael Stonebraker,James Frew,Kenn Gardels,Jeff Meredith 1993 The Sequoia 2000 Benchmark. SIGMOD Conference Parallel Database Processing on the KSR1 Computer. Emy Tseng,David S. Reiner 1993 The Kendall Square Research high-performance computer (KSR1) provides a spectrum of parallel database processing techniques to achieve scalability and performance in a shared memory environment. The techniques include running multiple transactions in parallel, decomposing queries into parallel subqueries, running multiple instances of the DBMS and partitioning data over disks. These techniques enable on-line transactions to be run in parallel at high throughput rates and decision-support queries to be parallelized and executed very rapidly. This paper focuses upon two of the parallel database processing techniques used on the KSR1—the Kendall Square Query Decomposer and the Oracle Parallel Server. The Query Decomposer intercepts costly decision support queries and decomposes them into subqueries which are executed in parallel. Parallel Server enables multiple ORACLE instances to run simultaneously on the same database. SIGMOD Conference Towards a Unified Visual Database Access. Kumar V. Vadaparty,Y. Alp Aslandogan,Gultekin Özsoyoglu 1993 Since the development of QBE, over fifty visual query languages have been proposed to facilitate easy database access. Although these languages have introduced some very useful paradigms, a number of these have some severe limitations, such as: (a) not extending beyond the relational model, (b) not considering negation and safety formally, (c) using ad hoc constructs, with no analysis of expressivity or complexity done, etc. Note that visual database access is an important issue being revisited, with the emergence of different flavors of object-oriented databases. We believe that there is a need for developing a unified visual query language. Specifically, our goal is to develop a visual query language that has the following properties: (i) It has a few core constructs using which “expert users” can easily define new (derived) constructs; (ii) “Normal users” can easily use either the core or the derived constructs for database querying; (iii) It can implement representative constructs of other (textual or visual) query languages straightforwardly; and (iv) It has formal semantics, with its theoretical properties, such as complexity, analyzed. We believe that we make a first step towards the above goal by introducing a new logical construct called the restricted universal quantifier and combining it with the hierarchical structure of windows to develop a Visual Query Language, called VQL. The core constructs of VQL can easily encode a number of representative constructs of different (about six visual and four non-visual) relational, nested and object-oriented query languages. We also study the theoretical aspects such as safety, complexity, etc., of VQL. SIGMOD Conference Modularity and Tuning Mechanisms in the O2 System. 
Fernando Vélez 1993 The O2 System is a commercial Object-Oriented Database Management System with a complete development environment and a set of user interface tools. In this presentation, we focus on the modularity and application tuning facilities of the system. SIGMOD Conference A Deductive and Object-Oriented Database System: Why and How? Laurent Vieille 1993 "This talk will outline the principles, the architecture and the potential target applications of a Deductive and Object-Oriented Database System (DOOD). Such systems combine the novel functionalities (relying on the associated technology) developed in deductive database projects, the ability to manipulate the complex objects appearing in many applications and the architectural advances achieved by Object-Oriented DBMS's." SIGMOD Conference Single Logical View over Enterprise-Wide Distributed Databases. Andrew E. Wade 1993 "Two trends in today's corporate world demand distribution: downsizing from centralized mainframe single database environments; and wider integration, connecting finance, engineering, manufacturing information systems for enterprise-wide modeling and operations optimization. The resulting environment consists of multiple databases, at the group level, department level, and corporate level, with the need to manage dependencies among data in all of them. The solution is full distribution, providing a single logical view to objects anywhere, from anywhere. Users see a logical model of objects connected to objects, with atomic transactions and propagating methods, even if composite objects are split among multiple databases, each under separate administrative control, on multiple, heterogeneous platforms, operating systems, and network protocols. Support for production environments includes multiple schemas, which may be shared among databases, private, or encrypted, dynamic addition of schemas, and schema evolution. Finally, the logical view must remain valid, and applications must continue to work, as the mapping to the physical environment changes, moving objects and databases to new platforms." SIGMOD Conference Temporal Modules: An Approach Toward Federated Temporal Databases. Xiaoyang Sean Wang,Sushil Jajodia,V. S. Subrahmanian 1993 In a federated database environment, different constituents of the federation may use different temporal models or physical representations for temporal information. This paper introduces a new concept, called a temporal module, to resolve these differences, or mismatches, among the constituents. Intuitively, a temporal module hides the implementation details of a temporal relation by exposing its information only through two windowing functions: The first function associates each time point with a set of tuples and the second function links each tuple to a set of time points. A calculus-style language is given to form queries on temporal modules. Temporal modules are then extended to resolve another type of mismatch among the constituents of a federation, namely, the mismatch involving different time units (e.g., month, week and day) used to record temporal information. Our solution relies on “information conversions” provided by each constituent. Specifically, each temporal module is extended to provide several “windows” to its information, each in terms of a different time unit. The first step to process a query addressed to the federation is to select suitable windows to the underlying temporal modules. 
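The temporal-module abstraction just described exposes a temporal relation only through two windowing functions, one mapping a time point to the tuples that hold at it and one mapping a tuple to its set of time points. A minimal sketch of such an interface follows; the class name, toy data, and month-numbered time points are assumptions for illustration, not the paper's formal definitions.

```python
class TemporalModule:
    """Hides the physical representation of a temporal relation behind
    two windowing functions, as in the temporal-module entry above."""

    def __init__(self, tuples_with_time):
        # tuples_with_time: list of (tuple, set_of_time_points)
        self._data = tuples_with_time

    def tuples_at(self, t):
        """First windowing function: time point -> set of tuples valid at t."""
        return {tup for tup, points in self._data if t in points}

    def times_of(self, tup):
        """Second windowing function: tuple -> set of time points."""
        for stored, points in self._data:
            if stored == tup:
                return set(points)
        return set()

# Toy module recording salaries, with hypothetical month numbers as time points.
salaries = TemporalModule([(("alice", 50000), {1, 2, 3}), (("alice", 55000), {4, 5})])
print(salaries.tuples_at(4))
print(salaries.times_of(("alice", 50000)))
```

A federated query processor would see only these two functions (possibly offered at several time-unit granularities), never the underlying temporal model or physical layout, which is the mismatch-resolution point of the paper.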
In order to facilitate such a process, time units are formally defined and studied. A federated temporal database model and its query language are proposed. The query language is an extension of the above calculus-style language. SIGMOD Conference Interpreting a Reconstructed Relational Calculus (Extended Abstract). Aaron Watters 1993 This paper describes a method for answering all relational calculus queries under the assumption that the domain of data values is sufficiently large. The method extends recent theoretical results that use extended relation representations to answer domain-dependent queries, without the use of auxiliary variables or invented constants or an explicit enumeration of the active domain. The method is shown to be logically correct and to have polynomial data complexity. By identifying relational algebra operations with relational calculus queries, this approach extends relational algebra to a full Boolean algebra, where intersection, union, and difference are defined between any two relations, whether or not they are union compatible. An example illustrates that this approach can be useful in distributed query optimization. SIGMOD Conference Intelligent Integration of Information. Gio Wiederhold 1993 "This paper describes and classifies methods to transform data to information in a three-layer, mediated architecture. The layers can be characterized from the top down as information-consuming applications, mediators which perform intelligent integration of information (I3), and data, knowledge and simulation resources. The objective of modules in the I3 architecture is to provide end users' applications with information obtained through selection, abstraction, fusion, caching, extrapolation, and pruning of data. The data is obtained from many diverse and heterogeneous sources. The I3 objective requires the establishment of a consensual information system architecture, so that many participants and technologies can contribute. An attempt to provide such a range of services within a single, tightly integrated system is unlikely to survive technological or environmental change. This paper focuses on the computational models needed to support the mediating functions in this architecture and introduces initial applications. The architecture has been motivated in [Wied:92C]." SIGMOD Conference Task Scheduling Using Intertask Dependencies in Carnot. Darrell Woelk,Paul C. Attie,Philip Cannata,Greg Meredith,Amit P. Sheth,Munindar P. Singh,Christine Tomlinson 1993 The Carnot Project at MCC is addressing the problem of logically unifying physically-distributed, enterprise-wide, heterogeneous information. Carnot will provide a user with the means to navigate information efficiently and transparently, to update that information consistently, and to write applications easily for large, heterogeneous, distributed information systems. A prototype has been implemented which provides services for (a) enterprise modeling and model integration to create an enterprise-wide view, (b) semantic expansion of queries on the view to queries on individual resources, and (c) inter-resource consistency management. This paper describes the Carnot approach to transaction processing in environments where heterogeneous, distributed, and autonomous systems are required to coordinate the update of the local information under their control. 
In this approach, subtransactions are represented as a set of tasks and a set of intertask dependencies that capture the semantics of a particular relaxed transaction model. A scheduler has been implemented which schedules the execution of these tasks in the Carnot environment so that all intertask dependencies are satisfied. SIGMOD Conference Modeling Battlefield Sensor Environments with an Object Database Management System. Mark A. Woyna,John H. Christiansen,Christopher W. Hield,Kathy Lee Simunich 1993 "The Visual Intelligence and Electronic Warfare Simulation (VIEWS) Workbench software system has been developed by Argonne National Laboratory (ANL) to enable Army intelligence and electronic warfare (IEW) analysts at Unix workstations to conveniently build detailed IEW battlefield scenarios, or “sensor environments”, to drive the Army's high-resolution IEW sensor performance models. VIEWS is fully object-oriented, including the underlying database." SIGMOD Conference Incremental Database Systems: Databases from Ground Up. Stanley B. Zdonik 1993 Incremental Database Systems: Databases from Ground Up. VLDB Data Sharing Analysis for a Database Programming Language via Abstract Interpretation. Giuseppe Amato,Fosca Giannotti,Gianni Mainetto 1993 Data Sharing Analysis for a Database Programming Language via Abstract Interpretation. VLDB Querying and Updating the File. Serge Abiteboul,Sophie Cluet,Tova Milo 1993 Querying and Updating the File. VLDB An Object Data Model with Roles. Antonio Albano,Roberto Bergamini,Giorgio Ghelli,Renzo Orsini 1993 An Object Data Model with Roles. VLDB Specifying and Enforcing Intertask Dependencies. Paul C. Attie,Munindar P. Singh,Amit P. Sheth,Marek Rusinkiewicz 1993 Specifying and Enforcing Intertask Dependencies. VLDB Object Database Morphology. François Bancilhon 1993 Object Database Morphology. VLDB Collections of Objects in SQL3. David Beech 1993 Collections of Objects in SQL3. VLDB Object-Oriented Database Systems: Promises, Reality, and Future. Won Kim 1993 Object-Oriented Database Systems: Promises, Reality, and Future. VLDB STDL - A Portable Language for Transaction Processing. Philip A. Bernstein,Per O. Gyllstrom,Tom Wimberg 1993 STDL - A Portable Language for Transaction Processing. VLDB Toward Practical Constraint Databases. Alexander Brodsky,Joxan Jaffar,Michael J. Maher 1993 Toward Practical Constraint Databases. VLDB Managing Memory to Meet Multiclass Workload Response Time Goals. Kurt P. Brown,Michael J. Carey,Miron Livny 1993 Managing Memory to Meet Multiclass Workload Response Time Goals. VLDB Managing Semantic Heterogeneity with Production Rules and Persistent Queues. Stefano Ceri,Jennifer Widom 1993 Managing Semantic Heterogeneity with Production Rules and Persistent Queues. VLDB Declustering Objects for Visualization. Ling Tony Chen,Doron Rotem 1993 Declustering Objects for Visualization. VLDB Adaptive Database Buffer Allocation Using Query Feedback. Chung-Min Chen,Nick Roussopoulos 1993 Adaptive Database Buffer Allocation Using Query Feedback. VLDB Managing Temporal Financial Data in an Extensible Database. Rakesh Chandra,Arie Segev 1993 Managing Temporal Financial Data in an Extensible Database. VLDB Query Optimization in the Presence of Foreign Functions. Surajit Chaudhuri,Kyuseok Shim 1993 Query Optimization in the Presence of Foreign Functions. VLDB A Practical Issue Concerning Very Large Data Bases: The Need for Query Governors. Gerald Cohen 1993 A Practical Issue Concerning Very Large Data Bases: The Need for Query Governors. 
VLDB An Adaptive Algorithm for Incremental Evaluation of Production Rules in Databases. Françoise Fabret,Mireille Régnier,Eric Simon 1993 An Adaptive Algorithm for Incremental Evaluation of Production Rules in Databases. VLDB Problems/Challenges facing Industry Data Base Users. Kevin Fitzgerald 1993 Problems/Challenges facing Industry Data Base Users. VLDB Local Disk Caching for Client-Server Database Systems. Michael J. Franklin,Michael J. Carey,Miron Livny 1993 Local Disk Caching for Client-Server Database Systems. VLDB A Model of Methods Access Authorization in Object-oriented Databases. Nurit Gal-Oz,Ehud Gudes,Eduardo B. Fernández 1993 A Model of Methods Access Authorization in Object-oriented Databases. VLDB On Implementing a Language for Specifying Active Database Execution Models. Shahram Ghandeharizadeh,Richard Hull,Dean Jacobs,Jaime Castillo,Martha Escobar-Molano,Shih-Hui Lu,Junhui Luo,Chiu Tsang,Gang Zhou 1993 On Implementing a Language for Specifying Active Database Execution Models. VLDB Combining Theory and Practice in Integrity Control: A Declarative Approach to the Specification of a Transaction Modification Subsystem. Paul W. P. J. Grefen 1993 Combining Theory and Practice in Integrity Control: A Declarative Approach to the Specification of a Transaction Modification Subsystem. VLDB Managing Derived Data in the Gaea Scientific DBMS. Nabil I. Hachem,Ke Qiu,Michael A. Gennert,Matthew O. Ward 1993 Managing Derived Data in the Gaea Scientific DBMS. VLDB Performance of Catalog Management Schemes for Running Access Modules in a Locally Distributed Database System. Eui Kyeong Hong 1993 Performance of Catalog Management Schemes for Running Access Modules in a Locally Distributed Database System. VLDB Update Logging for Persistent Programming Languages: A Comparative Performance Evaluation. Antony L. Hosking,Eric W. Brown,J. Eliot B. Moss 1993 Update Logging for Persistent Programming Languages: A Comparative Performance Evaluation. VLDB Implementation and Performance Evaluation of a Parallel Transitive Closure Algorithm on PRISMA/DB. Maurice A. W. Houtsma,Annita N. Wilschut,Jan Flokstra 1993 Implementation and Performance Evaluation of a Parallel Transitive Closure Algorithm on PRISMA/DB. VLDB Universality of Serial Histograms. Yannis E. Ioannidis 1993 Universality of Serial Histograms. VLDB An Active Object-Oriented Database: A Multi-Paradigm Approach to Constraint Management. Hiroshi Ishikawa,Kazumi Kubota 1993 An Active Object-Oriented Database: A Multi-Paradigm Approach to Constraint Management. VLDB Recovering from Main-Memory Lapses. H. V. Jagadish,Abraham Silberschatz,S. Sudarshan 1993 Recovering from Main-Memory Lapses. VLDB Database Research Strategies of Funding Agencies (Panel). Keith G. Jeffery 1993 Database Research Strategies of Funding Agencies (Panel). VLDB A Blackboard Architecture for Query Optimization in Object Bases. Alfons Kemper,Guido Moerkotte,Klaus Peithner 1993 A Blackboard Architecture for Query Optimization in Object Bases. VLDB Mobile Computing: Fertile Research Area or Black Hole? (Panel). Henry F. Korth,Tomasz Imielinski 1993 Mobile Computing: Fertile Research Area or Black Hole? (Panel). VLDB A New Presumed Commit Optimization for Two Phase Commit. Butler W. Lampson,David B. Lomet 1993 A New Presumed Commit Optimization for Two Phase Commit. VLDB On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces. Rosana S. G. 
Lanzelotte,Patrick Valduriez,Mohamed Zaït 1993 On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces. VLDB Queries Independent of Updates. Alon Y. Levy,Yehoshua Sagiv 1993 Queries Independent of Updates. VLDB Key Range Locking Strategies for Improved Concurrency. David B. Lomet 1993 Key Range Locking Strategies for Improved Concurrency. VLDB Exploiting A History Database for Backup. David B. Lomet,Betty Salzberg 1993 Exploiting A History Database for Backup. VLDB The Voice of the Customer: Innovative and Useful Research Directions (Panel). Stuart E. Madnick 1993 The Voice of the Customer: Innovative and Useful Research Directions (Panel). VLDB Dynamic Memory Allocation for Multiple-Query Workloads. Manish Mehta,David J. DeWitt 1993 Dynamic Memory Allocation for Multiple-Query Workloads. VLDB The Use of Information Capacity in Schema Integration and Translation. Renée J. Miller,Yannis E. Ioannidis,Raghu Ramakrishnan 1993 The Use of Information Capacity in Schema Integration and Translation. VLDB Control of an Extensible Query Optimizer: A Planning-Based Approach. Gail Mitchell,Umeshwar Dayal,Stanley B. Zdonik 1993 Control of an Extensible Query Optimizer: A Planning-Based Approach. VLDB A Cost-Effective Method for Providing Improved Data Availability During DBMS Restart Recovery After a Failure. C. Mohan 1993 A Cost-Effective Method for Providing Improved Data Availability During DBMS Restart Recovery After a Failure. VLDB Memory-Adaptive External Sorting. HweeHwa Pang,Michael J. Carey,Miron Livny 1993 Memory-Adaptive External Sorting. VLDB The Need for Data Quality. Blake Patterson 1993 The Need for Data Quality. VLDB Integrity Constraint and Rule Maintenance in Temporal Deductive Knowledge Bases. Dimitris Plexousakis 1993 Integrity Constraint and Rule Maintenance in Temporal Deductive Knowledge Bases. VLDB Disk Mirroring with Alternating Deferred Updates. Christos A. Polyzois,Anupam Bhide,Daniel M. Dias 1993 Disk Mirroring with Alternating Deferred Updates. VLDB Towards a Formal Approach for Object Database Design. Pascal Poncelet,Maguelonne Teisseire,Rosine Cicchetti,Lotfi Lakhal 1993 Towards a Formal Approach for Object Database Design. VLDB A Domain-theoretic Approach to Integrating Functional and Logic Database Languages. Alexandra Poulovassilis,Carol Small 1993 A Domain-theoretic Approach to Integrating Functional and Logic Database Languages. VLDB Analysis of Dynamic Load Balancing Strategies for Parallel Shared Nothing Database Systems. Erhard Rahm,Robert Marek 1993 Analysis of Dynamic Load Balancing Strategies for Parallel Shared Nothing Database Systems. VLDB Database Requirements of Knowledge-based Production Scheduling and Control: A CIM Perspective. Ulf Schreier 1993 Database Requirements of Knowledge-based Production Scheduling and Control: A CIM Perspective. VLDB Reading a Set of Disk Pages. Bernhard Seeger,Per-Åke Larson,Ron McFadyen 1993 Reading a Set of Disk Pages. VLDB Predictions and Challenges for Database Systems in the Year 2000. Patricia G. Selinger 1993 Predictions and Challenges for Database Systems in the Year 2000. VLDB Multi-Join Optimization for Symmetric Multiprocessors. Eugene J. Shekita,Honesty C. Young,Kian-Lee Tan 1993 Multi-Join Optimization for Symmetric Multiprocessors. VLDB The Rufus System: Information Organization for Semi-Structured Data. Kurt A. Shoens,Allen Luniewski,Peter M. Schwarz,James W. Stamos,Joachim Thomas II 1993 The Rufus System: Information Organization for Semi-Structured Data. 
VLDB Coral++: Adding Object-Orientation to a Logic Database Language. Divesh Srivastava,Raghu Ramakrishnan,Praveen Seshadri,S. Sudarshan 1993 Coral++: Adding Object-Orientation to a Logic Database Language. VLDB DBMS Research at a Crossroads: The Vienna Update. Michael Stonebraker,Rakesh Agrawal,Umeshwar Dayal,Erich J. Neuhold,Andreas Reuter 1993 DBMS Research at a Crossroads: The Vienna Update. VLDB Tioga: Providing Data Management Support for Scientific Visualization Applications. Michael Stonebraker,Jolly Chen,Nobuko Nathan,Caroline Paxson,Jiang Wu 1993 Tioga: Providing Data Management Support for Scientific Visualization Applications. VLDB Viewers: A Data-World Analogue of Procedure Calls. Kazimierz Subieta,Florian Matthes,Joachim W. Schmidt,Andreas Rudloff 1993 Viewers: A Data-World Analogue of Procedure Calls. VLDB Versions of Simple and Composite Objects. Guilaine Talens,Chabane Oussalah,M. F. Colinas 1993 Versions of Simple and Composite Objects. VLDB A Plan-Operator Concept for Client-Based Knowledge Progressing. Joachim Thomas,Stefan Deßloch 1993 A Plan-Operator Concept for Client-Based Knowledge Progressing. VLDB Applying Hash Filters to Improving the Execution of Bushy Trees. Ming-Syan Chen,Hui-I Hsiao,Philip S. Yu 1993 Applying Hash Filters to Improving the Execution of Bushy Trees. VLDB NCR 3700 - The Next-Generation Industrial Database Computer. Andrew Witkowski,Felipe Cariño,Pekka Kostamaa 1993 NCR 3700 - The Next-Generation Industrial Database Computer. VLDB Algebraic Optimization of Computations over Scientific Databases. Richard H. Wolniewicz,Goetz Graefe 1993 Algebraic Optimization of Computations over Scientific Databases. VLDB Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files. Justin Zobel,Alistair Moffat,Ron Sacks-Davis 1993 Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files. VLDB Hamming Filters: A Dynamic Signature File Organization for Parallel Stores. Pavel Zezula,Paolo Ciaccia,Paolo Tiberio 1993 Hamming Filters: A Dynamic Signature File Organization for Parallel Stores. SIGMOD Record Helping Computer Scientists in Romania. I. Athanasiu 1993 As a Professor in the Computer Science Department of the Polytechnical Institute of Bucharest, Romania, I would like to bring to your attention the effort of a group of professionals, researchers, students and professors around the world. This effort, called “Free UNIX for Romania”, has been initiated and coordinated by Marius Hancu, an advisor at the Parallel Architectures Group, Center de Recherche Informatique de Montreal, Montreal, Canada. SIGMOD Record Bibliography on Spatiotemporal Databases. Khaled K. Al-Taha,Richard T. Snodgrass,Michael D. Soo 1993 "Spatial and temporal databases are important and well-established sub-disciplines of database research. Some 350 papers in temporal databases have appeared, authored by almost 300 researchers. The literature on spatial databases is also substantial; the bibliography of Samet's landmark book on spatial data structures lists 823 references." SIGMOD Record Extending the Scope of Database Services. Daniel Barbará 1993 A wide variety of important data remains outside of the scope of database management systems. In recent years researchers have been making efforts in two directions: first, developing database management systems that can support non-traditional data, and secondly, offering database services for data that remains under control of autonomous (not necessarily database) systems. 
The purpose of this paper is to provide a brief review of these efforts. SIGMOD Record MoodView: An Advanced Graphical User Interface for OODBMSs. Ismailcem Budak Arpinar,Asuman Dogac,Cem Evrendilek 1993 OODBMSs need more than declarative query languages and programming languages as their interfaces since they are designed and implemented for complex applications requiring more advanced and easy to use visual interfaces. We have developed a complete programming environment for this purpose, called MoodView. MoodView translates all the user actions performed through its graphical interface to SQL statements and therefore it can be ported onto any object-oriented database systems using SQL. MoodView provides the database programmer with tools and functionalities for every phase of object oriented database application development. Current version of MoodView allows a database user to design, browse, and modify database schema interactively and to display class inheritance hierarchy as a directed acyclic graph. MoodView can automatically generate graphical displays for complex and multimedia database objects which can be updated through the object browser. Furthermore, a database administration tool, a full screen text-editor, a SQL based query manager, and a graphical indexing tool for the spatial data, i.e., R Trees are also implemented. SIGMOD Record Merging Application-centric and Data-centric Approaches to Support Transaction-oriented Multi-system Workflows. Yuri Breitbart,Andrew Deacon,Hans-Jörg Schek,Amit P. Sheth,Gerhard Weikum 1993 Workflow management is primarily concerned with dependencies between the tasks of a workflow, to ensure correct control flow and data flow. Transaction management, on the other hand, is concerned with preserving data dependencies by preventing execution of conflicting operations from multiple, concurrently executing tasks or transactions. In this paper we argue that many applications will be served better if the properties of transaction and workflow models are supported by an integrated architecture. We also present preliminary ideas towards such an architecture. SIGMOD Record Remarks on Two New Theorems of Date and Fagin. H. W. Buff 1993 Remarks on Two New Theorems of Date and Fagin. SIGMOD Record SIRIO: A Distributed Information System over a Heterogeneous Computer Network. Carmen Costilla,M. J. Bas,J. Villamor 1993 This paper presents the SIRIO project, a commercial experience within Spanish industry, developed for Tecnatom S.A. by the Data and Knowledge Bases Research Group at the Technical University of Madrid. SIRIO runs over a heterogeneous local area network, with a client-server architecture, using the following tools: Oracle as RDBMS, running over a Unix server, TCP/IP as a communication protocol, Ethernet TOOLKIT for the distributed client-server architecture, and C as the host programming language for the distributed applications (every one of them is rather complex and very different from the rest). The system uses computers with MS-DOS, connected to the server over the LAN. SIRIO is mainly based on the conceptual design of an Rdb, upon which several distributed applications are operational as big software modules. These applications are: 1.- The inspection programs, the management of their corresponding criteria and the automatic generation of queries. 2.- Graphics processing and interface definition. 3.- Interactive Rdb updating. 4.- Historical db management. 5.- Massive load of on-field obtained data. 6.- Report and query application. 
The approach of the SIRIO integrated information system presented here is a pioneering one. There are about two dozen companies worldwide in this field and none has developed such an advanced system to this day. From 1992, SIRIO is totally operational in the Tecnatom S.A. industry. It constitutes an important tool to obtain the reports (from different plants) for the clients, for the State control organizations, and for the specialized analyst staff. SIGMOD Record "Response to ""Remarks on Two New Theorems of Date and Fagin""." C. J. Date,Ronald Fagin 1993 "Response to ""Remarks on Two New Theorems of Date and Fagin""." SIGMOD Record Report of the Workshop on Semantic Heterogeneity and Interoperation in Multidatabase Systems. Pamela Drew,Roger King,Dennis McLeod,Marek Rusinkiewicz,Abraham Silberschatz 1993 This report presents a review of the problems that were discussed during the Workshop on Semantic Heterogeneity and Interoperability in Multidatabase Systems. The workshop participants discussed the importance of interoperation in the U S WEST information processing environment and the progress that has been achieved in three major research areas: resolution of semantic heterogeneity among cooperating heterogeneous systems, transaction management in such environments, and software architecture for interoperation. The workshop provided researchers with necessary feedback from the industrial perspective and helped in identifying the major issues that need further research. The following problems concerning the applicability of the methods proposed for data processing in the heterogeneous, autonomous information systems have been identified: (i) many of the assumptions made in the research community are too restrictive to make the results directly applicable to existing environments; (ii) performance ramifications of various heterogeneous architectures need to be understood; (iii) the prototype systems need to be put to the test with real data, schemas, and transaction streams to verify their utility. SIGMOD Record PARDES - A Data-Driven Oriented Active Database Model. Opher Etzion 1993 Most active database models adopted an event-driven approach in which whenever a given event occurs the database triggers some actions. Many derivations are data-driven by nature, deriving the values of data-elements as a function of the values of other derived data-elements. The handling of such rules by current active databases suffers from semantic and pragmatic fallacies. This paper explores these fallacies and reports about the PARDES language and supporting architecture, aiming at the support of data-driven rules, in an active database framework. SIGMOD Record Parametric Databases: Seamless Integration of Spatial, Temporal, Belief and Ordinary Data. Shashi K. Gadia 1993 Our model, algebra and SQL-like query language for temporal databases extend naturally to parametric data, of which spatial, temporal, spatio-temporal, belief and ordinary data are special cases. SIGMOD Record Options in Physical Database Design. Goetz Graefe 1993 A cornerstone of modern database systems is physical data independence, i.e., the separation of a type and its associated operations from its physical representation in memory and on storage media. Users manipulate and query data at the logical level; the DBMS translates these logical operations to operations on files, indices, records, and disks. The efficiency of these physical operations depends very much on the choice of data representations. 
Choosing a physical representation for a logical database is called physical database design. The number of possible choices in physical database design is very large; moreover, they very often interact with each other. We attempt to list and classify these choices and to explore their interactions. The purpose of this paper is to provide an overview of possible options to the DBMS developer and some guidance to the DBMS administrator and user. While much of our discussion will draw on the relational data model, physical database design is of even more importance for object-oriented and extensible systems. The reasons are simple: First, the number of logical data types and their operations is larger, requiring and permitting more choices for their representation. Second, the state of the art in query optimization for these systems is much less developed than for relational systems, making careful physical database design even more imperative for object-oriented database systems. SIGMOD Record Data Management for Mobile Computing. Tomasz Imielinski,B. R. Badrinath 1993 Mobile Computing is a new emerging computing paradigm of the future. Data Management in this paradigm poses many challenging problems to the database community. In this paper we identify these new challenges and plan to investigate their technical significance. New research problems include management of location dependent data, wireless data broadcasting, disconnection management and energy efficient data access. SIGMOD Record A Performance Study of Concurrency Control in a Real-Time Main Memory Database System. Le Gruenwald,Sichen Liu 1993 Earlier performance studies of concurrency control algorithms show that in a disk-resident real-time database system, optimistic algorithms perform better than two phase locking with higher priority (2PL-HP). In a main memory real-time database system, disk I/Os are eliminated and thus more transactions are enabled to meet their real-time constraints. Lack of disk I/Os in this environment requires concurrency control be re-examined. This paper conducts a simulation study to compare 2PL-HP with a real time optimistic concurrency control algorithm (OPT-WAIT-50) for a real time main memory database system, MARS. The results show that OPT-WAIT-50 outperforms 2PL-HP with finite resources. SIGMOD Record Database Research at AT&T Bell Laboratories. H. V. Jagadish 1993 Database Research at AT&T Bell Laboratories. SIGMOD Record NSF Workshop on Visual Information Management Systems. Ramesh Jain 1993 "One of the most important technologies needed across many traditional areas as well as emerging new frontiers of computing, is the management of visual information. For example, most of the Grand Challenge applications, under the High Performance Computing and Communication (HPCC) initiative, require management of large volumes of non-alphanumeric information, computations, communication, and visualization of results. Considering the growing need and interest in the organization and retrieval of visual and other non-alphanumeric information, and in order to stimulate academic projects in this area, a workshop on Visual Information Management Systems (VIMS) was sponsored by the National Science Foundation. This workshop was held in Redwood, CA, on February 24-25, 1992. The goal of the workshop was to identify major research areas that should be addressed by researchers for VIMS that would be useful in scientific, industrial, medical, environmental, educational, entertainment, and other applications. 
The major findings of the workshop were that VIMS require new techniques in all aspects of databases, computer vision, and knowledge representation and management; and that such techniques are best developed in the context of concrete, practical applications. VIMS will provide impetus and testbeds for many techniques being explored for the future database systems. Researchers from image processing and understanding, knowledge representation and knowledge based systems, and databases must work very closely to develop VIMS. Such systems should be developed in the context of applications that will be of immediate interest in industrial, medical, or scientific contexts. Without concrete applications and ambitious implementation projects, most of the important and difficult issues are likely to be ignored. Considering the interdisciplinary nature of the research in this area, a few major research projects in this area are essential for its growth. Increased emphasis on HPCC by many Federal agencies can help in the rapid development of VIMS technology. Similarly, by addressing some of the Grand Challenges, researchers interested in VIMS can understand critical issues and develop techniques to solve them, in a concrete and useful context. Parallel processing is essential for implementing VIMS. As is well known, the processing of images is one of the most computation-intensive tasks. For entering images in databases and for performing required operations at query time, an enormous volume of data must be processed. Parallel computing will be essential for implementing a VIMS that can insert images in reasonable time and provide fast response to user queries. The computational requirements of video databases are likely to be one of the most demanding. It is very likely that video databases will require research in highly parallel-pipelined architectures. In interdisciplinary research areas such as VIMS, most important and difficult problems usually fall through the cracks. The three most relevant areas for the development of VIMS are: databases, computer vision, and knowledge representation. Data compression, fault-tolerant real time access to image data through networks, and parallel processing issues should be addressed in the context of databases for VIMS. VIMS should not be considered as an application of the existing state of the art in any of these fields to manage and process images. Database researchers must understand the issues specific to managing and processing images and other forms of data by granting them the same status that has been given to alphanumeric information. Computer vision researchers should identify features required for interactive image understanding, rather than their discipline's current emphasis on automatic techniques, and develop techniques to compute features in interactive environments. Most knowledge representation research has been concerned with symbolic knowledge. For VIMS and HPCC applications, techniques for representing symbolic and non-symbolic representations at the same level will be required. Reasoning approaches that can deal with such representations will be useful not only in VIMS, but in many other applications also. Finally, performance issues pose a significant challenge in all aspects of VIMS, from memory organization to information retrieval." SIGMOD Record Database Conference Calendar / Calls For Papers. Keith G. Jeffery 1993 Database Conference Calendar / Calls For Papers. SIGMOD Record Database Conference Calendar. Keith G.
Jeffery 1993 Database Conference Calendar. SIGMOD Record "Chair's Message." Won Kim 1993 "Chair's Message." SIGMOD Record "Chair's Message." Won Kim 1993 "Chair's Message." SIGMOD Record "Chair's Message." Won Kim 1993 "Chair's Message." SIGMOD Record An Update of the Temporal Database Bibliography. Nick Kline 1993 An Update of the Temporal Database Bibliography. SIGMOD Record Implementation of a Graph-Based Data Model for Complex Objects. Mark Levene,Alexandra Poulovassilis,Kerima Benkerimi,Sara Schwartz,Eran Tuv 1993 We have developed a graph-based data model called the Hypernode Model whose single data structure is the hypernode, a directed graph whose nodes may themselves reference further directed graphs. A prototype database system supporting this model is being developed at London University as part of a project whose aims are threefold: (i) to ascertain the expressiveness and flexibility of the hypernode model, (ii) to experiment with various querying paradigms for this model, and (iii) to investigate the suitability of the directed graph as a data structure supported throughout all levels of the implementation. The purpose of this paper is to report upon our findings to date. SIGMOD Record A Survey on Usage of SQL. Hongjun Lu,Hock Chuan Chan,Kwok Kee Wei 1993 Relational database systems have been on the market for more than a decade. SQL has been accepted as the standard query language of relational systems. To further understand the usage of relational systems and the relational query language SQL, we conducted a survey recently that covers various aspects of the usage of SQL in industrial organizations. In this paper, we present those results that may interest DBMS researchers and developers, including the profiles of SQL users, the application areas where SQL is used, the usage of different features of SQL and difficulties encountered by SQL users. SIGMOD Record Database Research at the Data-Intensive Systems Center. David Maier,Lois M. L. Delcambre,Calton Pu,Jonathan Walpole,Goetz Graefe,Leonard D. Shapiro 1993 Database Research at the Data-Intensive Systems Center. SIGMOD Record Schema Evolution in OODBs Using Class Versioning. Simon R. Monk,Ian Sommerville 1993 This paper describes work carried out on a model for the versioning of class definitions in an object-oriented database. By defining update and backdate functions on attributes of the previous and current version of a class definition, instances of any version of the class can be converted to instances of any other version. This allows programs written to access an old version of the schema to still use data created in the format of the changed schema. SIGMOD Record Database Research at the University of Queensland. Maria E. Orlowska 1993 Database Research at the University of Queensland. SIGMOD Record Workshop Report: International Workshop on Distributed Object Management. M. Tamer Özsu,Umeshwar Dayal,Patrick Valduriez 1993 The International Workshop on Distributed Object Management (IWDOM) was organized in Edmonton, Canada on the University of Alberta campus between August 19-21, 1992. The Workshop addressed the theory and practice of developing distributed object-oriented database management systems (OODBMS). The objective was to take note of the state-of-the-art in distributed object management technology and to identify the issues that need to be resolved.
The recent developments in standards-related activities, notably the release of the CORBA specifications by the Object Management Group, contributed to the timeliness and importance of the Workshop. SIGMOD Record Role-Based Security, Object Oriented Databases & Separation of Duty. Matunda Nyanchama,Sylvia L. Osborn 1993 Role-Based Security, Object Oriented Databases & Separation of Duty. SIGMOD Record On Temporal Modeling in the Context of Object Databases. Niki Pissinou,Kia Makki,Yelena Yesha 1993 On Temporal Modeling in the Context of Object Databases. SIGMOD Record Parallel Query Processing in Shared Disk Database Systems. Erhard Rahm 1993 System developments and research on parallel query processing have concentrated either on “Shared Everything” or “Shared Nothing” architectures so far. While there are several commercial DBMS based on the “Shared Disk” alternative, this architecture has received very little attention with respect to parallel query processing. A comparison between Shared Disk and Shared Nothing reveals many potential benefits for Shared Disk with respect to parallel query processing. In particular, Shared Disk supports more flexible control over the communication overhead for intra-transaction parallelism, and a higher potential for dynamic load balancing and efficient processing of mixed OLTP/query workloads. We also sketch necessary extensions for transaction management (concurrency/coherency control, logging/recovery) to support intra-transaction parallelism in the Shared Disk environment. SIGMOD Record Deadlock Prevention in a Distributed Database System. P. Krishna Reddy,Subhash Bhalla 1993 The distributed locking based approaches to concurrency control in a distributed database system, are prone to occurrence of deadlocks. An algorithm for deadlock prevention has been considered in this proposal. In this algorithm, a transaction is executed by forming wait for relations with other conflicting transactions. The technique for generation of this kind of precedence graph for transaction execution is analyzed. This approach is a fully distributed approach. The technique is free from deadlocks, avoids resubmission of transactions, and hence reduces processing delays within the distributed environment. SIGMOD Record Database Compression. Mark A. Roth,Scott J. Van Horn 1993 Despite the fact that computer memory costs have decreased dramatically over the past few years, data storage still remains, and will probably always remain, an important cost factor for many large scale database applications. Compressing data in a database system is attractive for two reasons: data storage reduction and performance improvement. Storage reduction is a direct and obvious benefit, while performance improves because smaller amounts of physical data need to be moved for any particular operation on the database. We address several aspects of reversible data compression and compression techniques: general concepts of data compression; a number of compression techniques; a comparison of the effects of compression on common data types; advantages and disadvantages of compressing data; and future research needs. SIGMOD Record Relational Database Integration in the IBM AS/400. S. Scholerman,L. Miller,J. Tenner,S. Tomanek,M. Zolliker 1993 "A great deal of research has been focused on the development of database machines. In parallel to this work some vendors have developed general purpose machines with database function built directly into the machine architecture. 
The IBM AS/400 is one of the principal examples of this approach. Designed with a strong object orientation and the basic functions of the relational database model integrated into its architecture, the AS/400 has proved to be a commercial success. In the present work we look at the database component of the AS/400." SIGMOD Record "Editor's Notes." Arie Segev 1993 "Editor's Notes." SIGMOD Record "Editor's Notes." Arie Segev 1993 "Editor's Notes." SIGMOD Record "Editor's Notes." Arie Segev 1993 "Editor's Notes." SIGMOD Record Database Research at Wisconsin. 1993 Database Research at Wisconsin. SIGMOD Record Concurrency Control in Trusted Database Management Systems: A Survey. Bhavani M. Thuraisingham,Hai-Ping Ko 1993 Recently several algorithms have been proposed for concurrency control in a Trusted Database Management System (TDBMS). The various research efforts are examining the concurrency control algorithms developed for DBMSs and adapting them for a multilevel environment. This paper provides a survey of the concurrency control algorithms for a TDBMS and discusses future directions. SIGMOD Record Schema Transformation without Database Reorganization. Markus Tresch,Marc H. Scholl 1993 We argue for avoiding database reorganizations due to schema modification in object-oriented systems, since these are expensive operations and they conflict with reusing existing software components. We show that data independence, which is a neglected concept in object databases, helps to avoid reorganizations in case of capacity preserving and reducing schema transformations. We informally present a couple of examples to illustrate the idea of a schema transformation methodology that avoids database reorganization. SIGMOD Record Experiences with HyperBase: A Hypertext Database Supporting Collaborative Work. Uffe Kock Wiil 1993 This paper describes the architecture and experiences with a hyperbase (hypertext database). HyperBase is based on the client-server model and has been designed especially to support collaboration. HyperBase has been used in a number of (hypertext) applications in our lab and is currently being used in research projects around the world to provide database support to all kinds of applications. One application from our lab is a multiuser hypertext system for collaboration which deals with three fundamental issues in simultaneous sharing: access contention, real-time monitoring and real-time communication. Major experiences with HyperBase (collaboration support, data modeling and performance) gained from use both in our lab and in different projects at other research sites are reported. One major lesson learned is that HyperBase can provide powerful support for data sharing among multiple users simultaneously sharing the same environment. SIGMOD Record Change at ONR, and Many Funding Announcements Elsewhere. Marianne Winslett 1993 In this issue, we describe recent changes at the US Office of Naval Research, and then cover a dozen recent announcements of funding in the US for postdocs, sabbaticals, summer jobs, equipment, and research grants—just about anything you might need—from DARPA, the Army, the Air Force, NSF, NIST, and the National Center for Automated Information Research. We also list the newly funded NSF Grand Challenge proposals. SIGMOD Record SIGMOD Goes On Line: New Member Service Via Internet. Marianne Winslett 1993 SIGMOD Goes On Line: New Member Service Via Internet. SIGMOD Record Update on SIGMOD On-Line Services.
Marianne Winslett 1993 Update on SIGMOD On-Line Services. SIGMOD Record NSF and HPCC Under Attack. Marianne Winslett 1993 In this issue we describe recent turmoil over NSF, ARPA, and HPCC, and cover funding opportunities from NSF, CRA, NIH, the Army, the Air Force, NIST, and NASA. SIGMOD Record Timely Access to Future Funding Announcements. Marianne Winslett,Iris Sheauyin Chu 1993 "The last issue of this column appeared six months ago, so in the interim many requests for proposals have been issued, with many of their due dates already past. We are happy to announce the availability of up-to-the-minute funding information on line, so that the Record's publication schedule need not prevent readers' timely access to funding information. In addition, we briefly recap recent requests for proposals from NASA, NLM, the Human Brain Project, the US Army, the US Navy, ARPA, NSF, and Intel; more detailed information on any of these (except Intel) can be found in the on-line archives. We also report on news at NSF and NIST." SIGMOD Record Database Research at the University of Florida. 1993 Database Research at the University of Florida. SIGMOD Record Database Research at the University of Twente. 1993 Database Research at the University of Twente. SIGMOD Record Calls for Papers. 1993 Calls for Papers. SIGMOD Record Calls for Papers. 1993 Calls for Papers. ICDE Analysis of Common Subexpression Exploitation Models in Multiple-Query Processing. Jamal R. Alsabbagh,Vijay V. Raghavan 1994 In multiple-query processing, a subexpression that appears in more than one query is called a common subexpression (CSE). A CSE needs to be evaluated only once to produce a temporary result that can then be used to evaluate all the queries containing the CSE. Therefore, the cost of evaluating the CSE is amortized over the queries requiring its evaluation. Two queries, posed simultaneously to the optimizer, may, however, contain subexpressions that are not equivalent but are, nevertheless, related by implication (the extension of one is a proper subset of the other) or intersection (the intersection of the two extensions is a proper subset of both extensions). In order to exploit the opportunity for cost amortization offered by the two latter relationships, the optimizer must rewrite the two queries in such a way that a CSE is induced. This paper compares, empirically and analytically, the performance of the various query execution models that are implied by different approaches to query rewriting. ICDE QBISM: Extending a DBMS to Support 3D Medical Images. Manish Arya,William F. Cody,Christos Faloutsos,Joel E. Richardson,Arthur Toya 1994 "Describes the design and implementation of QBISM (Query By Interactive, Spatial Multimedia), a prototype for querying and visualizing 3D spatial data. The first application is in an area in medical research, in particular, Functional Brain Mapping. The system is built on top of the Starburst DBMS extended to handle spatial data types, specifically, scalar fields and arbitrary regions of space within such fields. The authors list the requirements of the application, discuss the logical and physical database design issues, and present timing results from their prototype. They observed that the DBMS' early spatial filtering results in significant performance savings because the system response time is dominated by the amount of data retrieved, transmitted, and rendered" ICDE Title, Message from the General Chairs, Message from the Program Chair, Committees, Reviewers, Author Index.
1994 Title, Message from the General Chairs, Message from the Program Chair, Committees, Reviewers, Author Index. ICDE On the Interaction Between ISA and Cardinality Constraints. Diego Calvanese,Maurizio Lenzerini 1994 ISA and cardinality constraints are among the most interesting types of constraints in data models. ISA constraints are used to establish several forms of containment among classes, and are receiving great attention in moving to object-oriented data models, where classes are organized in hierarchies based on a generalization/specialization principle. Cardinality constraints impose restrictions on the number of links of a certain type involving every instance of a given class, and can be used for representing several forms of dependencies between classes, including functional and existence dependencies. While the formal properties of each type of constraints are now well understood, little is known of their interaction. We present an effective method for reasoning about a set of ISA and cardinality constraints in the context of a simple data model based on the notions of classes and relationships. In particular, the method allows one both to verify the satisfiability of a schema and to check whether a schema implies a given constraint of any of the two kinds. We prove that the method is sound and complete, thus showing that the reasoning problem for ISA and cardinality constraints is decidable ICDE Distributed Heterogeneous Information Systems (Abstract). William Carpenter,Gholamreza Emami,Emanuel G. Mamatas,Alok C. Nigam 1994 Distributed Heterogeneous Information Systems (Abstract). ICDE Comparing and Synthesizing Integrity Checking Methods for Deductive Databases. Matilde Celma,Carlos Garcia,Laura Mota-Herranz,Hendrik Decker 1994 Comparing and Synthesizing Integrity Checking Methods for Deductive Databases. ICDE Implementing Calendars and Temporal Rules in Next Generation Databases. Rakesh Chandra,Arie Segev,Michael Stonebraker 1994 In applications like financial trading, scheduling, manufacturing and process control, time based predicates in queries and rules are very important. There is also a need to define lists of time points or intervals. The authors refer to these lists as calendars. The authors present a system of calendars that allow specification of natural-language time-based expressions, maintenance of valid time in databases, specification of temporal conditions in database queries and rules, and user-defined semantics for date manipulation. A simple list based language is proposed to define, manipulate and query calendars. The design of the parser and an algorithm for efficient evaluation of calendar expressions is also described. The paper also describes the implementation of time-based rules in POSTGRES using the proposed system of calendars ICDE Papyrus: A History-Based VLSI Design Process Management System. Tzi-cker Chiueh,Randy H. Katz 1994 "This paper describes the design and implementation of a VLSI design process management system called Papyrus, which is built upon a history-based design process model that supports both routine and exploratory VLSI design processes. Emphasis of this paper is put on the descriptions of Papyrus's basic data models, design decisions, and implementation details. 
The operational prototype features a transparent dynamic load balancing scheme to exploit the computation power of networked workstations, an atomicity-guarantee mechanism to preserve the high-level abstraction of the design task construct, an interactive design-history manipulation facility, and a set of storage management techniques to reduce the storage overhead entailed by the single assignment update semantics, which is crucial to the support of the so-called rework mechanism. This system also embodies an innovative history-based meta-data inference scheme that automates many previously user-responsible design data management functions" ICDE On the Selection of Optimal Index Configuration in OO Databases. Sunil Choenni,Elisa Bertino,Henk M. Blanken,Thiel Chang 1994 An operation in object-oriented databases gives rise to the processing of a path. Several database operations may result into the same path. The authors address the problem of optimal index configuration for a single path. As it is shown an optimal index configuration for a path can be achieved by splitting the path into subpaths and by indexing each subpath with the optimal index organization. The authors present an algorithm which is able to select an optimal index configuration for a given path. The authors consider a limited number of existing indexing techniques (simple index, inherited index, nested inherited index, multi-index, and multi-inherited index) but the principles of the algorithm remain the same adding more indexing techniques ICDE Semantics-Based Multilevel Transaction Management in Federated Systems. Andrew Deacon,Hans-Jörg Schek,Gerhard Weikum 1994 A federated database management system (FDBMS) is a special type of distributed database system that enables existing local databases, in a heterogeneous environment, to maintain a high degree of autonomy. One of the key problems in this setting is the coexistence of local transactions and global transactions, where the latter access and manipulate data of multiple local databases. In modeling FDBMS transaction executions the authors propose a more realistic model than the traditional read/write model; in their model a local database exports high-level operations which are the only operations distributed global transactions can execute to access data in the shared local databases. Such restrictions are not unusual in practice as, for example, no airline or bank would ever permit foreign users to execute ad hoc queries against their databases for fear of compromising autonomy. The proposed architecture can be elegantly modeled using the multilevel nested transaction model for which a sound theoretical foundation exists to prove concurrent executions correct. A multilevel scheduler that is able to exploit the semantics of exported operations can significantly increase concurrency by ignoring pseudo conflicts. A practical scheduling mechanism for FDBMSs is described that offers the potential for greater performance and more flexibility than previous approaches based on the read/write model ICDE Transactional Workflow Management in Distributed Object Computing Environments. Dimitrios Georgakopoulos 1994 Focuses on transactional workflows, i.e., the advanced transaction technology required to (i) ensure the reliability of tasks in a workflow, and the correctness and reliability of concurrent workflows, and (ii) support the specification and management of extended transactions models. 
In addition, the author discusses research and development at GTE Laboratories to produce a Transaction Specification and Management Environment (TSME) that can satisfy such requirements. He also discusses the integration of DOM and TSME technologies ICDE Specification and Management of Extended Transactions in a Programmable Transaction Environment. Dimitrios Georgakopoulos,Mark F. Hornick,Piotr Krychniak,Frank Manola 1994 A Transaction Specification and Management Environment (TSME) is a transaction processing system toolkit that supports the definition and construction of application-specific extended transaction models (ETMs). The TSME provides a transaction specification language that allows a transaction model designer to create implementation-independent specifications of extended transactions. In addition, the TSME provides a programmable transaction management mechanism that assembles and configures a run-time environment to support specified ETMs. The authors discuss the TSME in the context of a distributed object management system (DOMS), and describe specifications of extended transactions and corresponding configurations of transaction management mechanisms ICDE Object Placement in Parallel Object-Oriented Database Systems. Shahram Ghandeharizadeh,David Wilhite,Kai-Ming Lin,Xiaoming Zhao 1994 Parallelism is a viable solution to constructing high performance object-oriented database systems. In parallel systems based on a shared-nothing architecture, the database is horizontally declustered across multiple processors, enabling the system to employ multiple processors to speedup the execution time of a query. The placement of objects across the processors has a significant impact on the performance of queries that traverse a few objects. The paper describes and evaluates a greedy algorithm for the placement of objects across the processors of a system. Moreover, it describes two alternative availability strategies and quantifies their performance tradeoff using a trace-driven simulation study ICDE Sort-Merge-Join: An Idea Whose Time Has(h) Passed? Goetz Graefe 1994 Matching two sets of data items is a fundamental operation required in relational, extensible, and object-oriented database systems alike. However, the pros and cons of sort- and hash-based query evaluation techniques in modern query processing systems are still not fully understood. After our earlier research clarified strengths and weaknesses of sort- and hash-based query processing techniques and suggested remedies for the shortcomings of hash-based algorithms, the present paper outlines a number of further differences between sort-merge-join and hybrid hash join that traditionally have been ignored in such comparisons and render sort-merge-join mostly obsolete. We consolidate old and raise new issues pertinent to the comparison of sort- and hash-based query evaluation techniques and stir some thought and discussion among both academic and industrial database system builders ICDE A Multi-Set Extended Relational Algebra - A Formal Approach to a Practical Issue. Paul W. P. J. Grefen,Rolf A. de By 1994 A Multi-Set Extended Relational Algebra - A Formal Approach to a Practical Issue. ICDE Approximate Analysis of Real-Time Database Systems. Jayant R. Haritsa 1994 During the past few years, several studies have been made on the performance of real-time database systems with respect to the number of transactions that miss their deadlines. 
These studies have used either simulation models or database testbeds as their performance evaluation tools. We present a preliminary analytical performance study of real-time transaction processing. Using a series of approximations, we derive simple closed-form solutions to reduced real-time database models. Although quantitatively approximate, the solutions accurately capture system sensitivity to workload parameters and indicate conditions under which performance bounds are achieved. ICDE Performance Evaluation of Grid Based Multi-Attribute Record Declustering Methods. Bhaskar Himatsingka,Jaideep Srivastava 1994 Performance Evaluation of Grid Based Multi-Attribute Record Declustering Methods. ICDE Data Management in Delayed Conferencing (Abstract). Arding Hsu 1994 Data Management in Delayed Conferencing (Abstract). ICDE Object Allocation in Distributed Databases and Mobile Computers. Yixiu Huang,Ouri Wolfson 1994 This paper makes two contributions. First, we introduce a model for evaluating the performance of data allocation and replication algorithms in distributed databases. The model is comprehensive in the sense that it accounts for I/O cost, for communication cost, and for limits on the minimum number of copies of the object (to ensure availability). The second contribution of this paper is the introduction and analysis of an algorithm for automatic dynamic allocation of replicas to processors. Using the new model, we compare the performance of the traditional read-one-write-all static allocation algorithm to the performance of the dynamic allocation algorithm. As a result, we obtain the relationship between the communication cost and I/O cost for which static allocation is superior to dynamic allocation, and the relationships for which dynamic allocation is superior. ICDE Object Skeletons: An Efficient Navigation Structure for Object-Oriented Database Systems. Kien A. Hua,Chinmoy Tripathy 1994 One common requirement of object-oriented applications is the efficient support of complex objects. This requirement is one salient feature of object orientation. Access to a complex object involves two basic steps: (1) The predicate is evaluated to identify the complex object. (2) The qualified complex object is traversed to retrieve the required components. Over the past few years, several indexing schemes have been developed to support step 1. However, support structures for the efficient execution of step 2 have received little attention. To address this problem, we propose, in this paper, using networks of unique object identifiers (Object Skeletons) as a navigational structure to aid query processing of complex objects. In this approach, skeletons of complex objects contain only the semantic information. Once a skeleton has been loaded into memory, navigation along the complex object can be done with no further disk access. Furthermore, since the descriptive information of an object is stored separately from its object identifier, it is free to migrate anywhere in the database. To assess the efficiency of this approach, we built a prototype and compared its performance to some recently proposed indexing schemes. The results of our study indicate that this technique can provide very impressive savings of both space and time. ICDE Active Databases for Active Repositories. Heinrich Jasper 1994 The various activities necessary for constructing a software product are described by software process models.
Many of the actions mentioned there are supported by tools that use a repository in order to create, manipulate, generate, etc. the deliverables. The process is tailored for each project to necessary work and planned with respect to existing resources. This results in a schedule for each project that is manually compared with ongoing work. We introduce the idea of active repositories that partially automate scheduling and controlling of the activities described within a process model. The notion of active repositories is based on active database technology that allows for detecting events and triggering the corresponding actions. Events are state changes in the repository or raised by external components, e.g. a clock or CASE tool. Actions manipulate the repository, trigger CASE tools, signal external systems or notify the user ICDE Supporting Data Mining of Large Databases by Visual Feedback Queries. Daniel A. Keim,Hans-Peter Kriegel,Thomas Seidl 1994 Describes a query system that provides visual relevance feedback in querying large databases. The goal is to support the process of data mining by representing as many data items as possible on the display. By arranging and coloring the data items as pixels according to their relevance for the query, the user gets a visual impression of the resulting data set. Using an interactive query interface, the user may change the query dynamically and receives immediate feedback by the visual representation of the resulting data set. Furthermore, by using multiple windows for different parts of a complex query, the user gets visual feedback for each part of the query and, therefore, may easier understand the overall result. The system allows one to represent the largest amount of data that can be visualized on current display technology, provides valuable feedback in querying the database, and allows the user to find results which would otherwise remain hidden in the database ICDE A Method for Transforming Relational Schemas Into Conceptual Schemas. Paul Johannesson 1994 A major problem with currently existing database systems is that there often does not exist a conceptual understanding of the data. Such an understanding can be obtained by describing the data using a semantic data model, such as the ER model. Consequently, there is a need for methods that translate a schema in a traditional data model into a conceptual schema. We present a method for translating a schema in the relational model into a schema in a conceptual model. We also show that the schema produced has the same information capacity as the original schema. The conceptual model used is a formalization of an extended ER model, which also includes the subtype concept ICDE Query Optimization Strategies for Browsing Sessions. Martin L. Kersten,M. F. N. de Boer 1994 This paper describes techniques and experimental results to obtain response time improvement for a browsing session, i.e. a sequence of interrelated queries to locate a subset of interest. The optimization technique exploits symbolic analysis of the query interdependencies and retention of (partial) query answers. A prototype browsing session optimizer (BSO) has been constructed that runs as a front-end to the Ingres relational system. Based on the experiments reported, we propose to extend (existing) DBMSs with a mechanism to keep and reuse small answers by default. Such investments quickly pay off in sessions with interrelated queries ICDE Cooperative Problem Solving Using Database Conversations. 
Thomas Kirsche,Richard Lenz,Thomas Ruf,Hartmut Wedekind 1994 Cooperative problem solving is a joint style of producing and consuming data. Unfortunately, most database mechanisms developed so far; are more appropriate for competitive usage than for a cooperative working style. They mostly adopt an operational point of view which binds data to applications. Data-oriented mechanisms like check-in/out avoid this binding but do not improve synchronization towards concurrent usage of data. Conversations are an application-independent, tight framework for jointly modifying common data. The idea is to create transaction-spanning conversational working stages that organize different contributions instead of serializing accesses. To illustrate the conversation concept, an extended query language with conversational operations is presented ICDE Declustering Techniques for Parallelizing Temporal Access Structures. Vram Kouramajian,Ramez Elmasri,Anurag Chaudhry 1994 This paper addresses the issues of declustering temporal index and access structures for a single processor multiple independent disk architecture. The temporal index is the Monotonic B+-Tree which uses the time index temporal access structure. We devise a new algorithm, called multi-level round robin, for assigning tree nodes to multiple disks. The multi-level round robin declustering technique takes advantage of the append-only nature of temporal databases to achieve uniform load distribution, decrease response time, and increase the fanout of the tree by eliminating the need to store disk numbers within the tree nodes. We propose two declustering techniques for the time index access structures; one considers only time proximity while declustering, whereas the other considers both time proximity and data size. We investigate their performance over different types of temporal queries and show that various temporal queries have conflicting allocation criteria for the time index buckets. In addition, we devise two disk partition techniques for the time index buckets. The mutually exclusive technique partitions the disks into disjoint groups, whereas the shared disk technique allows the different types of buckets to share all disks ICDE Knowledge-Based Handling of Design Expertise. Pierre Morizet-Mahoudeaux,Einoshin Suzuki,Setsuo Ohsuga 1994 Research issues in the domain of AI for design can be organized in three categories: decision making, representation and knowledge handling. In the area of knowledge handling, this paper addresses issues concerning the management of design experience to guide a priori the generation of candidate solutions. The approach is based on keeping the trace of a previous design experience as a hierarchical knowledge base. A level in the hierarchy can be viewed as a level of granularity of the description of the design process. A general framework for defining a partial order function between the granularity levels in the knowledge bases of design expertise is proposed. It is then possible to compute the sets of the elements belonging to smaller granularity levels, which are linked to any component of the hierarchy. Thus, it makes it possible to compute the level in the hierarchy that can be reused without modification for the design of a new product. Computation of the appropriate level is mainly based on matching the data corresponding to the new requirements with these sets. 
The approach has been tested by using a multiple expert systems structure based on interactively using two systems, an expert system development tool for design, KAUS, and an expert system development tool for diagnosing engineering processes, SUPER. The intrinsic properties of SUPER have also been used for improving the design procedure when qualitative and quantitative knowledge is involved. ICDE Discovering Database Summaries through Refinements of Fuzzy Hypotheses. Doheon Lee,Myoung-Ho Kim 1994 Recently, many applications, such as scientific databases and decision support systems, that require comprehensive analysis of a very large amount of data have evolved. Summary discovery techniques, which extract compact representations grasping the meanings of large databases, can play a major role in those applications. We present an effective and robust method to discover simple linguistic summaries. We first propose a hypothesis refinement algorithm that is a key technique for our summary discovery method. Using the algorithm, a formal procedure for summary discovery is presented together with an illustrative example. Our discovery method can handle both rigid concepts and fuzzy concepts that occur frequently in practice. Discovered summaries can also be regarded as high-level interattribute dependencies. ICDE Resolving Attribute Incompatibility in Database Integration: An Evidential Reasoning Approach. Ee-Peng Lim,Jaideep Srivastava,Shashi Shekhar 1994 "Resolving domain incompatibility among independently developed databases often involves uncertain information. DeMichiel (1989) showed that uncertain information can be generated by the mapping of conflicting attributes to a common domain, based on some domain knowledge. The authors show that uncertain information can also arise when the database integration process requires information not directly represented in the component databases, but can be obtained through some summary of data. They therefore propose an extended relational model based on Dempster-Shafer theory of evidence (1976) to incorporate such uncertain knowledge about the source databases. They also develop a full set of extended relational operations over the extended relations. In particular, an extended union operation has been formalized to combine two extended relations using Dempster's rule of combination. The closure and boundedness properties of the proposed extended operations are formulated" ICDE Polymorphic Reuse Mechanisms for Object-Oriented Database Specifications. Ling Liu,Roberto Zicari,Walter L. Hürsch,Karl J. Lieberherr 1994 A polymorphic approach to the incremental design and reuse of object-oriented methods and query specifications is presented. Using this approach, the effort required for manually reprogramming methods and queries due to schema modifications can be avoided or minimized. The salient features of our approach are the use of propagation patterns and a mechanism for propagation pattern refinement. Propagation patterns can be employed as an interesting specification formalism for modeling operational requirements in object-oriented database systems. They encourage the reuse of operational specifications against the structural modification of an object-oriented schema. Propagation pattern refinement is suited for the specification of reusable operational modules, and for achieving reusability of propagation patterns towards the operational requirement changes. ICDE Fast Ranking in Limited Space.
Alistair Moffat,Justin Zobel 1994 Ranking techniques have long been suggested as alternatives to conventional Boolean methods for searching document collections. The cost of computing a ranking is, however, greater than the cost of performing a Boolean search, in terms of both memory space and processing time. The authors consider the resources required by the cosine method of ranking, and show that, with a careful application of indexing and selection techniques, both the space and the time required by ranking can be substantially reduced. The methods described in the paper have been used to build a retrieval system with which it is possible to process ranked queries of 40 terms in about 5% of the space required by previous implementations; in as little as 25% of the time; and without measurable degradation in retrieval effectiveness ICDE Exploiting Uniqueness in Query Optimization. G. N. Paulley,Per-Åke Larson 1994 Consider an SQL query that specifies duplicate elimination via a DISTINCT clause. Because duplicate elimination often requires an expensive sort of the query result, it is often worthwhile to identify unnecessary DISTINCT clauses and avoid the sort altogether. We prove a necessary and sufficient condition for deciding if a query requires duplicate elimination. The condition exploits knowledge about keys, table constraints, and query predicates. Because the condition cannot always be tested efficiently, we offer a practical algorithm that tests a simpler, sufficient condition. We consider applications of this condition for various types of queries, and show that we can exploit this condition in both relational and nonregulation database systems ICDE Efficient Support for Partial Write Operations in Replicated Databases. Michael Rabinovich,Edward D. Lazowska 1994 We present a new replica control technique targeted at replicated systems in which write operations update a portion of the information in the data item rather than replacing it entirely. The existing protocols capable of supporting partial writes must either perform the writes on all accessible replicas of the data item, or always apply the writes to the same group (quorum set) of replicas. In the former case, the system incurs high message overhead. In the latter case, if any of the replicas in this group fail, additional replicas must be synchronously brought up-to-date during the write operation causing delay to the operation. Also, in both cases, the system loses the advantage of load sharing provided by replication. Our protocol avoids performing the write on all nodes while preserving load sharing and reducing greatly the risk of having to propagate updates synchronously. We describe the protocol, prove it correct, and present a comparative performance study of our protocol and the existing alternatives ICDE Parallel Approaches to Database Management (Abstract). David S. Reiner 1994 Parallel Approaches to Database Management (Abstract). ICDE Capturing Design Dynamics the Concord Approach. Norbert Ritter,Bernhard Mitschang,Theo Härder,Michael Gesmann,Harald Schöning 1994 Capturing Design Dynamics the Concord Approach. ICDE Analysis of Reorganization Overhead in Log-Structured File Systems. John T. Robinson,Peter A. Franaszek 1994 In a log-structured file system (LFS), in general each block written to disk causes another disk block to become invalid data, resulting in one block of free space. 
Over time free disk space becomes highly fragmented, and a high level of dynamic reorganization may be required to coalesce free blocks into physically contiguous areas that subsequently can be used for logs. By consuming available disk bandwidth, this reorganization can degrade system performance. In a segmented disk LFS organization, the copy-and-compact reorganization method reads entire segments and then writes back all valid blocks. Other methods, suggested by earlier work on reduction of storage fragmentation for non-LFS disks, may access far fewer blocks (at the cost of increased CPU time). An analytic model is used to evaluate the effects on available disk bandwidth of dynamic reorganization, as a function of the read/write ratio, storage utilization, and degree of data movement required by dynamic reorganization for steady-state operation. It is shown that decreasing reorganization overhead can have dramatic effects on available disk bandwidth. ICDE The TP-Index: A Dynamic and Efficient Indexing Mechanism for Temporal Databases. Han Shen,Beng Chin Ooi,Hongjun Lu 1994 To support temporal operators efficiently, indexing based on temporal attributes must be supported. The authors propose a dynamic and efficient index scheme called the time polygon (TP-index) for temporal databases. In the scheme, temporal data are mapped into a two-dimensional temporal space, where the data can be clustered based on time. The data space is then partitioned into time polygons, where each polygon corresponds to a data page. The time polygon directory can be organized as a hierarchical index. The index handles long duration temporal data elegantly and efficiently. The performance analysis indicates that the time polygon index is efficient both in storage utilization and query search. ICDE Efficient Organization of Large Multidimensional Arrays. Sunita Sarawagi,Michael Stonebraker 1994 Large multidimensional arrays are widely used in scientific and engineering database applications. The authors present methods of organizing arrays to make their access on secondary and tertiary memory devices fast and efficient. They have developed four techniques for doing this: (1) storing the array in multidimensional “chunks” to minimize the number of blocks fetched, (2) reordering the chunked array to minimize seek distance between accessed blocks, (3) maintaining redundant copies of the array, each organized for a different chunk size and ordering, and (4) partitioning the array onto platters of a tertiary memory device so as to minimize the number of platter switches. The measurements on real data obtained from global change scientists show that accesses on arrays organized using these techniques are often an order of magnitude faster than on the unoptimized data. ICDE Transactional Workflows: Research, Enabling Technologies, and Applications (Abstract). Amit P. Sheth 1994 Transactional Workflows: Research, Enabling Technologies, and Applications (Abstract). ICDE Managing Change in the Rufus System. Peter M. Schwarz,Kurt A. Shoens 1994 Rufus is an information system that models user data with objects taken from a class system. Due to the importance of coping with changes to the schema, Rufus has adopted the conformity-based model of Melampus. This model enables Rufus to cope with schema changes more easily than traditional class- and inheritance-based data models. The paper reviews the Melampus data model and describes how it was implemented in the Rufus system.
The authors show how changes to the schema can be accommodated with minimum disruption. They also review design decisions that contributed to streamlined schema evolution and compare this approach with those proposed in the literature ICDE X-500 Directory Schema Management. Daniel L. Silver,James W. Hong,Michael A. Bauer 1994 X-500 Directory Schema Management. ICDE Efficient Evaluation of the Valid-Time Natural Join. Michael D. Soo,Richard T. Snodgrass,Christian S. Jensen 1994 Joins are arguably the most important relational operators. Poor implementations are tantamount to computing the Cartesian product of the input relations. In a temporal database, the problem is more acute for two reasons. First, conventional techniques are designed for the optimization of joins with equality predicates, rather than the inequality predicates prevalent in valid-time queries. Second, the presence of temporally-varying data dramatically increases the size of the database. These factors require new techniques to efficiently evaluate valid-time joins. The authors address this need for efficient join evaluation in databases supporting valid-time. A new temporal-join algorithm based on tuple partitioning is introduced. This algorithm avoids the quadratic cost of nested-loop evaluation methods; it also avoids sorting. Performance comparisons between the partition-based algorithm and other evaluation methods are provided. While the paper focuses on the important valid-time natural join, the techniques presented are also applicable to other valid-time joins ICDE Mariposa: A New Architecture for Distributed Data. Michael Stonebraker,Paul M. Aoki,Robert Devine,Witold Litwin,Michael A. Olson 1994 We describe the design of Mariposa, an experimental distributed data management system that provides high performance in an environment of high data mobility and heterogeneous host capabilities. The Mariposa design unifies the approaches taken by distributed file systems and distributed databases. In addition, Mariposa provides a general, flexible platform for the development of new algorithms for distributed query optimization, storage management, and scalable data storage structures. This flexibility is primarily due to a unique rule-based design that permits autonomous, local-knowledge decisions to be made regarding data placement, query execution location, and storage management ICDE An Efficient Relational Implementation of Recursive Relationships using Path Signatures. Jukka Teuhola 1994 "The `parts explosion' is a classical problem, which is hard for relational database systems, due to recursion. A simple solution is suggested, which packs information of an ancestor path of a tuple into a fixed-length code, called signature. The coding technique is carefully adjusted to enable an efficient retrieval of the transitive closure, in terms of both disk accesses and DBMS calls. The code is lossy, and its purpose is to define a reasonably small superset of the closure, as well as establish an effective order of clustering. The method performs best for tree-structured hierarchies, where the processing time typically decreases by a factor of more than ten, compared to the trivial method. Also general directed graphs, both acyclic and cyclic, can be handled more efficiently" ICDE On a More Realistic Lock Contention Model and Its Analysis. Alexander Thomasian 1994 Most performance modeling studies of lock contention in transaction processing systems are deficient in that they postulate a homogeneous database access model. 
The non-homogeneous database access model described in this paper allows multiple transaction classes with different access patterns to the database regions. The performance of the system from the viewpoint of lock contention is analyzed in the context of the standard two-phase locking concurrency control method with the general waiting policy. The approximate analysis is based on mean values of parameters and derives expressions for the probability of lock conflict (usually leading to transaction blocking) and the mean blocking time. The latter requires estimating the distribution of the effective wait-depth encountered by blocked transactions and the mean waiting time associated with different blocking levels. The accuracy of the analysis is validated against simulation results and also shown to be more accurate than analytic solutions considering only two levels of transaction blocking. Previously proposed metrics for load control have limited applicability for the model under consideration ICDE Performance Analysis of RAID5 Disk Arrays with a Vacationing Server Model for Rebuild Mode Operation. Alexander Thomasian,Jai Menon 1994 Performance Analysis of RAID5 Disk Arrays with a Vacationing Server Model for Rebuild Mode Operation. ICDE Supporting Partial Data Accesses to Replicated Data. Peter Triantafillou,Feng Xiao 1994 Partial data access operations occur frequently in distributed systems. This paper presents new approaches for efficiently supporting partial data access operations to replicated data. We propose the replica modularization (RM) technique which suggests partitioning replicas into modules, which now become the minimum unit of data access. RM is shown to increase the availability of both partial read and write operations and improves performance by reducing access delays and the size of data transfers occurring during operation execution on replicated data. In addition, we develop a new module-based protocol (MB) in which different replication protocols are used to access different sets of replicas, with each replica storing different modules. The instance of MB we discuss here is a hybrid of the ROWA (Read One Write All) protocol and the MQ (Majority Quorum) protocol. MB allows a trade-off between storage costs and availability. We show that MB can achieve almost as high availability as the MQ protocol, but with considerably smaller storage costs ICDE Supporting High-Bandwidth Navigation in Object-Bases. Venu Vasudevan 1994 Magritte is an attempt to construct a high-bandwidth front-end to an object-base containing meta-data about SCAD designs. SCAD is a small part of a family of visualization applications where the end-user concurrently manipulates large collections of active data. Such end-user interfaces require a different paradigm of interaction than the object-at-a-time interfaces of current databases. Proposals here can be divided into mechanisms for scene creation and those for scene integration. The former allow a user to create a single scene with ease. The latter help in desktop management by allowing scenes to be combined and correlated. The implementation experience points out a number of shortcomings in current database offerings that need to be solved so as to ease the design of high-bandwidth front-ends ICDE Data Placement and Buffer Management for Concurrent Mergesorts with Parallel Prefetching. Kun-Lung Wu,Philip S. Yu,James Z. 
Teng 1994 Various data placement policies are studied for the merge phase of concurrent mergesorts using parallel prefetching, where initial sorted runs (input) of a merge and its final sorted run (output) are stored on multiple disks but each run resides only on a single disk. Since the merge phase involves only sequential references, parallel prefetching can be attractive in reducing the average response time for concurrent merges. However, without careful buffer control, severe thrashing may develop under certain run placement policies, reducing the benefits of prefetching. The authors examine through detailed simulations three different run placement policies. The results show that even though buffer thrashing can be almost avoided by placing the output run of a job on the same disk with at least one of its input runs, this thrashing-avoiding run placement policy can be substantially outperformed by other policies that use buffer thrashing control. With buffer thrashing avoidance, the best performance was achieved by a run placement policy that uses a proper subset of disks dedicated for writing the output runs while the rest of the disks are used for prefetching the input runs in parallel. ICDE Index Structures for Information Filtering Under the Vector Space Model. Tak W. Yan,Hector Garcia-Molina 1994 "With the ever increasing volumes of information generation, users of information systems are facing an information overload. It is desirable to support information filtering as a complement to traditional retrieval mechanisms. The number of users, and thus profiles (representing users' long-term interests), handled by an information filtering system is potentially huge, and the system has to process a constant stream of incoming information in a timely fashion. The efficiency of the filtering process is thus an important issue. In this paper, we study what data structures and algorithms can be used to efficiently perform large-scale information filtering under the vector space model, a retrieval model established as being effective. We apply the idea of the standard inverted index to index user profiles. We devise an alternative to the standard inverted index, in which we, instead of indexing every term in a profile, select only the significant ones to index. We evaluate their performance and show that the indexing methods require orders of magnitude fewer I/Os to process a document than when no index is used. We also show that the proposed alternative performs better in terms of I/O and CPU processing time in many cases." ICDE Performing Group-By before Join. Weipeng P. Yan,Per-Åke Larson 1994 Performing Group-By before Join. ICDE A Hybrid Transitive Closure Algorithm for Sequential and Parallel Processing. Qi Yang,Clement T. Yu,Chengwen Liu,Son Dao,Gaoming Wang,Tracy Pham 1994 A new hybrid algorithm is proposed for well-formed path problems including the transitive closure problem. The CPU time for computation is O(ne), and a blocking technique is incorporated to reduce the disk I/O cost in a disk-resident environment. The new features of the new algorithm are that only parent sets instead of descendant sets are loaded in from disk, and the computation can be parallelized efficiently. Simulation results show that our algorithm is superior to other existing algorithms in sequential computation, and that linear speedup is achieved in parallel computation. ICDE Disk Allocation Methods for Parallelizing Grid Files.
Yvonne Zhou,Shashi Shekhar,Mark Coyle 1994 Disk Allocation Methods for Parallelizing Grid Files. ICDE A Query Sampling Method of Estimating Local Cost Parameters in a Multidatabase System. Qiang Zhu,Per-Åke Larson 1994 A Query Sampling Method of Estimating Local Cost Parameters in a Multidatabase System. ICDE Applying Signatures for Forward Traversal Query Processing in Object-Oriented Databases. Hwan-Seung Yong,Sukho Lee,Hyoung-Joo Kim 1994 Forward traversal methods are used to process queries having nested predicates in object-oriented databases. To expedite the forward traversal, a signature replication technique is proposed. Object signature is a signature formed by values of all atomic attributes defined in the object. When an object refers to other objects through its attribute, the object signature of the referred object is stored into the referring object. Using object signatures, nested predicates can be checked without inspecting referred objects ICDE Storage Reclamation and Reorganization in Client-Server Persistent Object Stores. Voon-Fee Yong,Jeffrey F. Naughton,Jie-Bing Yu 1994 The authors develop and evaluate a number of storage reclamation algorithms for client-server persistent object stores. Experience with a detailed simulation and a prototype implementation in the Exodus storage manager shows that one of the proposed algorithms, the Incremental Partitioned Collector, is complete, maintains transaction semantics, and can be run incrementally and concurrently with client applications. Furthermore, it can significantly improve subsequent system performance by reclustering data, rendering it attractive even for systems that choose not to support automatic storage reclamation SIGMOD Conference Quest: A Project on Database Mining. Rakesh Agrawal,Michael J. Carey,Christos Faloutsos,Sakti P. Ghosh,Maurice A. W. Houtsma,Tomasz Imielinski,Balakrishna R. Iyer,A. Mahboob,H. Miranda,Ramakrishnan Srikant,Arun N. Swami 1994 Quest: A Project on Database Mining. SIGMOD Conference Database Issues in Telecommunications Network Management. Ilsoo Ahn 1994 Database Issues in Telecommunications Network Management. SIGMOD Conference UniSQL/X Unified Relational and Object-Oriented Database System. Won Kim 1994 UniSQL/X Unified Relational and Object-Oriented Database System. SIGMOD Conference Evolving Teradata Decision Support for Massively Parallel Processing with UNIX. Carrie Ballinger 1994 Evolving Teradata Decision Support for Massively Parallel Processing with UNIX. SIGMOD Conference Staggered Striping in Multimedia Information Systems. Steven Berson,Shahram Ghandeharizadeh,Richard R. Muntz,Xiangyu Ju 1994 Multimedia information systems have emerged as an essential component of many application domains ranging from library information systems to entertainment technology. However, most implementations of these systems cannot support the continuous display of multimedia objects and suffer from frequent disruptions and delays termed hiccups. This is due to the low I/O bandwidth of the current disk technology, the high bandwidth requirement of multimedia objects, and the large size of these objects that almost always requires them to be disk resident. One approach to resolve this limitation is to decluster a multimedia object across multiple disk drives in order to employ the aggregate bandwidth of several disks to support the continuous retrieval (and display) of objects. 
This paper describes staggered striping as a novel technique to provide effective support for multiple users accessing the different objects in the database. Detailed simulations confirm the superiority of staggered striping. SIGMOD Conference Sleepers and Workaholics: Caching Strategies in Mobile Environments. Daniel Barbará,Tomasz Imielinski 1994 "In the mobile wireless computing environment of the future, a large number of users equipped with low powered palm-top machines will query databases over the wireless communication channels. Palmtop based units will often be disconnected for prolonged periods of time due to the battery power saving measures; palmtops will also frequently relocate between different cells and connect to different data servers at different times. Caching of frequently accessed data items will be an important technique that will reduce contention on the narrow bandwidth wireless channel. However, cache invalidation strategies will be severely affected by the disconnection and mobility of the clients. The server may no longer know which clients are currently residing under its cell and which of them are currently on. We propose a taxonomy of different cache invalidation strategies and study the impact of clients' disconnection times on their performance. We determine that for the units which are often disconnected (sleepers) the best cache invalidation strategy is based on signatures previously used for efficient file comparison. On the other hand, for units which are connected most of the time (workaholics), the best cache invalidation strategy is based on the periodic broadcast of changed data items." SIGMOD Conference ASSET: A System for Supporting Extended Transactions. Alexandros Biliris,Shaul Dar,Narain H. Gehani,H. V. Jagadish,Krithi Ramamritham 1994 Extended transaction models in databases were motivated by the needs of complex applications such as CAD and software engineering. Transactions in such applications have diverse needs, for example, they may be long lived and they may need to cooperate. We describe ASSET, a system for supporting extended transactions. ASSET consists of a set of transaction primitives that allow users to define custom transaction semantics to match the needs of specific applications. We show how the transaction primitives can be used to specify a variety of transaction models, including nested transactions, split transactions, and sagas. Application-specific transaction models with relaxed correctness criteria, and computations involving workflows, can also be specified using the primitives. We describe the implementation of the ASSET primitives in the context of the Ode database. SIGMOD Conference EOS: An Extensible Object Store. Alexandros Biliris,Euthimios Panagos 1994 EOS: An Extensible Object Store. SIGMOD Conference Open Object Database Management Systems. José A. Blakeley 1994 Open Object Database Management Systems. SIGMOD Conference Multi-Step Processing of Spatial Joins. Thomas Brinkhoff,Hans-Peter Kriegel,Ralf Schneider,Bernhard Seeger 1994 "Spatial joins are one of the most important operations for combining spatial objects of several relations. In this paper, spatial join processing is studied in detail for extended spatial objects in two-dimensional data space. We present an approach for spatial join processing that is based on three steps. First, a spatial join is performed on the minimum bounding rectangles of the objects returning a set of candidates.
Various approaches for accelerating this step of join processing have been examined at last year's conference [BKS 93a]. In this paper, we focus on the problem of how to compute the answers from the set of candidates, which is handled by the following two steps. First of all, sophisticated approximations are used to identify answers as well as to filter out false hits from the set of candidates. For this purpose, we investigate various types of conservative and progressive approximations. In the last step, the exact geometry of the remaining candidates has to be tested against the join predicate. The time required for computing spatial join predicates can essentially be reduced when objects are adequately organized in main memory. In our approach, objects are first decomposed into simple components which are exclusively organized by a main-memory resident spatial data structure. Overall, we present a complete approach to spatial join processing on complex spatial objects. The performance of the individual steps of our approach is evaluated with data sets from real cartographic applications. The results show that our approach reduces the total execution time of the spatial join by factors." SIGMOD Conference GENESYS: A System for Efficient Spatial Query Processing. Thomas Brinkhoff,Hans-Peter Kriegel,Ralf Schneider,Bernhard Seeger 1994 GENESYS: A System for Efficient Spatial Query Processing. SIGMOD Conference The MEDUSA Project: Autonomous Data Management in a Shared-Nothing Parallel Database Machine. George M. Bryan,Wayne E. Moore,B. J. Curry,K. W. Lodge,J. Geyer 1994 The MEDUSA Project: Autonomous Data Management in a Shared-Nothing Parallel Database Machine. SIGMOD Conference "Parallel Database Systems in the 1990's." Michael J. Carey 1994 "Parallel Database Systems in the 1990's." SIGMOD Conference Shoring Up Persistent Applications. Michael J. Carey,David J. DeWitt,Michael J. Franklin,Nancy E. Hall,Mark L. McAuliffe,Jeffrey F. Naughton,Daniel T. Schuh,Marvin H. Solomon,C. K. Tan,Odysseas G. Tsatalos,Seth J. White,Michael J. Zwilling 1994 SHORE (Scalable Heterogeneous Object REpository) is a persistent object system under development at the University of Wisconsin. SHORE represents a merger of object-oriented database and file system technologies. In this paper we give the goals and motivation for SHORE, and describe how SHORE provides features of both technologies. We also describe some novel aspects of the SHORE architecture, including a symmetric peer-to-peer server architecture, server customization through an extensible value-added server facility, and support for scalability on multiprocessor systems. An initial version of SHORE is already operational, and we expect a release of Version 1 in mid-1994. SIGMOD Conference Fine-Grained Sharing in a Page Server OODBMS. Michael J. Carey,Michael J. Franklin,Markos Zaharioudakis 1994 For reasons of simplicity and communication efficiency, a number of existing object-oriented database management systems are based on page server architectures; data pages are their minimum unit of transfer and client caching. Despite their efficiency, page servers are often criticized as being too restrictive when it comes to concurrency, as existing systems use pages as the minimum locking unit as well. In this paper we show how to support object-level locking in a page server context.
Several approaches are described, including an adaptive granularity approach that uses page-level locking for most pages but switches to object-level locking when finer-grained sharing is demanded. We study the performance of these approaches, comparing them to both a pure page server and a pure object server. For the range of workloads that we have examined, our results indicate that a page server is clearly preferable to an object server. Moreover, the adaptive page server is shown to provide very good performance, generally outperforming the pure page server, the pure object server, and the other alternatives as well. SIGMOD Conference Query by Diagram: A Graphical Environment for Querying Databases. Tiziana Catarci,Giuseppe Santucci 1994 Query by Diagram: A Graphical Environment for Querying Databases. SIGMOD Conference ODMG-93: A Standard for Object-Oriented DBMSs. R. G. G. Cattell 1994 ODMG-93: A Standard for Object-Oriented DBMSs. SIGMOD Conference Adaptive Selectivity Estimation Using Query Feedback. Chung-Min Chen,Nick Roussopoulos 1994 In this paper, we propose a novel approach for estimating the record selectivities of database queries. The real attribute value distribution is adaptively approximated by a curve-fitting function using a query feedback mechanism. This approach has the advantage of requiring no extra database access overhead for gathering statistics and of being able to continuously adapt the value distribution through queries and updates. Experimental results show that the estimation accuracy of this approach is comparable to traditional methods based on statistics gathering. SIGMOD Conference From Structured Documents to Novel Query Facilities. Vassilis Christophides,Serge Abiteboul,Sophie Cluet,Michel Scholl 1994 "Structured documents (e.g., SGML) can benefit a lot from database support and more specifically from object-oriented database (OODB) management systems. This paper describes a natural mapping from SGML documents into OODB's and a formal extension of two OODB query languages (one SQL-like and the other calculus) in order to deal with SGML document retrieval. Although motivated by structured documents, the extensions of query languages that we present are general and useful for a variety of other OODB applications. A key element is the introduction of paths as first class citizens. The new features allow querying data (and to some extent schema) without exact knowledge of the schema in a simple and homogeneous fashion." SIGMOD Conference Optimization of Dynamic Query Evaluation Plans. Richard L. Cole,Goetz Graefe 1994 Traditional query optimizers assume accurate knowledge of run-time parameters such as selectivities and resource availability during plan optimization, i.e., at compile time. In reality, however, this assumption is often not justified. Therefore, the “static” plans produced by traditional optimizers may not be optimal for many of their actual run-time invocations. Instead, we propose a novel optimization model that assigns the bulk of the optimization effort to compile-time and delays carefully selected optimization decisions until run-time. Our previous work defined the run-time primitives, “dynamic plans” using “choose-plan” operators, for executing such delayed decisions, but did not solve the problem of constructing dynamic plans at compile-time. The present paper introduces techniques that solve this problem.
Experience with a working prototype optimizer demonstrates (i) that the additional optimization and start-up overhead of dynamic plans compared to static plans is dominated by their advantage at run-time, (ii) that dynamic plans are as robust as the “brute-force” remedy of run-time optimization, i.e., dynamic plans maintain their optimality even if parameters change between compile-time and run-time, and (iii) that the start-up overhead of dynamic plans is significantly less than the time required for complete optimization at run-time. In other words, our proposed techniques are superior to both techniques considered to-date, namely compile-time optimization into a single static plan as well as run-time optimization. Finally, we believe that the concepts and technology described can be transferred to commercial query optimizers in order to improve the performance of embedded queries with host variables in the query predicate and to adapt to run-time system loads unpredictable at compile time. SIGMOD Conference Optimizing Queries on Files. Mariano P. Consens,Tova Milo 1994 We present a framework which allows the user to access and manipulate data uniformly, regardless of whether it resides in a database or in the file system (or in both). A key issue is the performance of the system. We show that text indexing, combined with newly developed optimization techniques, can be used to provide an efficient high level interface to information stored in files. Furthermore, using these techniques, some queries can be evaluated significantly faster than in standard database implementations. We also study the tradeoff between efficiency and the amount of indexing. SIGMOD Conference Partition Selection Policies in Object Database Garbage Collection. Jonathan E. Cook,Alexander L. Wolf,Benjamin G. Zorn 1994 The automatic reclamation of storage for unreferenced objects is very important in object databases. Existing language system algorithms for automatic storage reclamation have been shown to be inappropriate. In this paper, we investigate methods to improve the performance of algorithms for automatic storage reclamation of object databases. These algorithms are based on a technique called partitioned garbage collection, in which a subset of the entire database is collected independently of the rest. Specifically, we investigate the policy that is used to select what partition in the database should be collected. The policies that we propose and investigate are based on the intuition that the values of overwritten pointers provide good hints about where to find garbage. Using trace-driven simulation, we show that one of our policies requires less I/O to collect more garbage than any existing implementable policy and performs close to a near-optimal policy over a wide range of database sizes and object connectivities. SIGMOD Conference "Oracle's Symmetric Replication Technology and Implications for Application Design." Dean Daniels,Lip Boon Doo,Alan Downing,Curtis Elsbernd,Gary Hallmark,Sandeep Jain,Bob Jenkins,Peter Lim,Gordon Smith,Benny Souder,Jim Stamos 1994 "Oracle's Symmetric Replication Technology and Implications for Application Design." SIGMOD Conference A Performance Study of Transitive Closure Algorithms. Shaul Dar,Raghu Ramakrishnan 1994 We present a comprehensive performance evaluation of transitive closure (reachability) algorithms for databases.
The study is based upon careful implementations of the algorithms, measures page I/O, and covers algorithms for full transitive closure as well as partial transitive closure (finding all successors of each node in a set of given source nodes). We examine a wide range of acyclic graphs with varying density and “locality” of arcs in the graph. We also consider query parameters such as the selectivity of the query, and system parameters such as the buffer size and the page and successor list replacement policies. We show that significant cost tradeoffs exist between the algorithms in this spectrum and identify the factors that influence the performance of the algorithms. An important aspect of our work is that we measure a number of different cost metrics, giving us a good understanding of the predictive power of these metrics with respect to I/O cost. This is especially significant since metrics such as number of tuples generated or number of successor list operations have been widely used to compare transitive closure algorithms in the literature. Our results strongly suggest that these other metrics cannot be reliably used to predict I/O cost of transitive closure evaluation. SIGMOD Conference Predictive Dynamic Load Balancing of Parallel and Distributed Rule and Query Processing. Hasanat M. Dewan,Salvatore J. Stolfo,Mauricio A. Hernández,Jae-Jun Hwang 1994 Expert Databases are environments that support the processing of rule programs against a disk resident database. They occupy a position intermediate between active and deductive databases, with respect to the level of abstraction of the underlying rule language. The operational semantics of the rule language influences the problem solving strategy, while the architecture of the processing environment determines efficiency and scalability. In this paper, we present elements of the PARADISER architecture and its kernel rule language, PARULEL. The PARADISER environment provides support for parallel and distributed evaluation of rule programs, as well as static and dynamic load balancing protocols that predictively balance a computation at runtime. This combination of features results in a scalable database rule and complex query processing architecture. We validate our claims by analyzing the performance of the system for two realistic test cases. In particular, we show how the performance of a parallel implementation of transitive closure is significantly improved by predictive dynamic load balancing. SIGMOD Conference DEC Data Distributor: for Data Replication and Data Warehousing. Daniel J. Dietterich 1994 DEC Data Distributor: for Data Replication and Data Warehousing. SIGMOD Conference METU Object-Oriented DBMS. Asuman Dogac,Ismailcem Budak Arpinar,Cem Evrendilek,Cetin Ozkan,Ilker Altintas,Ilker Durusoy,Mehmet Altinel,Tansel Okay,Yuksel Saygin 1994 METU Object-Oriented DBMS. SIGMOD Conference NonStop SQL: Scalability and Availability for Decision Support. Susanne Englert 1994 NonStop SQL: Scalability and Availability for Decision Support. SIGMOD Conference The IMPRESS DDT: A Database Design Toolbox Based on a Formal Specification Language. Jan Flokstra,Maurice van Keulen,Jacek Skowronek 1994 The IMPRESS DDT: A Database Design Toolbox Based on a Formal Specification Language. SIGMOD Conference Fast Subsequence Matching in Time-Series Databases. Christos Faloutsos,M.
Ranganathan,Yannis Manolopoulos 1994 We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times. SIGMOD Conference Red Brick Warehouse: A Read-Mostly RDBMS for Open SMP Platforms. Phillip M. Fernandez 1994 Red Brick Warehouse: A Read-Mostly RDBMS for Open SMP Platforms. SIGMOD Conference Spatial Joins Using Seeded Trees. Ming-Ling Lo,Chinya V. Ravishankar 1994 Existing methods for spatial joins assume the existence of indices for the participating data sets. This assumption is not realistic for applications involving multiple map layer overlays or for queries involving non-spatial selections. In this paper, we explore a spatial join method that dynamically constructs index trees called seeded trees at join time. This method uses knowledge of the data sets involved in the join process. Seeded trees are R-tree-like structures, and are divided into the seed levels and the grown levels. The nodes in the seed levels are used to guide tree growth during tree construction. The seed levels can also be used to filter out some input data during construction, thereby reducing tree size. We develop a technique that uses intermediate linked lists during tree construction and significantly speeds up the tree construction process. The technique allows a large number of random disk accesses during tree construction to be replaced by smaller numbers of sequential accesses. Our performance studies show that spatial joins using seeded trees outperform those using other methods significantly in terms of disk I/O. The CPU penalties incurred are also lower except when seed-level filtering is used. SIGMOD Conference Outerjoins as Disjunctions. César A. Galindo-Legaria 1994 The outerjoin operator is currently available in the query language of several major DBMSs, and it is included in the proposed SQL2 standard draft. However, “associativity problems” of the operator have been pointed out since its introduction. In this paper we propose a shift in the intuition behind outerjoin: Instead of computing the join while also preserving its arguments, outerjoin delivers tuples that come either from the join or from the arguments. Queries with joins and outerjoins deliver tuples that come from one out of several joins, where a single relation is a trivial join.
An advantage of this view is that, in contrast to preservation, disjunction is commutative and associative, which is a significant property for intuition, formalisms, and generation of execution plans. Based on a disjunctive normal form, we show that some data merging queries cannot be evaluated by means of binary outerjoins, and give alternative procedures to evaluate those queries. We also explore several evaluation strategies for outerjoin queries, including the use of semijoin programs to reduce base relations. SIGMOD Conference The Effectiveness of GlOSS for the Text Database Discovery Problem. Luis Gravano,Hector Garcia-Molina,Anthony Tomasic 1994 The popularity of on-line document databases has led to a new problem: finding which text databases (out of many candidate choices) are the most relevant to a user. Identifying the relevant databases for a given query is the text database discovery problem. The first part of this paper presents a practical solution based on estimating the result size of a query and a database. The method is termed GlOSS—Glossary of Servers Server. The second part of this paper evaluates the effectiveness of GlOSS based on a trace of real user queries. In addition, we analyze the storage cost of our approach. SIGMOD Conference Sybase Replication Server. Alex Gorelik,Yongdong Wang,Mark Deppe 1994 Sybase Replication Server. SIGMOD Conference Quickly Generating Billion-Record Synthetic Databases. Jim Gray,Prakash Sundaresan,Susanne Englert,Kenneth Baclawski,Peter J. Weinberger 1994 Evaluating database system performance often requires generating synthetic databases—ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrent to the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors, with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems. SIGMOD Conference Ptool: A Scalable Persistent Object Manager. Robert L. Grossman,Xiao Qin 1994 Ptool: A Scalable Persistent Object Manager. SIGMOD Conference Data Modeling of Time-Based Media. Simon J. Gibbs,Christian Breiteneder,Dennis Tsichritzis 1994 Many aspects of time-based media—complex data encoding, compression, “quality factors,” timing—appear problematic from a data modeling standpoint. This paper proposes timed streams as the basic abstraction for modeling time-based media. Several media-independent structuring mechanisms are introduced and a data model is presented which, rather than leaving the interpretation of multimedia data to applications, addresses the complex organization and relationships present in multimedia. SIGMOD Conference DBLearn: A System Prototype for Knowledge Discovery in Relational Databases.
Jiawei Han,Yongjian Fu,Yue Huang,Yandong Cai,Nick Cercone 1994 A prototyped data mining system, DBLearn, has been developed, which efficiently and effectively extracts different kinds of knowledge rules from relational databases. It has the following features: high level learning interfaces, tightly integrated with commercial relational database systems, automatic refinement of concept hierarchies, efficient discovery algorithms and good performance. Substantial extensions of its knowledge discovery power towards knowledge mining in object-oriented, deductive and spatial databases are under research and development. SIGMOD Conference Practical Predicate Placement. Joseph M. Hellerstein 1994 Recent work in query optimization has addressed the issue of placing expensive predicates in a query plan. In this paper we explore the predicate placement options considered in the Montage DBMS, presenting a family of algorithms that form successively more complex and effective optimization solutions. Through analysis and performance measurements of Montage SQL queries, we classify queries and highlight the simplest solution that will optimize each class correctly. We demonstrate limitations of previously published algorithms, and discuss the challenges and feasibility of implementing the various algorithms in a commercial-grade system. SIGMOD Conference On Parallel Execution of Multiple Pipelined Hash Joins. Hui-I Hsiao,Ming-Syan Chen,Philip S. Yu 1994 In this paper we study parallel execution of multiple pipelined hash joins. Specifically, we deal with two issues, processor allocation and the use of hash filters, to improve parallel execution of hash joins. We first present a scheme to transform a bushy execution tree to an allocation tree, where each node denotes a pipeline. Then, processors are allocated to the nodes in the allocation tree based on the concept of synchronous execution time such that inner relations (i.e., hash tables) in a pipeline can be made available approximately the same time. In addition, the approach of hash filtering is investigated to further improve the overall performance. Performance studies are conducted via simulation to demonstrate the importance of processor allocation and to evaluate various schemes using hash filters. Simulation results indicate that processor allocation based on the allocation tree significantly outperforms that based on the original bushy tree, and that the effect of hash filtering becomes prominent as the number of relations in a query increases. SIGMOD Conference Data Replication for Mobile Computers. Yixiu Huang,A. Prasad Sistla,Ouri Wolfson 1994 Users of mobile computers will soon have online access to a large number of databases via wireless networks. Because of limited bandwidth, wireless communication is more expensive than wire communication. In this paper we present and analyze various static and dynamic data allocation methods. The objective is to optimize the communication cost between a mobile computer and the stationary computer that stores the online database. Analysis is performed in two cost models. One is connection (or time) based, as in cellular telephones, where the user is charged per minute of connection. The other is message based, as in packet radio networks, where the user is charged per message. Our analysis addresses both, the average case and the worst case for determining the best allocation method. SIGMOD Conference The MYRIAD Federated Database Prototype. San-Yih Hwang,Ee-Peng Lim,H.-R. Yang,S. Musukula,K. 
Mediratta,M. Ganesh,Dave Clements,J. Stenoien,Jaideep Srivastava 1994 The MYRIAD Federated Database Prototype. SIGMOD Conference Energy Efficient Indexing on Air. Tomasz Imielinski,S. Viswanathan,B. R. Badrinath 1994 We consider wireless broadcasting of data as a way of disseminating information to a massive number of users. Organizing and accessing information on wireless communication channels is different from the problem of organizing and accessing data on the disk. We describe two methods, (1,m) Indexing and Distributed Indexing, for organizing and accessing broadcast data. We demonstrate that the proposed algorithms lead to significant improvement of battery life, while retaining a low access time. SIGMOD Conference Incomplete Path Expressions and their Disambiguation. Yannis E. Ioannidis,Yezdi Lashkari 1994 When we, humans, talk to each other we have no trouble disambiguating what another person means, although our statements are almost never meticulously specified down to very last detail. We “fill in the gaps” using our common-sense knowledge about the world. We present a powerful mechanism that allows users of object-oriented database systems to specify certain types of ad-hoc queries in a manner closer to the way we pose questions to each other. Specifically, the system accepts as input queries with incomplete, and therefore ambiguous, path expressions. From them, it generates queries with fully-specified path expressions that are consistent with those given as input and capture what the user most likely meant by them. This is achieved by mapping the problem of path expression disambiguation to an optimal path computation (in the transitive closure sense) over a directed graph that represents the schema. Our method works by exploiting the semantics of the kinds of relationships in the schema and requires no special knowledge about the contents of the underlying database, i.e., it is domain independent. In a limited set of experiments with human subjects, the proposed mechanism was very successful in disambiguating incomplete path expressions. SIGMOD Conference Databases for Networks. H. V. Jagadish 1994 Databases for Networks. SIGMOD Conference Optimizing Disjunctive Queries with Expensive Predicates. Alfons Kemper,Guido Moerkotte,Klaus Peithner,Michael Steinbrunn 1994 In this work, we propose and assess a technique called bypass processing for optimizing the evaluation of disjunctive queries with expensive predicates. The technique is particularly useful for optimizing selection predicates that contain terms whose evaluation costs vary tremendously; e.g., the evaluation of a nested subquery or the invocation of a user-defined function in an object-oriented or extended relational model may be orders of magnitude more expensive than an attribute access (and comparison). The idea of bypass processing consists of avoiding the evaluation of such expensive terms whenever the outcome of the entire selection predicate can already be induced by testing other, less expensive terms. In order to validate the viability of bypass evaluation, we extend a previously developed optimizer architecture and incorporate three alternative optimization algorithms for generating bypass processing plans. SIGMOD Conference Distributing a Search Tree Among a Growing Number of Processors. Brigitte Kröll,Peter Widmayer 1994 Databases are growing steadily, and distributed computer systems are more and more easily available. 
This provides an opportunity to satisfy the increasingly tighter efficiency requirements by means of distributed data structures. The design and analysis of these structures under efficiency aspects, however, has not yet been studied sufficiently. To our knowledge, a single scalable, distributed data structure has been proposed so far. It is a distributed variant of linear hashing with uncontrolled splits, and, as a consequence, performs efficiently for data distributions that are close to uniform, but not necessarily for others. In addition, it does not support queries that refer to the linear order of keys, such as nearest neighbor or range queries. We propose a distributed search tree that avoids these problems, since it inherits desirable properties from non-distributed trees. Our experiments show that our structure does indeed combine a guarantee for good storage space utilization with high query efficiency. Nevertheless, we feel that further research in the area of scalable, distributed data structures is dearly needed; it should eventually lead to a body of knowledge that is comparable with the non-distributed, classical data structures field. SIGMOD Conference A Language Based Multidatabase System. eva Kühn,Thomas Tschernko,Konrad Schwarz 1994 A Language Based Multidatabase System. SIGMOD Conference Object-Oriented Extensions in SQL3: A Status Report. Krishna G. Kulkarni 1994 Object-Oriented Extensions in SQL3: A Status Report. SIGMOD Conference Oracle Media Server: Providing Consumer Based Interactive Access to Multimedia Data. Andrew Laursen,Jeffrey Olkin,Mark Porter 1994 Currently, most data accessed on large servers is structured data stored in traditional databases. Networks are LAN based and clients range from simple terminals to powerful workstations. The user is corporate and the application developer is an MIS professional. With the introduction of broadband communications to the home and better than 100-to-1 compression techniques, a new form of network-based computing is emerging. Structured data is still important, but the bulk of data becomes unstructured: audio, video, news feeds, etc. The predominant user becomes the consumer. The predominant client device becomes the television set. The application developer becomes the storyboard developer, director, or the video production engineer. The Oracle Media Server supports access to all types of conventional data stored in Oracle relational and text databases. In addition, we have developed a real-time stream server that supports storage and playback of real-time audio and video data. The Media Server also provides access to data stored in file systems or as binary large objects (images, executables, etc.) The Oracle Media Server provides a platform for distributed client-server computing and access to data over asymmetric real-time networks. A service mechanism allows applications to be split such that client devices (set-top boxes, personal digital assistants, etc.) can focus on presentation, while backend services running in a distributed server complex provide access to data via messaging or lightweight RPC (Remote Procedure Call). SIGMOD Conference COSS: The Common Object Services Specifications. Bruce E. Martin 1994 COSS: The Common Object Services Specifications. SIGMOD Conference Self-Adaptive, On-Line Reclustering of Complex Object Data. William J.
McIver Jr.,Roger King 1994 A likely trend in the development of future CAD, CASE and office information systems will be the use of object-oriented database systems to manage their internal data stores. The entities that these applications will retrieve, such as electronic parts and their connections or customer service records, are typically large complex objects composed of many interconnected heterogeneous objects, not thousands of tuples. These applications may exhibit widely shifting usage patterns due to their interactive mode of operation. Such a class of applications would demand clustering methods that are appropriate for clustering large complex objects and that can adapt on-line to the shifting usage patterns. While most object-oriented clustering methods allow grouping of heterogeneous objects, they are usually static and can only be changed off-line. We present one possible architecture for performing complex object reclustering in an on-line manner that is adaptive to changing usage patterns. Our architecture involves the decomposition of a clustering method into concurrently operating components that each handle one of the fundamental tasks involved in reclustering, namely statistics collection, cluster analysis, and reorganization. We present the results of an experiment performed to evaluate its behavior. These results show that the average miss rate for object accesses can be effectively reduced using a combination of rules that we have developed for deciding when cluster analyses and reorganizations should be performed. SIGMOD Conference "Enterprise Information Architectures -- They're Finally Changing." Wesley P. Melling 1994 "Enterprise Information Architectures -- They're Finally Changing." SIGMOD Conference MOSAICO - A System for Conceptual Modeling and Rapid Prototyping of Object-Oriented Database Application. Michele Missikoff,M. Toiati 1994 MOSAICO - A System for Conceptual Modeling and Rapid Prototyping of Object-Oriented Database Application. SIGMOD Conference A Survey and Critique of Advanced Transaction Models. C. Mohan 1994 A Survey and Critique of Advanced Transaction Models. SIGMOD Conference ARIES/CSA: A Method for Database Recovery in Client-Server Architectures. C. Mohan,Inderpal Narang 1994 This paper presents an algorithm, called ARIES/CSA (Algorithm for Recovery and Isolation Exploiting Semantics for Client-Server Architectures), for performing recovery correctly in client-server (CS) architectures. In CS, the server manages the disk version of the database. The clients, after obtaining database pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. ARIES/CSA supports a write-ahead logging (WAL), fine-granularity (e.g., record) locking, partial rollbacks and flexible buffer management policies like steal and no-force. It does not require that the clocks on the clients and the server be synchronized. Checkpointing by the server and the clients allows for flexible and easier recovery. SIGMOD Conference Implementation of Magic-sets in a Relational Database System. Inderpal Singh Mumick,Hamid Pirahesh 1994 We describe the implementation of the magic-sets transformation in the Starburst extensible relational database system. To our knowledge this is the first implementation of the magic-sets transformation in a relational database system. 
The Starburst implementation has many novel features that make our implementation especially interesting to database practitioners (in addition to database researchers). (1) We use a cost-based heuristic for determining join orders (sips) before applying magic. (2) We push all equality and non-equality predicates using magic, replacing traditional predicate pushdown optimizations. (3) We apply magic to full SQL with duplicates, aggregation, null values, and subqueries. (4) We integrate magic with other relational optimization techniques. (5) The implementation is extensible. Our implementation demonstrates the feasibility of the magic-sets transformation for commercial relational systems, and provides a mechanism to implement magic as an integral part of a new database system, or as an add-on to an existing database system. SIGMOD Conference AlphaSort: A RISC Machine Sort. Chris Nyberg,Tom Barclay,Zarka Cvetanovic,Jim Gray,David B. Lomet 1994 A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads. Using Alpha AXP processors, commodity memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This beats the best published record on a 32-cpu 32-disk Hypercube by 8:1. On another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a cache-sensitive memory-intensive sort algorithm. It uses file striping to get high disk bandwidth. It uses QuickSort to generate runs and uses replacement-selection to merge the runs. It uses shared memory multiprocessors to break the sort into subsort chores. Because startup times are becoming a significant part of the total time, we propose two new benchmarks: (1) Minutesort: how much can you sort in a minute, and (2) DollarSort: how much can you sort for a dollar. SIGMOD Conference Managing Memory for Real-Time Queries. HweeHwa Pang,Michael J. Carey,Miron Livny 1994 "The demanding performance objectives that real-time database systems (RTDBS) face necessitate the use of priority resource scheduling. This paper introduces a Priority Memory Management (PMM) algorithm that is designed to schedule queries in RTDBS. PMM attempts to minimize the number of missed deadlines by adapting both its multiprogramming level and its memory allocation strategy to the characteristics of the offered workload. A series of simulation experiments confirms that PMM's admission control and memory allocation mechanisms are very effective for real-time query scheduling." SIGMOD Conference Object-Oriented Features of DB2 Client/Server. Hamid Pirahesh 1994 Object-Oriented Features of DB2 Client/Server. SIGMOD Conference Sequence Query Processing. Praveen Seshadri,Miron Livny,Raghu Ramakrishnan 1994 Many applications require the ability to manipulate sequences of data. We motivate the importance of sequence query processing, and present a framework for the optimization of sequence queries based on several novel techniques. These include query transformations, optimizations that utilize meta-data, and caching of intermediate results. We present a bottom-up algorithm that generates an efficient query evaluation plan based on cost estimates. This work also identifies a number of directions in which future research can be directed. SIGMOD Conference XSB as an Efficient Deductive Database Engine. Konstantinos F. Sagonas,Terrance Swift,David Scott Warren 1994 "This paper describes the XSB system, and its use as an in-memory deductive database engine.
XSB began from a Prolog foundation, and traditional Prolog systems are known to have serious deficiencies when used as database systems. Accordingly, XSB has a fundamental bottom-up extension, introduced through tabling (or memoing)[4], which makes it appropriate as an underlying query engine for deductive database systems. Because it eliminates redundant computation, the tabling extension makes XSB able to compute all modularly stratified datalog programs finitely and with polynomial data complexity. For non-stratified programs, a meta-interpreter with the same properties is provided. In addition XSB significantly extends and improves the indexing capabilities over those of standard Prolog. Finally, its syntactic basis in HiLog [2], lends it flexibility for data modelling.The implementation of XSB derives from the WAM [25], the most common Prolog engine. XSB inherits the WAM's efficiency and can take advantage of extensive compiler technology developed for Prolog. As a result, performance comparisons indicate that XSB is significantly faster than other deductive database systems for a wide range of queries and stratified rule sets. XSB is under continuous development, and version 1.3 is available through anonymous ftp." SIGMOD Conference XSB as a Deductive Database. Konstantinos F. Sagonas,Terrance Swift,David Scott Warren 1994 XSB as a Deductive Database. SIGMOD Conference Estimating Page Fetches for Index Scans with Finite LRU Buffers. Arun N. Swami,K. Bernhard Schiefer 1994 We describe an algorithm for estimating the number of page fetches for a partial or complete scan of a B-tree index. The algorithm obtains estimates for the number of page fetches for an index scan when given the number of tuples selected and the number of LRU buffers currently available. The algorithm has an initial phase that is performed exactly once before any estimates are calculated. This initial phase, involving LRU buffer modeling, requires a scan of all the index entries and calculates the number of page fetches for different buffer sizes. An approximate empirical model is obtained from this data. Subsequently, an inexpensive estimation procedure is called by the query optimizer whenever it needs an estimate of the page fetches for the index scan. This procedure utilizes the empirical model obtained in the initial phase. SIGMOD Conference Relaxed Transaction Processing. Munindar P. Singh,Christine Tomlinson,Darrell Woelk 1994 Relaxed Transaction Processing. SIGMOD Conference The ORES Temporal Database Management System. Babis Theodoulidis,Aziz Ait-Braham,George Andrianopoulos,Jayant Chaudhary,George Karvelis,Simon Sou 1994 The ORES Temporal Database Management System. SIGMOD Conference Incremental Updates of Inverted Lists for Text Document Retrieval. Anthony Tomasic,Hector Garcia-Molina,Kurt A. Shoens 1994 "With the proliferation of the world's “information highways” a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index. The index dynamically separates long and short inverted lists and optimizes retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering trade-offs which range from optimizing update time to optimizing query performance is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. 
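The XSB entry above credits tabling (memoing) with eliminating redundant computation so that datalog-style programs terminate with polynomial data complexity. The sketch below is only a rough illustration of that idea, not XSB's actual engine: it evaluates the usual reachability program bottom-up, recording every derived fact in a table and re-joining only the newly derived ones; the edge data and function names are invented for the example.

```python
# Rough illustration of the idea behind tabling/memoing in a deductive engine:
# derived facts are recorded in a table so no subgoal is ever recomputed.
# This is a plain semi-naive fixpoint for the classic program
#   reachable(X, Y) :- edge(X, Y).
#   reachable(X, Y) :- reachable(X, Z), edge(Z, Y).
# It is NOT XSB's actual evaluation mechanism; data below is invented.

def reachable(edges):
    """Compute all reachable(x, y) facts from a set of edge(x, y) facts."""
    table = set(edges)          # the "table" of derived facts
    delta = set(edges)          # facts derived in the previous round
    succ = {}                   # adjacency index: z -> set of y with edge(z, y)
    for z, y in edges:
        succ.setdefault(z, set()).add(y)
    while delta:
        new = set()
        for x, z in delta:
            for y in succ.get(z, ()):      # join new facts with edge on the middle variable
                if (x, y) not in table:
                    new.add((x, y))
        table |= new
        delta = new                         # only newly derived facts are re-joined
    return table

if __name__ == "__main__":
    edges = {("a", "b"), ("b", "c"), ("c", "a")}   # a cycle: evaluation still terminates
    print(sorted(reachable(edges)))
```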
We then describe the best algorithm for a variety of criteria." SIGMOD Conference The Montage Extensible DataBlade Achitecture. Michael Ubell 1994 The Montage Extensible DataBlade Achitecture. SIGMOD Conference Database in Crisis and Transition: A Technical Agenda for the Year 2001. David Vaskevitch 1994 The current paper outlines a number of important changes that face the database community and presents an agenda for how some of these challenges can be met. This database agenda is currently being addressed in the Enterprise Group at Microsoft Corporation. The paper concludes with a scenario for 2001 which reflects the Microsoft vision of “Information at your fingertips.” SIGMOD Conference Distributed File Organization with Scalable Cost/Performance. Radek Vingralek,Yuri Breitbart,Gerhard Weikum 1994 This paper presents a distributed file organization for record-structured, disk-resident files with key-based exact-match access. The file is organized into buckets that are spread across multiple servers, where a server may hold multiple buckets. Client requests are serviced by mapping keys onto buckets and looking up the corresponding server in an address table. Dynamic growth in terms of file size and access load is supported by bucket splits and migration onto other existing or newly acquired servers.The significant and challenging problem addressed here is how to achieve scalability so that both the file size and the client throughput can be scaled up by linearly increasing the number of servers and dynamically redistributing data. Unlike previous work with similar objectives, our data redistribution considers explicitly the cost/performance ratio of the system by aiming to minimize the number of servers that are acquired to provide the required performance. A new server is acquired only if the overall server utilization in the system does not drop below a specified threshold. Preliminary simulation results show that the goal of scalability with controlled cost/performance is indeed achieved to a large extent. SIGMOD Conference Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. Jason Tsong-Li Wang,Gung-Wei Chirn,Thomas G. Marr,Bruce A. Shapiro,Dennis Shasha,Kaizhong Zhang 1994 "Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns, etc.) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric.This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representation we consider are strings and the distance metric is string edit distance permitting variable length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today." 
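The Vingralek, Breitbart, and Weikum entry above describes a distributed file in which clients map a key to a bucket and look up the owning server in an address table, with buckets splitting and migrating as the file and load grow. The following is a minimal sketch of that lookup path under invented names and an invented overflow rule; the paper's actual split and server-acquisition decisions are driven by server utilization and are not reproduced here.

```python
# Minimal sketch (not the paper's algorithm) of key -> bucket -> server lookup
# with an address table, plus a simplistic stand-in for bucket migration when
# a bucket overflows. All names, the hash mapping, and the overflow rule are
# invented for illustration.

class DistributedFile:
    def __init__(self, n_buckets=4, capacity=2):
        self.capacity = capacity
        self.buckets = [dict() for _ in range(n_buckets)]                # bucket id -> records
        self.address_table = {b: "server-0" for b in range(n_buckets)}   # bucket -> owning server

    def _bucket_of(self, key):
        return hash(key) % len(self.buckets)

    def lookup(self, key):
        b = self._bucket_of(key)
        server = self.address_table[b]        # client-side lookup of the owning server
        return server, self.buckets[b].get(key)

    def insert(self, key, record):
        b = self._bucket_of(key)
        self.buckets[b][key] = record
        if len(self.buckets[b]) > self.capacity:
            self._migrate(b)                  # overflow: move the bucket elsewhere

    def _migrate(self, b):
        # Stand-in for split/migration: assign the bucket to a newly "acquired"
        # server and record the new owner in the address table.
        self.address_table[b] = f"server-{len(set(self.address_table.values()))}"
```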
SIGMOD Conference QuickStore: A High Performance Mapped Object Store. Seth J. White,David J. DeWitt 1994 QuickStore is a memory-mapped storage system for persistent C++, built on top of the EXODUS Storage Manager. QuickStore provides fast access to in-memory objects by allowing application programs to access objects via normal virtual memory pointers. This article presents the results of a detailed performance study using the OO7 benchmark. The study compares the performance of QuickStore with the latest implementation of the E programming language. The QuickStore and E systems exemplify the two basic approaches (hardware and software) that have been used to implement persistence in object-oriented database systems. In addition, both systems use the same underlying storage manager and compiler, allowing us to make a truly apples-to-apples comparison of the hardware and software techniques. SIGMOD Conference Ensuring Relaxed Atomicity for Flexible Transactions in Multidatabase Systems. Aidong Zhang,Marian H. Nodine,Bharat K. Bhargava,Omran A. Bukhres 1994 Global transaction management requires cooperation from local sites to ensure the consistent and reliable execution of global transactions in a distributed database system. In a heterogeneous distributed database (or multidatabase) environment, various local sites make conflicting assertions of autonomy over the execution of global transactions. A flexible transaction model for the specification of global transactions makes it possible to deal robustly with these conflicting requirements. This paper presents an approach that preserves the semi-atomicity (a weaker form of atomicity) of flexible transactions, allowing local sites to autonomously maintain serializability and recoverability. We offer a fundamental characterization of the flexible transaction model and precisely define the semi-atomicity. We investigate the commit dependencies among the subtransactions of a flexible transaction. These dependencies are used to control the commitment order of the subtransactions. We next identify those restrictions that must be placed upon a flexible transaction to ensure the maintenance of its semi-atomicity. As atomicity is a restrictive criterion, semi-atomicity enhances the class of executable global transactions. VLDB Supporting Exceptions to Schema Consistency to Ease Schema Evolution in OODBMS. Eric Amiel,Marie-Jo Bellosta,Eric Dujardin,Eric Simon 1994 Supporting Exceptions to Schema Consistency to Ease Schema Evolution in OODBMS. VLDB An Empirical Performance Study of the Ingres Search Accelerator for a Large Property Management Database System. Sarabjot S. Anand,David A. Bell,John G. Hughes 1994 An Empirical Performance Study of the Ingres Search Accelerator for a Large Property Management Database System. VLDB Fast Algorithms for Mining Association Rules in Large Databases. Rakesh Agrawal,Ramakrishnan Srikant 1994 Fast Algorithms for Mining Association Rules in Large Databases. VLDB A Transaction Replication Scheme for a Replicated Database with Node Autonomy. Ada Wai-Chee Fu,David Wai-Lok Cheung 1994 A Transaction Replication Scheme for a Replicated Database with Node Autonomy. VLDB An Algebraic Approach to Rule Analysis in Expert Database Systems. Elena Baralis,Jennifer Widom 1994 "Expert database systems extend the functionality of conventional database systems by providing a facility for creating and automatically executing Condition-Action rules. 
While Condition-Action rules in database systems are very powerful, they also can be very difficult to program, due to the unstructured and unpredictable nature of rule processing. We provide methods for static analysis of Condition-Action rules; our methods determine whether a given rule set is guaranteed to terminate, and whether rule execution is confluent (has a guaranteed unique final state). Our methods are based on previous methods for analyzing rules in active database systems. We improve considerably on the previous methods by providing analysis criteria that are much less conservative: our methods often determine that a rule set will terminate or is confluent when previous methods could not. Our improved analysis is based on a ``propagation'''' algorithm, which uses a formal approach based on an extended relational algebra to accurately determine when the action of one rule can affect the condition of another. Our algebraic approach yields methods that are applicable to a broad class of expert database rule languages." VLDB An Effective Deductive Object-Oriented Database Through Language Integration. Maria L. Barja,Norman W. Paton,Alvaro A. A. Fernandes,M. Howard Williams,Andrew Dinn 1994 An Effective Deductive Object-Oriented Database Through Language Integration. VLDB An Overview of Repository Technology. Philip A. Bernstein,Umeshwar Dayal 1994 An Overview of Repository Technology. VLDB PC Database Systems - Present and Future. Philip A. Bernstein 1994 PC Database Systems - Present and Future. VLDB Semantic Integration in Heterogeneous Databases Using Neural Networks. Wen-Syan Li,Chris Clifton 1994 Semantic Integration in Heterogeneous Databases Using Neural Networks. VLDB The Impact of Global Clustering on Spatial Database Systems. Thomas Brinkhoff,Hans-Peter Kriegel 1994 The Impact of Global Clustering on Spatial Database Systems. VLDB Fast Incremental Indexing for Full-Text Information Retrieval. Eric W. Brown,James P. Callan,W. Bruce Croft 1994 Fast Incremental Indexing for Full-Text Information Retrieval. VLDB Towards Automated Performance Tuning for Complex Workloads. Kurt P. Brown,Manish Mehta,Michael J. Carey,Miron Livny 1994 Towards Automated Performance Tuning for Complex Workloads. VLDB Efficient and Effective Clustering Methods for Spatial Data Mining. Raymond T. Ng,Jiawei Han 1994 Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient. VLDB Maximizing Buffer and Disk Utilizations for News On-Demand. Raymond T. Ng,Jinhai Yang 1994 Maximizing Buffer and Disk Utilizations for News On-Demand. VLDB Composite Events for Active Databases: Semantics, Contexts and Detection. Sharma Chakravarthy,V. Krishnaprasad,Eman Anwar,S.-K. Kim 1994 Composite Events for Active Databases: Semantics, Contexts and Detection. VLDB Content-Based Image Indexing. 
Tzi-cker Chiueh 1994 Content-Based Image Indexing. VLDB Including Group-By in Query Optimization. Surajit Chaudhuri,Kyuseok Shim 1994 Including Group-By in Query Optimization. VLDB On Index Selection Schemes for Nested Object Hierarchies. Sudarshan S. Chawathe,Ming-Syan Chen,Philip S. Yu 1994 On Index Selection Schemes for Nested Object Hierarchies. VLDB A Multidatabase System for Tracking and Retrieval of Financial Data. Munir Cochinwala,John Bradley 1994 A Multidatabase System for Tracking and Retrieval of Financial Data. VLDB NAOS - Efficient and Modular Reactive Capabilities in an Object-Oriented Database System. Christine Collet,Thierry Coupaye,T. Svensen 1994 NAOS - Efficient and Modular Reactive Capabilities in an Object-Oriented Database System. VLDB Memory-Contention Responsive Hash Joins. Diane L. Davison,Goetz Graefe 1994 Memory-Contention Responsive Hash Joins. VLDB Client-Server Paradise. David J. DeWitt,Navin Kabra,Jun Luo,Jignesh M. Patel,Jie-Bing Yu 1994 Client-Server Paradise. VLDB Implementing Lazy Database Updates for an Object Database System. Fabrizio Ferrandina,Thorsten Meyer,Roberto Zicari 1994 Implementing Lazy Database Updates for an Object Database System. VLDB OdeFS: A File System Interface to an Object-Oriented Database. Narain H. Gehani,H. V. Jagadish,William D. Roome 1994 OdeFS: A File System Interface to an Object-Oriented Database. VLDB Access to Objects by Path Expressions and Rules. Jürgen Frohn,Georg Lausen,Heinz Uphoff 1994 Object oriented databases provide rich structuring capabilities to organize the objects being relevant for a given application. Due to the possible complexity of object structures, path expressions have become accepted as a concise syntactical means to reference objects. Even though known approaches to path expressions provide quite elegant access to objects, there seems to be still a need for more generality. To this end, the rule-language PathLog is introduced. A first contribution of PathLog is to add a second dimension to path expressions in order to increase conciseness. In addition, a path expression can also be used to reference virtual objects. Both enhancements give rise to interesting semantic implications. VLDB Qualified Answers That Reflect User Needs and Preferences. Terry Gaasterland,Jorge Lobo 1994 Qualified Answers That Reflect User Needs and Preferences. VLDB Fast, Randomized Join-Order Selection - Why Use Transformations? César A. Galindo-Legaria,Arjan Pellenkoft,Martin L. Kersten 1994 Fast, Randomized Join-Order Selection - Why Use Transformations? VLDB Building a Laboratory Information System Around a C++-Based Object-Oriented DBMS. Nathan Goodman,Steve Rozen,Lincoln Stein 1994 Building a Laboratory Information System Around a C++-Based Object-Oriented DBMS. VLDB Database Graph Views: A Practical Model to Manage Persistent Graphs. Alejandro Gutiérrez,Philippe Pucheral,Hermann Steffen,Jean-Marc Thévenin 1994 Database Graph Views: A Practical Model to Manage Persistent Graphs. VLDB GraphDB: Modeling and Querying Graphs in Databases. Ralf Hartmut Güting 1994 GraphDB: Modeling and Querying Graphs in Databases. VLDB Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. Waqar Hasan,Rajeev Motwani 1994 Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. VLDB Modelling and Querying Video Data. Rune Hjelsvold,Roger Midtstraum 1994 Modelling and Querying Video Data. VLDB Performance of Data-Parallel Spatial Operations. 
Erik G. Hoel,Hanan Samet 1994 Performance of Data-Parallel Spatial Operations. VLDB Providing Dynamic Security Control in a Federated Database. Norbik Bashah Idris,W. A. Gray,R. F. Churchhouse 1994 Providing Dynamic Security Control in a Federated Database. VLDB Data Compression Support in Databases. Balakrishna R. Iyer,David Wilhite 1994 Data Compression Support in Databases. VLDB Dalí: A High Performance Main Memory Storage Manager. H. V. Jagadish,Daniel F. Lieuwen,Rajeev Rastogi,Abraham Silberschatz,S. Sudarshan 1994 Dalí: A High Performance Main Memory Storage Manager. VLDB 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. Theodore Johnson,Dennis Shasha 1994 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. VLDB An Approach for Building Secure Database Federations. Dirk Jonscher,Klaus R. Dittrich 1994 An Approach for Building Secure Database Federations. VLDB Hilbert R-tree: An Improved R-tree using Fractals. Ibrahim Kamel,Christos Faloutsos 1994 Hilbert R-tree: An Improved R-tree using Fractals. VLDB Dual-Buffering Strategies in Object Bases. Alfons Kemper,Donald Kossmann 1994 Dual-Buffering Strategies in Object Bases. VLDB Indexing Multiple Sets. Christoph Kilger,Guido Moerkotte 1994 Indexing Multiple Sets. VLDB Experiments on Access to Digital Libraries: How can Images and Text be Used Together. Michael Lesk 1994 Experiments on Access to Digital Libraries: How can Images and Text be Used Together. VLDB Query Optimization by Predicate Move-Around. Alon Y. Levy,Inderpal Singh Mumick,Yehoshua Sagiv 1994 Query Optimization by Predicate Move-Around. VLDB Challenges for Global Information Systems. Alon Y. Levy,Abraham Silberschatz,Divesh Srivastava,Maria Zemankova 1994 Challenges for Global Information Systems. VLDB RP*: A Family of Order Preserving Scalable Distributed Data Structures. Witold Litwin,Marie-Anne Neimat,Donovan A. Schneider 1994 RP*: A Family of Order Preserving Scalable Distributed Data Structures. VLDB On Spatially Partitioned Temporal Join. Hongjun Lu,Beng Chin Ooi,Kian-Lee Tan 1994 On Spatially Partitioned Temporal Join. VLDB Relating Distributed Objects. Bruce E. Martin,R. G. G. Cattell 1994 Relating Distributed Objects. VLDB Persistent Threads. Florian Matthes,Joachim W. Schmidt 1994 Persistent Threads. VLDB V-Trees - A Storage Method for Long Vector Data. Maurício R. Mediano,Marco A. Casanova,Marcelo Dreux 1994 V-Trees - A Storage Method for Long Vector Data. VLDB Some Issues in Design of Distributed Deductive Databases. Mukesh K. Mohania,Nandlal L. Sarda 1994 Some Issues in Design of Distributed Deductive Databases. VLDB A Requirement-Based Approach to Data Modeling and Re-Engineering. Alice H. Muntz,Christian T. Ramiller 1994 A Requirement-Based Approach to Data Modeling and Re-Engineering. VLDB A Top-Down Approach for Two Level Serializability. Mourad Ouzzani,M. A. Atroun,N. L. Belkhodja 1994 A Top-Down Approach for Two Level Serializability. VLDB A Low-Cost Storage Server for Movie on Demand Databases. Banu Özden,Alexandros Biliris,Rajeev Rastogi,Abraham Silberschatz 1994 A Low-Cost Storage Server for Movie on Demand Databases. VLDB Materialization: A Powerful and Ubiquitous Abstraction Pattern. Alain Pirotte,Esteban Zimányi,David Massart,Tatiana Yakusheva 1994 Materialization: A Powerful and Ubiquitous Abstraction Pattern. VLDB Investigation of Algebraic Query Optimisation Techniques for Database Programming Languages. 
Alexandra Poulovassilis,Carol Small 1994 Investigation of Algebraic Query Optimisation Techniques for Database Programming Languages. VLDB Data Integration in the Large: The Challenge of Reuse. Arnon Rosenthal,Leonard J. Seligman 1994 Data Integration in the Large: The Challenge of Reuse. VLDB New Concurrency Control Algorithms for Accessing and Compacting B-Trees. V. W. Setzer,Andrea Zisman 1994 New Concurrency Control Algorithms for Accessing and Compacting B-Trees. VLDB Cache Conscious Algorithms for Relational Query Processing. Ambuj Shatdal,Chander Kant,Jeffrey F. Naughton 1994 Cache Conscious Algorithms for Relational Query Processing. VLDB Reasoning About Spatial Relationships in Picture Retrieval Systems. A. Prasad Sistla,Clement T. Yu,R. Haddad 1994 Reasoning About Spatial Relationships in Picture Retrieval Systems. VLDB User Interfaces; Who Cares? Stefano Spaccapietra 1994 User Interfaces; Who Cares? VLDB The hcC-tree: An Efficient Index Structure for Object Oriented Databases. B. Sreenath,S. Seshadri 1994 The hcC-tree: An Efficient Index Structure for Object Oriented Databases. VLDB Cumulative Updates. Suryanarayana M. Sripada,Beat Wüthrich 1994 Cumulative Updates. VLDB From Nested-Loop to Join Queries in OODB. Hennie J. Steenhagen,Peter M. G. Apers,Henk M. Blanken,Rolf A. de By 1994 From Nested-Loop to Join Queries in OODB. VLDB Towards Event-Driven Modelling for Database Design. Maguelonne Teisseire,Pascal Poncelet,Rosine Cicchetti 1994 Towards Event-Driven Modelling for Database Design. VLDB The GMAP: A Versatile Tool for Physical Data Independence. Odysseas G. Tsatalos,Marvin H. Solomon,Yannis E. Ioannidis 1994 Physical data independence is touted as a central feature of modern database systems. It allows users to frame queries in terms of the logical structure of the data, letting a query processor automatically translate them into optimal plans that access physical storage structures. Both relational and object-oriented systems, however, force users to frame their queries in terms of a logical schema that is directly tied to physical structures. We present an approach that eliminates this dependence. All storage structures are defined in a declarative language based on relational algebra as functions of a logical schema. We present an algorithm, integrated with a conventional query optimizer, that translates queries over this logical schema into plans that access the storage structures. We also show how to compile update requests into plans that update all relevant storage structures consistently and optimally. Finally, we report on experiments with a prototype implementation of our approach that demonstrate how it allows storage structures to be tuned to the expected or observed workload to achieve significantly better performance than is possible with conventional techniques. VLDB Bulk Loading into an OODB: A Performance Study. Janet L. Wiener,Jeffrey F. Naughton 1994 Bulk Loading into an OODB: A Performance Study. VLDB Join Index Hierarchies for Supporting Efficient Navigations in Object-Oriented Databases. Zhaohui Xie,Jiawei Han 1994 Join Index Hierarchies for Supporting Efficient Navigations in Object-Oriented Databases. VLDB Integrating a Structured-Text Retrieval System with an Object-Oriented Database System. Tak W. Yan,Jurgen Annevelink 1994 Integrating a Structured-Text Retrieval System with an Object-Oriented Database System. VLDB Scientific Databases - State of the Art and Future Directions. Maria Zemankova,Yannis E. 
Ioannidis 1994 Scientific Databases - State of the Art and Future Directions. SIGMOD Record Loading Databases Using Dataflow Parallelism. Tom Barclay,Robert Barnes,Jim Gray,Prakash Sundaresan 1994 "This paper describes a parallel database load prototype for Digital's Rdb database product. The prototype takes a dataflow approach to database parallelism. It includes an explorer that discovers and records the cluster configuration in a database, a client CUI interface that gathers the load job description from the user and from the Rdb catalogs, and an optimizer that picks the best parallel execution plan and records it in a web data structure. The web describes the data operators, the dataflow rivers among them, the binding of operators to processes, processes to processors, and files to discs and tapes. This paper describes the optimizer's cost-based hierarchical optimization strategy in some detail. The prototype executes the web's plan by spawning a web manager process at each node of the cluster. The managers create the local executor processes, and orchestrate startup, phasing, checkpoint, and shutdown. The execution processes perform one or more operators. Data flows among the operators are via memory-to-memory streams within a node, and via web-manager multiplexed tcp/ip streams among nodes. The design of the transaction and checkpoint/restart mechanisms are also described. Preliminary measurements indicate that this design will give excellent scaleups." SIGMOD Record Data Modelling in the Large. Martin Bertram 1994 Data Modelling in the Large. SIGMOD Record Trade Press News. Rafael Alonso 1994 Trade Press News. SIGMOD Record Trade Press News. Rafael Alonso 1994 Trade Press News. SIGMOD Record Trade Press News - Announcement and Preface. Rafael Alonso 1994 Trade Press News - Announcement and Preface. SIGMOD Record SEQUOIA 2000 Metadata Schema for Satellite Images. Jean T. Anderson,Michael Stonebraker 1994 Sequoia 2000 schema development is based on emerging geospatial standards to accelerate development and facilitate data exchange. This paper focuses on the metadata schema for digital satellite images. We examine how satellite metadata are defined, used, and maintained. We discuss the geospatial standards we are using, and describe a SQL prototype that is based on the Spatial Archive and Interchange Format (SAIF) standard and implemented in the Illustra object-relational database. SIGMOD Record The Impact of Database Research on Industrial Products (Panel Summary). Daniel Barbará,José A. Blakeley,Daniel H. Fishman,David B. Lomet,Michael Stonebraker 1994 The Impact of Database Research on Industrial Products (Panel Summary). SIGMOD Record Metadata for Multimedia Documents. Klemens Böhm,Thomas C. Rakow 1994 In this article metadata for mulimedia documents are classified in conformity with their nature, and the different kinds of metadata are brought into relation with the different purposes intended. We describe how metadata may be organized in accordance with the ISO standards SGML, which facilitates the handling of structured documents, and DFR, which supports the storage of collections of documents. Finally, we outline the impact of our observations on future developments. SIGMOD Record Comprehension Syntax. 
Peter Buneman,Leonid Libkin,Dan Suciu,Val Tannen,Limsoon Wong 1994 "The syntax of comprehensions is very close to the syntax of a number of practical database query languages and is, we believe, a better starting point than first-order logic for the development of database languages. We give an informal account of a language based on comprehension syntax that deals uniformly with a variety of collection types; it also includes pattern matching, variant types and function definition. We show, again informally, how comprehension syntax is a natural fragment of structural recursion, a much more powerful programming paradigm for collection types. We also show that a very small ""abstract syntax language"" can serve as a basis for the implementation and optimization of comprehension syntax." SIGMOD Record Database Research at NTHU and ITRI. Arbee L. P. Chen 1994 Database Research at NTHU and ITRI. SIGMOD Record Metadata for Mixed-Media Access. Francine Chen,Marti A. Hearst,Julian Kupiec,Jan O. Pedersen,Lynn Wilcox 1994 In this paper, we discuss mixed-media access, an information access paradigm for multimedia data in which the media type of a query may differ from that of the data. The types of media considered in this paper are speech, images of text, and full-length text. Some examples of metadata for mixed-media access are locations of keywords in speech and images, identification of speakers, locations of emphasized regions in speech, and locations of topic boundaries in text. Algorithms for automatically generating this metadata are described, including word spotting, speaker segmentation, emphatic speech detection, and subtopic boundary location. We illustrate queries composed of diverse media types in an example of access to recorded meetings, via speaker and keyword location. SIGMOD Record Research Issues in Databases for ARCS: Active Rapidly Changing Data Systems. Anindya Datta 1994 We identify an emergent class of database systems that has not been dealt with extensively in the literature that we call ARCS (Active, Rapidly Changing data Systems) databases. These systems impose certain unique requirements on databases that monitor and control them. These requirements are such that traditional data and transaction management models appear inadequate. We present an analysis of data and transaction characteristics in ARCS systems and identify relevant research issues. SIGMOD Record Supporting Dynamic Displays Using Active Rules. Oscar Díaz,Arturo Jaime,Norman W. Paton,Ghassan al-Qaimari 1994 In a graphical interface which is used to display database objects, dynamic displays are updated automatically as modifications occur to the database objects being visualised. Approaches based on enlarging either the database system or the interface code to provide the appropriate communication, complicates the interaction between the two systems, as well as making later updates cumbersome. In this paper, an approach based on active rules is presented. The declarative and modular description of active rules enables active displays to be supported with minimal changes to the database or its graphical interface. Although this approach has been used to support the link between a database system and its graphical interface, it can easily be adapted to support dynamic interaction between an active database system and other external systems. SIGMOD Record Research Perspectives for Time Series Management Systems. 
Werner Dreyer,Angelika Kotz Dittrich,Duri Schmidt 1994 Empirical research based on time series is a data-intensive activity that needs a database management system (DBMS). We investigate the special properties a time series management system (TSMS) should have. We then show that currently available solutions and related research directions are not well suited to handle the existing problems. Therefore, we propose the development of a special purpose TSMS, which will offer particular modeling, retrieval, and computation capabilities. It will be suitable for end users, offer direct manipulation interfaces, and allow data exchange with a variety of data sources, including other databases and application packages. We intend to build such a system on top of an off-the-shelf object-oriented DBMS. SIGMOD Record Influencing Database Language Standards. Leonard Gallagher 1994 In this first article of the regular column on database standardization activities, I give an overview of topic areas under active development in the formal national and international standardization bodies. I solicit contributions on these active topics so that standardizers and researchers can cooperate in the near term, before irreversible decisions are made, to produce the most useful and highest quality database standards. SIGMOD Record Constructing the Next 100 Database Management Systems. Andreas Geppert,Klaus R. Dittrich 1994 Constructing the Next 100 Database Management Systems. SIGMOD Record Metadata for Integrating Speech Documents in a Text Retrieval System. Ulrike Glavitsch,Peter Schäuble,Martin Wechsler 1994 "We present an information retrieval system that allows users to search simultaneously for text and speech documents. The retrieval system accepts vague queries and performs a best-match search to find those documents that are relevant to the query. The output of the retrieval system is a list of ranked documents where the documents at the top of the list best satisfy the user's information need. The relevance of the documents is estimated by means of metadata (document description vectors). The metadata is automatically generated and it is organized such that queries can be processed efficiently. We introduce a controlled indexing vocabulary for both speech and text documents. The size of the new indexing vocabulary is small (1000 features) compared with the sizes of indexing vocabularies of conventional text retrieval (10000 - 100000 features). We show that the retrieval effectiveness based on such a small indexing vocabulary is similar to the retrieval effectiveness of a Boolean retrieval system." SIGMOD Record Using Metadata for the Intelligent Browsing of Structured Media Objects. William I. Grosky,Farshad Fotouhi,Ishwar K. Sethi 1994 Using Metadata for the Intelligent Browsing of Structured Media Objects. SIGMOD Record Response to the March 1994 ODMG-93 Commentary Written by Dr. Won Kim of UniSQL, Inc. Object Database Management Group 1994 Response to the March 1994 ODMG-93 Commentary Written by Dr. Won Kim of UniSQL, Inc. SIGMOD Record Metadata in Video Databases. Ramesh Jain,Arun Hampapur 1994 Video is composed of audio-visual information. Providing content-based access to video data is essential for the successful integration of video into computers. Organizing video for content-based access requires the use of video metadata. This paper explores the nature of video metadata.
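The Glavitsch, Schäuble, and Wechsler entry above ranks text and speech documents by comparing automatically generated description vectors against the query. As a loose illustration only (not the paper's system or its controlled vocabulary), the sketch below ranks documents by cosine similarity between small hand-made feature vectors; all weights and identifiers are invented.

```python
# Toy best-match ranking over invented document description vectors.
# This is generic vector-space retrieval, shown only to illustrate the idea of
# ranked, best-match search; it is not the system described in the entry.
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query_vec, doc_vecs, k=3):
    """Return the top-k (doc id, score) pairs, best first."""
    ranked = sorted(((d, cosine(query_vec, v)) for d, v in doc_vecs.items()),
                    key=lambda x: x[1], reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    docs = {"text-1": {"bank": 0.8, "loan": 0.5},
            "speech-7": {"bank": 0.3, "river": 0.9}}
    print(best_match({"bank": 1.0, "loan": 1.0}, docs))
```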
A data model for video databases is presented based on a study of the applications of video, the nature of video retrieval requests, and the features of video. The data model is used in the architectural framework of a video database. The current state of technology in video databases is summarized and research issues are highlighted. SIGMOD Record A Consensus Glossary of Temporal Database Concepts. Christian S. Jensen,James Clifford,Ramez Elmasri,Shashi K. Gadia,Patrick J. Hayes,Sushil Jajodia 1994 A Consensus Glossary of Temporal Database Concepts. SIGMOD Record Observations on the ODMG-93 Proposal. Won Kim 1994 Observations on the ODMG-93 Proposal. SIGMOD Record A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning. Yasushi Kiyoki,Takashi Kitagawa,Takanari Hayama 1994 "In the design of multimedia database systems, one of the most important issues is to extract images dynamically according to the user's impression and the image's contents. In this paper, we present a metadatabase system which realizes the semantic associative search for images by giving keywords representing the user's impression and the image's contents.This metadatabase system provides several functions for performing the semantic associative search for images by using the metadata representing the features of images. These functions are realized by using our proposed mathematical model of meaning. The mathematical model of meaning is extended to compute specific meanings of keywords which are used for retrieving images unambiguously and dynamically. The main feature of this model is that the semantic associative search is performed in the orthogonal semantic space. This space is created for dynamically computing semantic equivalence or similarity between the metadata items of the images and keywords." SIGMOD Record Metadata for Digital Media: Introduction to the Special Issue. Wolfgang Klas,Amit P. Sheth 1994 Metadata for Digital Media: Introduction to the Special Issue. SIGMOD Record How to Modify SQL Queries in Order to Guarantee Sure Answers. Hans-Joachim Klein 1994 Some problems connected with the handling of null values in SQL are discussed. A definition of sure answers to SQL queries is proposed which takes care of the “no information” meaning of null values in SQL. An algorithm is presented for modifying SQL queries such that answers are not changed for databases without null values but sure answers are obtained for arbitrary databases with standard SQL semantics. SIGMOD Record Text Databases: A Survey of Text Models and Systems. Arjan Loeffen 1994 Text models focus on the manipulation of textual data. They describe texts by their structure, operations on the texts, and constraints on both structure and operations. In this article common characteristics of machine readable texts in general are outlined. Subsequently, ten text models are introduced. They are described in terms of the datatypes that they support, and the operations defined by these datatypes. Finally, the models are compared. SIGMOD Record Recent Design Trade-offs in SQL3. Nelson Mendonça Mattos,Linda G. DeMichiel 1994 Recent Design Trade-offs in SQL3. SIGMOD Record Databases for GIS. Claudia Bauzer Medeiros,Fatima Pires 1994 Databases for GIS. SIGMOD Record The Database Research Group at ETH Zurich. Moira C. Norrie,Stephen Blott,Hans-Jörg Schek,Gerhard Weikum 1994 The Database Research Group at ETH Zurich. SIGMOD Record Towards an Infrastructure for Temporal Databases: Report of an Invitational ARPA/NSF Workshop. 
Niki Pissinou,Richard T. Snodgrass,Ramez Elmasri,Inderpal Singh Mumick,M. Tamer Özsu,Barbara Pernici,Arie Segev,Babis Theodoulidis,Umeshwar Dayal 1994 Towards an Infrastructure for Temporal Databases: Report of an Invitational ARPA/NSF Workshop. SIGMOD Record Jumping on the NII Bandwagon. Xiaolei Qian 1994 Many requests for proposals have been issued since the last issue of this column appeared six months ago. We first briefly touch upon some recent developments along the policy/legislation front concerning NSF, ARPA, and HPCC. We then recap the recent requests for proposals from ARPA, NSF, Air Force, NASA, and Army. SIGMOD Record Announcements from NSF, NASA, and Elsewhere. Xiaolei Qian 1994 Announcements from NSF, NASA, and Elsewhere. SIGMOD Record Medical Information Systems: Characterization and Challenges. Jorge C. G. Ramirez,Lon A. Smith,Lynn L. Peterson 1994 This paper examines the characteristics and challenges presented by medical databases and medical information systems. It begins with a survey of medical databases/information systems. This is followed by a list of challenges for database management systems generated by the needs of these systems. It concludes with a look at some systems which address these challenges. In the context of this background information, the database community is asked to consider whether the results of database research are reaching those who are making day-to-day decisions regarding design and implementation of medical information systems. SIGMOD Record Unix RDBMS: The Next Generation. Bill Rosneblatt 1994 Unix RDBMS: The Next Generation. SIGMOD Record "Editor's Notes." Arie Segev 1994 "Editor's Notes." SIGMOD Record "Editor's Notes." Arie Segev 1994 "Editor's Notes." SIGMOD Record "Editor's Notes and Erratum." Arie Segev 1994 "Editor's Notes and Erratum." SIGMOD Record A New Join Algorithm. Dong Keun Shin,Arnold Charles Meltzer 1994 This paper introduces a new efficient join algorithm to increase the speed of the join relational operation. Using a divide and conquer strategy, stack oriented filter technique in the new join algorithm filters unwanted tuples as early as possible while none of the currently existing join algorithms takes advantage of any filtering concept. Other join algorithms may carry the unnecessary tuples up to the last moment of join attribute comparisons.Four join algorithms are described and discussed in this paper: the nested-loop join algorithm, the sort-merge join algorithm, the hash join algorithm, and the new join algorithm. SIGMOD Record In Memory of Bob Kooi (1951-1993). Michael Stonebraker 1994 In Memory of Bob Kooi (1951-1993). SIGMOD Record Overview of the Special Section on Temporal Database Infrastructure. Richard T. Snodgrass 1994 Overview of the Special Section on Temporal Database Infrastructure. SIGMOD Record TSQL2 Language Specification. Richard T. Snodgrass,Ilsoo Ahn,Gad Ariav,Don S. Batory,James Clifford,Curtis E. Dyreson,Ramez Elmasri,Fabio Grandi,Christian S. Jensen,Wolfgang Käfer,Nick Kline,Krishna G. Kulkarni,T. Y. Cliff Leung,Nikos A. Lorentzos,John F. Roddick,Arie Segev,Michael D. Soo,Suryanarayana M. Sripada 1994 TSQL2 Language Specification. SIGMOD Record A TSQL2 Tutorial. Richard T. Snodgrass,Ilsoo Ahn,Gad Ariav,Don S. Batory,James Clifford,Curtis E. Dyreson,Ramez Elmasri,Fabio Grandi,Christian S. Jensen,Wolfgang Käfer,Nick Kline,Krishna G. Kulkarni,T. Y. Cliff Leung,Nikos A. Lorentzos,John F. Roddick,Arie Segev,Michael D. Soo,Suryanarayana M. Sripada 1994 A TSQL2 Tutorial. 
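The Shin and Meltzer entry above compares its new join against the classic nested-loop, sort-merge, and hash join algorithms. For readers unfamiliar with that baseline, here is a minimal textbook-style in-memory hash join; it is a generic illustration, not the paper's stack-oriented filtering join, and the relations shown are invented.

```python
# Textbook in-memory hash join: build a hash table on the (presumably smaller)
# input's join attribute, then probe it with the other input. Shown only as
# background for the entry above; not the new algorithm it proposes.

def hash_join(r, s, r_key, s_key):
    """Join lists of dicts r and s on r[r_key] == s[s_key]."""
    index = {}
    for tr in r:                                   # build phase
        index.setdefault(tr[r_key], []).append(tr)
    out = []
    for ts in s:                                   # probe phase
        for tr in index.get(ts[s_key], ()):
            out.append({**tr, **ts})
    return out

if __name__ == "__main__":
    emp = [{"emp": "ann", "dept": 1}, {"emp": "bob", "dept": 2}]
    dept = [{"dept": 1, "name": "db"}, {"dept": 2, "name": "os"}]
    print(hash_join(emp, dept, "dept", "dept"))
```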
SIGMOD Record "Are the Terms ""Version"" and ""Variant"" Orthogonal to One Another? A Critical Assessment of the STEP Standardization." Hartmut Wedekind 1994 "Are the Terms ""Version"" and ""Variant"" Orthogonal to One Another? A Critical Assessment of the STEP Standardization." SIGMOD Record "Research Issues in Active Database Systems: Report from the Closing Panel at RIDE-ADS '94." Jennifer Widom 1994 The discussions during the panel stayed largely but not entirely focused on the question of active database research issues from the application perspective. There were nine panelists. Each panelist was asked to prepare brief answers to a set of questions. The sets of answers were discussed by all participants, and finally a number of more general issues were discussed. The questions asked of the panelists were: Name an application that will certainly be supported by active database systems in the not-too-distant future. Name an application that will certainly not be supported by active database systems in the near future. Name an area of active database systems in which you are not working but that is crucial to meet the needs of applications. Name an area of active database systems that is not on the critical path to supporting applications. Name an area of active database systems that should have been discussed in the course of the workshop but was not. SIGMOD Record Progress on HPCC and NII. Marianne Winslett 1994 In this issue we briefly touch on the continuing turmoil over NSF, ARPA, and HPCC, and the brighter news regarding the US National Information Infrastructure plan. We then describe funding opportunities from NSF, ARPA, the National Security Agency, the National Center for Automated Information Research, the Air Force, and NASA. SIGMOD Record The TSQL2 Final Language Definition Announcement. 1994 The TSQL2 Final Language Definition Announcement. SIGMOD Record Calls for Papers / Announcements. 1994 Calls for Papers / Announcements. SIGMOD Record Calls for Papers / Announcements. 1994 Calls for Papers / Announcements. SIGMOD Record A Hypertext Query Language for Images. Li Yang 1994 HYPERQUERY is a hypertext query language for object-oriented pictorial database systems. First, we discuss object calculus based on term rewriting. Then, example queries are used to illustrate language facilities. This query language has been designed with a flavor similar to QBE as the highly nonprocedural and conversational language for object-oriented pictorial database management system OISDBS. SIGMOD Record Performance Evaluation of a New Distributed Deadlock Detection Algorithm. Chim-fu Yeung,Sheung-lun Hung,Kam-yiu Lam 1994 "In this paper, a new probe-based distributed deadlock detection algorithm is proposed. It is an enhanced version of the algorithm originally proposed by Chandy's et al. [5,6]. The new algorithm has proven to be error free and suffers very little performance degradation from the additional deadlock detection overhead. The algorithm has been compared with the modified probe-based and timeout methods. It is found that under high data contention, it has the best performance. Results also indicate that the rate of probe initiation is significantly reduced in the new algorithm." ICDE Semantic Query Optimization for Methods in Object-Oriented Database Systems. 
Karl Aberer,Gisela Fischer 1995 "Although the main difference between the relational and the object-oriented data model is the possibility to define object behavior, query optimization techniques in object-oriented database systems are mainly based on the structural part of objects. We claim that the optimization potential emerging from methods has been strongly underestimated so far. In this paper we concentrate on the question of how semantic knowledge about methods can be considered in query optimization. We rely on the algebraic and rule-based approach for query optimization and present a framework that allows to integrate schema-specific knowledge by tailoring the query optimizer according to the particular application's needs. We sketch an implementation of our concepts within the OODBMS VODAK using the Volcano optimizer generator." ICDE A Uniform Framework for Integrating Knowledge in Heterogeneous Knowledge Systems. Sibel Adali,Ross Emery 1995 Integrating knowledge from multiple sources is an important aspect of automated reasoning systems. Wiederhold and his colleagues (1993) have proposed the concept of a mediator-a device that will express how such an integration is to be achieved. In (1994) Subrahmanian et al. presented a uniform declarative and operational framework for mediators for amalgamating multiple knowledge bases and data structures (e.g. relational, object-oriented, spatial, and temporal structures) when these knowledge bases (possibly) contain inconsistencies, uncertainties, and nonmonotonic modes of negation. We specify the programming environment for this framework and show that it can be used to extract and integrate information obtained from different sources of data and resolve conflicts. We also show that it can be extended easily to integrate new knowledge bases. ICDE Flexible Relation: An Approach for Integrating Data from Multiple, Possibly Inconsistent Databases. Shailesh Agarwal,Arthur M. Keller,Gio Wiederhold,Krishna Saraswat 1995 In this work we address the problem of dealing with data inconsistencies while integrating data sets derived from multiple autonomous relational databases. The fundamental assumption in the classical relational model is that data is consistent and hence no support is provided for dealing with inconsistent data. Due to this limitation of the classical relational model, the semantics for detecting, representing, and manipulating inconsistent data have to be explicitly encoded in the applications by the application developer. In this paper, we propose the flexible relational model, which extends the classical relational model by providing support for inconsistent data. We present a flexible relation algebra, which provides semantics for database operations in the presence of potentially inconsistent data. Finally, we discuss issues raised for query optimization when the data may be inconsistent. ICDE An International Masters in Software Engineering: Experience and Prospects. Alberto Apostolico,Gianfranco Bilardi,Franco Bombi,Richard A. DeMillo 1995 Describes our experience with a newly-established international partnership between the Software Engineering Research Center (SERC), a university-based National Science Foundation (NSF) sponsored industrial research organization in the United States and an Italian industry-university team based in Padua, Italy. ICDE Mining Sequential Patterns. 
Rakesh Agrawal,Ramakrishnan Srikant 1995 We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction. ICDE Efficient Processing of Proximity Queries for Large Databases. Walid G. Aref,Daniel Barbará,Stephen Johnson,Sharad Mehrotra 1995 Emerging multimedia applications require database systems to provide support for new types of objects and to process queries that may have no parallel in traditional database applications. One such important class of queries is proximity queries, which aim to retrieve objects in the database that are related by a distance metric in a way that is specified by the query. The importance of proximity queries has earlier been realized in developing constructs for visual languages. In this paper, we present algorithms for answering a class of proximity queries: fixed-radius nearest-neighbor queries over point objects. Processing proximity queries using existing query processing techniques results in high CPU and I/O costs. We develop new algorithms to answer proximity queries over objects that lie in the one-dimensional space (e.g., words in a document). The algorithms exploit query semantics to reduce the CPU and I/O costs, and hence improve performance. We also show how our algorithms can be generalized to handle d-dimensional objects. ICDE Design, Implementation and Evaluation of SCORE (a System for COntent based REtrieval of Pictures). Y. Alp Aslandogan,Chuck Thier,Clement T. Yu,Chengwen Liu,Krishnakumar R. Nair 1995 "We make use of a refined E-R model to represent the contents of pictures. We propose remedies to handle mismatches which may arise due to differences in perception of picture contents. An iconic user interface for visual query construction is presented. A naive user can specify his/her intention without learning a query language. A function which computes the similarity between a picture and a user's description is provided. Pictures which are sufficiently close to the user description, as measured by the similarity function, are retrieved. We present the results of a user-friendliness experiment to evaluate the user interface as well as retrieval effectiveness. Encouraging retrieval results and valuable lessons are obtained." ICDE A High Performance Configurable Storage Manager. Alexandros Biliris,Euthimios Panagos 1995 Presents the architecture of BeSS (Bell Laboratories Storage System), a high-performance configurable database storage manager providing key facilities for the fast development of object-oriented, relational or home-grown database management systems. BeSS is based on a multi-client multi-server architecture offering distributed transaction management facilities and extensible support for persistence.
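The Agrawal and Srikant entry above (Mining Sequential Patterns) counts, for each candidate pattern, how many customers' time-ordered purchase sequences contain it. The toy sketch below shows only that support-counting step, for patterns of single items; the actual AprioriAll and AprioriSome algorithms operate on sequences of itemsets and add candidate generation and pruning, and the data and names here are invented.

```python
# Toy support counting for sequential patterns: a customer supports a pattern
# if the pattern occurs as a (not necessarily contiguous) subsequence of that
# customer's purchases in transaction-time order. Illustration only.

def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)   # consumes `it` left to right

def support(pattern, customer_sequences):
    """Fraction of customers whose purchase sequence contains the pattern."""
    n = sum(1 for seq in customer_sequences.values() if is_subsequence(pattern, seq))
    return n / len(customer_sequences)

if __name__ == "__main__":
    purchases = {                       # customer id -> items in time order (invented)
        "c1": ["tv", "dvd", "cable"],
        "c2": ["tv", "cable"],
        "c3": ["dvd", "tv", "cable"],
    }
    print(support(["tv", "cable"], purchases))   # 1.0: all three customers support it
```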
We present some novel aspects of the BeSS architecture, including a fast object reference technique that allows re-organization of databases without affecting existing references, and two operation modes that an application running on a client or server machine can use to interact with the storage system: (i) copy on access and (ii) shared memory. ICDE "Title, Table of Contents, General Chairs' Message, Program Chairs' Message, Reviewers, Committees, Author Index." 1995 "Title, Table of Contents, General Chairs' Message, Program Chairs' Message, Reviewers, Committees, Author Index." ICDE Transactions in the Client-Server EOS Object Store. Alexandros Biliris,Euthimios Panagos 1995 The paper describes the client-server software architecture of the EOS storage manager and the concurrency control and recovery mechanisms it employs. Unlike most client-server storage systems that use the standard two-phase locking protocol, EOS offers a semi-optimistic locking scheme based on a multigranularity two-version two-phase locking protocol. Under this scheme, many readers are allowed to access a data item while it is being updated by a single writer. For recovery, EOS maintains a write-ahead redo-only log because of the potential benefits it offers in a client-server environment. First, there are no undo records, as log records of aborted transactions are never inserted in the log; this minimizes the I/O and network transfer costs associated with logging during normal transaction execution. Secondly, it reduces the space required for the log. Thirdly, it facilitates fast recovery from system crashes because only one forward scan of the log is required for installing the updates performed by transactions that committed prior to the crash. Performance results of the EOS recovery subsystem are also presented. ICDE Building an Integrated Active OODBMS: Requirements, Architecture, and Design Decisions. Alejandro P. Buchmann,Jürgen Zimmermann,José A. Blakeley,David L. Wells 1995 "Active OODBMSs must provide efficient support for event detection, composition and rule execution. Previous experience of building active capabilities on top of existing closed OODBMSs has proven to be ineffective. We propose instead an active OODBMS architecture where event detection and rule support are tightly integrated with the rest of the core OODBMS functionality. After presenting an analysis of the requirements of active OODBMSs, we discuss the event set, rule execution modes and lifespan of the events supported in our architecture. We also discuss event composition coupling relative to transaction boundaries. Since building an active OODBMS ex nihilo is extremely expensive, we are building the REACH (REal-time ACtive Heterogeneous) OODBMS by extending Texas Instruments' Open OODB toolkit.
Open OODB is particularly well-suited for our purposes because it is the first DBMS whose architecture closely resembles the active database paradigm. It provides low-level event detection and invokes appropriate DBMS functionality as actions. We describe the architecture of the event detection and composition mechanisms, and the rule-firing process of the REACH active OODBMS, and show how these mechanisms interplay with the Open OODB core mechanisms." ICDE Toward Scalability and Interoperability of Heterogeneous Information Sources. Son Dao 1995 Future large and complex information systems create new challenges and opportunities for research and advanced development in data management. A brief description of Hughes research and prototype efforts to meet these challenges is summarized. ICDE Active Database Management of Global Data Integrity Constraints in Heterogeneous Database Environments. Lyman Do,Pamela Drew 1995 Today, enterprises maintain many disparate information sources over which complex business applications are expected. The informal and ad hoc characteristics of these environments make the information very prone to inconsistency. Yet, the flexibility of application execution given to different parts of an organization is desirable. This paper introduces a new mechanism in which the execution of asynchronous, pre-existing, yet related, applications can be harnessed. A multidatabase framework that supports the concurrent execution of these heterogeneous, distributed applications is presented. Using this framework, we introduce an intuitive conceptual model and algorithm for the enforcement of interdatabase constraints based on active database technology. ICDE ECA Rule Integration into an OODBMS: Architecture and Implementation. Sharma Chakravarthy,V. Krishnaprasad,Z. Tamizuddin,R. H. Badani 1995 Making a database system active entails not only the specification of expressive ECA (event-condition-action) rules, algorithms for the detection of composite events, and rule management, but also a viable architecture for rule execution that extends a passive DBMS, and its implementation. We propose an integrated active DBMS architecture for incorporating ECA rules using the Open OODB Toolkit (from Texas Instruments). We then describe the implementation of the composite event detector, and rule execution model for object-oriented active DBMS. Finally, the functionality supported by this architecture and its extensibility are analyzed along with the experiences gained. ICDE Modeling Scientific Experiments with an Object Data Model. I-Min A. Chen,Victor M. Markowitz 1995 We examine the main requirements for modeling scientific experiments and propose constructs that fulfil these requirements. We show that existing object-oriented and semantic data models do not provide such constructs. Experiment (protocol) and object constructs can be combined in order to provide seamless object and experiment modeling. We present an example of combining protocol and object constructs into a unified framework, the Object-Protocol Model (OPM), and briefly describe the implementation of an OPM interface on top of commercial relational database management systems (DBMSs). ICDE Improving SQL with Generalized Quantifiers. Ping-Yu Hsu,Douglas Stott Parker Jr. 1995 "A generalized quantifier is a particular kind of operator on sets. 
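The Chakravarthy et al. entry above integrates ECA (event-condition-action) rules into an OODBMS. The fragment below is a bare-bones sketch of the ECA pattern itself, with invented names: signalling an event evaluates the conditions of the rules registered for it and runs the actions whose conditions hold. Composite event detection, coupling modes, and the paper's rule execution model are omitted.

```python
# Bare-bones ECA rule dispatch, for illustration only: an event triggers the
# rules registered for it; each rule's condition is checked and, if it holds,
# its action is executed. Names and the example rule are invented.

class RuleManager:
    def __init__(self):
        self.rules = {}                       # event name -> list of (condition, action)

    def define_rule(self, event, condition, action):
        self.rules.setdefault(event, []).append((condition, action))

    def signal(self, event, **params):
        for condition, action in self.rules.get(event, []):
            if condition(**params):           # C of ECA
                action(**params)              # A of ECA

if __name__ == "__main__":
    rm = RuleManager()
    rm.define_rule("update_salary",
                   condition=lambda old, new: new > 1.5 * old,
                   action=lambda old, new: print(f"audit: raise {old} -> {new}"))
    rm.signal("update_salary", old=100, new=200)   # condition holds, action runs
```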
Having recently come under increasing attention from linguists and logicians, they correspond to many useful natural language phrases, including: three, Chamberlin's three, more than three, fewer than three, at most three, all but three, no more than three, not more than half the, at least two and not more than three, no student's, most male and all female, etc. Reasoning about quantifiers is a source of recurring problems for most SQL users, and leads to both confusion and incorrect expression of queries. By adopting a more modern and natural model of quantification, these problems can be alleviated. We show how generalized quantifiers can be used to improve the SQL interface." ICDE Optimizing Queries with Materialized Views. Surajit Chaudhuri,Ravi Krishnamurthy,Spyros Potamianos,Kyuseok Shim 1995 While much work has addressed the problem of maintaining materialized views, the important question of optimizing queries in the presence of materialized views has not been resolved. In this paper, we analyze the optimization question and provide a comprehensive and efficient solution. Our solution has the desirable property that it is a simple generalization of the traditional query optimization algorithm. ICDE The Design and Implementation of a Full-Fledged Multiple DBMS. Shu-Chin Su Chen,Chih-Shing Yu,Yen-Yao Yao,San-Yih Hwang,B. Paul Lin 1995 We have described our design of the multiple DBMS (MDBMS). This MDBMS enables users to access data controlled by different DBMSs as if the data were managed by a single DBMS. It supports facilities for SQL queries and transactions, and considers security functions. In addition, an ODBC driver at the client site has been realized to ease the development of MDBMS applications. Several popular commercial DBMSs, including Oracle, Informix and Sybase, have been successfully integrated. The MDBMS is in operation now. However, we found the performance to be unsatisfactory. It took several seconds to process an SQL query with a single join on two relations of hundreds of tuples. We have identified the performance bottleneck to be the retrieval of metadata. The current MDBMS Server employs a commercial DBMS to store metadata, which is necessary for processing a global query. The processing of a query is slow because it needs to retrieve the schema information via an external DBMS several times. We are currently designing a core storage manager and an access manager specifically for maintaining the metadata and the intermediate results of a global query. We expect this design to significantly improve the performance. ICDE Scalable Parallel Query Server for Decision Support Applications. Jen-Yao Chung 1995 Decision-support applications require the ability to query against large amounts of detailed historical data. We are exploiting parallel technology to improve query response time through query decomposition, CPU and I/O parallelism, and a client/server approach. IBM System/390 Parallel Query Server is built on advanced and low-cost CMOS microprocessors for decision-support applications. We discuss the design, implementation and performance of a scalable parallel query server. ICDE Prairie: A Rule Specification Framework for Query Optimizers. Dinesh Das,Don S. Batory 1995 From our experience, current rule-based query optimizers do not provide a very intuitive and well-defined framework for defining rules and actions. To remedy this situation, we propose an extensible and structured algebraic framework called Prairie for specifying rules.
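To make the generalized-quantifier idea above concrete, here is a small illustrative reading of a few such quantifiers as predicates over two sets, a restriction (e.g., students) and a scope (e.g., those who passed). This is only a sketch of the concept, not the SQL extension proposed in the paper.

```python
# Generalized quantifiers read as predicates over two sets A (restriction) and B (scope).
def at_least(n):
    return lambda A, B: len(A & B) >= n

def at_most(n):
    return lambda A, B: len(A & B) <= n

def all_but(n):
    return lambda A, B: len(A - B) == n

def most():
    return lambda A, B: len(A & B) > len(A - B)

students = {"ann", "bob", "carol", "dave"}
passed   = {"ann", "bob", "carol", "eve"}

print(at_least(2)(students, passed))    # True: three students passed
print(all_but(1)(students, passed))     # True: all but dave passed
print(most()(students, passed))         # True: more students passed than did not
```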
Prairie facilitates rule-writing by enabling a user to write rules and actions more quickly, correctly, and in an easy-to-understand and easy-to-debug manner. Query optimizers consist of three major parts: a search space, a cost model and a search strategy. Our approach is to develop only the algebra that defines the search space and the cost model, and to use the Volcano optimizer-generator as our search engine. Using Prairie as a front-end, we translate Prairie rules to Volcano to validate our claim that Prairie makes it easier to write rules. We describe our algebra and present experimental results which show that using a high-level framework like Prairie to design large-scale optimizers does not sacrifice efficiency. ICDE An Object-Oriented Conceptual Modeling of Video Data. Young Francis Day,Serhan Dagtas,Mitsutoshi Iino,Ashfaq A. Khokhar,Arif Ghafoor 1995 We propose a graphical data model for specifying spatio-temporal semantics of video data. The proposed model segments a video clip into subsegments consisting of objects. Each object is detected and recognized, and the relevant information of each object is recorded. The motions of objects are modeled through their relative spatial relationships as time evolves. Based on the semantics provided by this model, a user can create his/her own object-oriented view of the video database. Using propositional logic, we describe a methodology for specifying conceptual queries involving spatio-temporal semantics and expressing views for retrieving various video clips. Alternatively, a user can sketch the query by exemplifying the concept. The proposed methodology can be used to specify spatio-temporal concepts at various levels of information granularity. ICDE Locking in OODBMS Client Supported Nested Transactions. Laurent Daynès,Olivier Gruber,Patrick Valduriez 1995 Nested transactions facilitate the control of complex persistent applications by enabling both fine-tuning of the scope of rollback and safe intra-transaction parallelism. We are concerned with supporting concurrent nested transactions on client workstations of an OODBMS. Use of the traditional design and implementation of a lock manager results in a high CPU overhead: in-cache traversals of the OO7 benchmark perform, at best, 4.5 times slower than the same traversal achieved in virtual memory by a nonpersistent programming language. We propose a new design and implementation of a lock manager which cuts that factor down to 1.8. This lock manager supports nested transactions with both sibling and parent/child parallelisms, and provides object locking at a cost comparable to page locking. Object locking is therefore a better alternative due to its higher functionality. ICDE Enterprise Workflow Architecture. Weimin Du,Steve Peterson,Ming-Chien Shan 1995 "Workflow builders are designed to facilitate development of automated processes and support flexible applications that can be updated, enhanced or completely revamped. The Hewlett-Packard WorkManager is an open product data management solution with workflow management capabilities. WorkManager supports the entire product lifecycle by providing a single, logical repository for all data, and it manages and tracks enterprise-wide processes. With a strong information management platform at its core, WorkManager provides central administration capabilities, including supervision and intervention, where necessary.
Because enterprise data is usually fragmented and stored in a variety of legacy systems, and different organizations have different amounts of control over their data, an enterprise workflow system needs to support processes accessing data from different sites and applications. This paper describes the architecture of distributed workflow, Hewlett-Packard's solution to the enterprise workflow problem. The architecture is an extension of the existing WorkManager architecture. Its development is based on user requirements and four high-level user models. The user models and the architecture are described." ICDE Practical Issues for RDBMS Application Development. Kwo-Jean Farn,Shin-Ling Hu 1995 "We discuss some functions of relational database management systems (RDBMS) that may help RDBMS users to increase their productivity. First, for a data dictionary, we can define each field attribute characteristic in the create table statement, such as signed, unsigned, negative, nonnegative, list or coded value, range value, default value; uppercase, lowercase or upperlow case; IDstamp value; datestamp value; or computation field. We also point out some inconvenient functions of RDBMS. A more intelligent query optimizer is also needed. For users in a Chinese environment, Chinese characteristics such as field name defining, sorting, and partial searching are required. In addition, for an area center using a horizontal fragmentation scheme, a tool which can automatically update the other sites in parallel when the central site's kernel part changes is required." ICDE A Structure Based Schema Integration Methodology. Manuel García-Solaco,Fèlix Saltor,Malú Castellanos 1995 The process of integrating the schemas of several databases into an integrated schema is not easy, due to semantic heterogeneities. We present a method to detect class similarities by following a strategy and applying comparison criteria that exploit the semantically rich structures of the schemas (previously enriched), along both the generalization/specialization and the aggregation dimensions. Relaxations may be applied to conform a pair of classes, resulting in penalizations in the computation of the degree of similarity. Our approach needs fewer comparisons than methods based on attribute comparison. ICDE Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics. Peter J. Haas,Arun N. Swami 1995 "We compare empirically the cost of estimating the selectivity of a star join using the sampling-based t-cross procedure to the cost of computing the join and obtaining the exact answer. The relative cost of sampling can be excessive when a join attribute value exhibits ""heterogeneous skew."" To alleviate this problem, we propose Algorithm TCM, a modified version of t-cross that incorporates ""augmented frequent value"" (AFV) statistics. We provide a sampling-based method for estimating AFV statistics that does not require indexes on attribute values, requires only one pass through each relation, and uses an amount of memory much smaller than the size of a relation. Our experiments show that the use of estimated AFV statistics can reduce the relative cost of sampling by orders of magnitude. We also show that use of estimated AFV statistics can reduce the relative error of the classical System R selectivity formula." ICDE Two-Level Caching of Composite Object Views of Relational Databases. Catherine Hamon,Arthur M.
Keller 1995 We describe a two-level client-side cache for composite objects mapped as views of a relational database. A semantic model, the Structural Model, is used to specify joins on the relational database that are useful for defining composite objects. The lower level of the cache contains the tuples from each relation that have already been loaded into memory. These tuples are linked together from relation to relation according to the joins of the structural model. This level of the cache is shared among all applications using the data on this client. The higher level of the cache contains composed objects of data extracted from the lower level cache. This level of the cache uses the object schema of a single application, and the data is copied from the lower level cache for convenient access by the application. This two-level cache is designed as part of the Penguin system, which supports multiple applications, each with its own object schema, to share data stored in a common relational database. ICDE An Industrial Perspective of Software Architecture. Christine Hofmeister,Robert L. Nord,Dilip Soni 1995 The software architecture of a system describes how it is decomposed into components, how these components are interconnected, and how they communicate and interact with each other and with the environment. Software architecture represents critical, system-wide design decisions which affect quality, reconfigurability and reuse, and the cost for development and maintenance. In order to understand architecture as it is practised in the real world, we conducted a survey of a variety of industrial software systems. Our survey revealed the need for rigorous descriptions, systematic techniques, and well-defined processes to make architecture-level software development an engineering practice rather than an art. ICDE Set-Oriented Mining for Association Rules in Relational Databases. Maurice A. W. Houtsma,Arun N. Swami 1995 Describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss the optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. SETM uses only simple database primitives, viz. sorting and merge-scan join. SETM is simple, fast and stable over the range of parameter values. The major contribution of this paper is that it shows that at least some aspects of data mining can be carried out by using general query languages such as SQL, rather than by developing specialized black-box algorithms. The set-oriented nature of SETM facilitates the development of extensions. ICDE Navigation Server: A Highly Parallel DBMS on Open Systems. Ron-Chung Hu,Rick Stellwagen 1995 "Navigation Server was jointly developed to provide a highly scalable, high-performance parallel database server in the industry. By combining ATandT's experience in massively parallel systems, such as Teradata system, with Sybase's industry-leading open, client/server DBMS, Navigation Server was developed with some specific design objectives: Scalability. Minimizing interference by minimizing resource sharing among the concurrent processes, the shared-nothing architecture has, as of today, emerged as the architecture of choice for highly scalable parallel systems. 
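The SETM entry above argues that association-rule counting can be phrased with ordinary set-oriented primitives such as sorting and merge-scan joins. The toy sketch below mimics that flavor in Python for frequent item pairs; it is illustrative only and is not the SETM algorithm itself.

```python
# Count frequent item pairs from (transaction_id, item) tuples, set-at-a-time in spirit.
from collections import Counter
from itertools import combinations

rows = [  # (transaction_id, item) pairs, as they might sit in a relation
    (1, "bread"), (1, "milk"), (1, "beer"),
    (2, "bread"), (2, "milk"),
    (3, "milk"), (3, "beer"),
]

minsup = 2
# Group by transaction (sorting plays the role of the merge-scan join's ordering).
by_txn = {}
for txn, item in sorted(rows):
    by_txn.setdefault(txn, []).append(item)

# "Self-join" each transaction with itself to form candidate 2-itemsets, then count.
pair_counts = Counter()
for items in by_txn.values():
    for a, b in combinations(sorted(set(items)), 2):
        pair_counts[(a, b)] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= minsup}
print(frequent_pairs)   # ('beer', 'milk') and ('bread', 'milk') both reach the threshold
```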
Navigation Server adopts the shared-nothing parallel architecture to allow parallelized queries, updates, load, backup, and other utilities on a partitioned database. Portability. Built on top of Sybase's open system products, Navigation Server is portable to Unix-based parallel machines. Further, the shared-nothing software architecture demands minimal changes when porting Navigation Server to various parallel platforms ranging from symmetric multi-processing, clustered, to massively parallel processing systems. Availability. For a parallel system with many nodes, hardware component failures may occur fairly often. To achieve high availability, Navigation Server implements a hierarchical monitoring scheme to monitor all the running processes. With the monitoring frequency configurable by users, a process will be restarted automatically on an alternate node once a failure is detected. Usability. Navigation Server appears as a single Sybase SQL server to end users. Besides the Sybase SQL Server, it provides two management tools: the Configurator and the Navigation Server Manager. The Configurator analyzes customers' workload, monitors system performance, and recommends configurations for optimal performance and resource utilization. The Navigation Server Manager provides graphical utilities to administer the system simply and efficiently." ICDE A Performance Evaluation of Load Balancing Techniques for Join Operations on Multicomputer Database Systems. Kien A. Hua,Wallapak Tavanapong,Honesty C. Young 1995 There has been a wealth of research in the area of parallel join algorithms. Among them, hash-based algorithms are particularly suitable for shared-nothing database systems. The effectiveness of these techniques depends on the uniformity in the distribution of the join attribute values. When this condition is not met, a severe fluctuation may occur among the bucket sizes, causing uneven workload for the processing nodes. Many parallel join algorithms with load balancing capability have been proposed to address this problem. Among them, the sampling and incremental approaches have been shown to provide an improvement over the more conventional methods. The comparison between these two approaches, however, has not been investigated. In this paper, we improve these techniques and implement them on an nCUBE/2 parallel computer to compare their performance. Our study indicates that the sampling technique is the better approach. ICDE Record Subtyping in Flexible Relations by Means of Attribute Dependencies. Christian Kalus,Peter Dadam 1995 "The model of flexible relations supports heterogeneous sets of tuples in a strongly typed way. The elegance of the standard relational model is preserved by using a single, generic scheme constructor. In each model supporting structural variants, the shape of some part of a heterogeneous scheme may be determined by the contents of some other part of the scheme. We formalize this relationship by a certain kind of integrity constraint we have called ""attribute dependency"" (AD). We motivate how ADs can be used, besides their application in type and integrity checking, to incorporate record subtyping into our extended relational model. Moreover, we show that ADs yield a stronger assertion than the traditional record subtyping rule, as they consider interdependencies among refinements. We discuss how ADs are related to query processing and how they may help to identify redundant operations." ICDE Bottom-Up Evaluation of Logic Programs Using Binary Decision Diagrams.
Mizuho Iwaihara,Yusaku Inoue 1995 Binary decision diagram (BDD) is a data structure to manipulate Boolean functions and recognized as a powerful tool in the VLSI CAD area. We consider that compactness and efficient operations of BDDs can be utilized for storing temporary relations in bottom-up evaluation of logic queries. We show two methods of encoding relations into BDDs, called logarithmic encoding and linear encoding, define relational operations on BDDs and discuss optimizations in ordering BDD variables to construct memory and time efficient BDDs. Our experiments show that our BDD-based bottom-up evaluator has remarkable performance against traditional hash table-based methods for transitive closure queries on dense graphs. ICDE A Version Numbering Scheme with a Useful Lexicographical Order. Arthur M. Keller,Jeffrey D. Ullman 1995 We describe a numbering scheme for versions with alternatives that has a useful lexicographical ordering. The version hierarchy is a tree. By inspection of the version numbers, we can easily determine whether one version is an ancestor of another. If so, we can determine the version sequence between these two versions. If not, we can determine the most recent common ancestor to these two versions (i.e., the least upper bound, lub). Sorting the version numbers lexicographically results in a version being followed by all descendants and preceded by all its ancestors. We use a representation of nonnegative integers that is self delimiting and whose lexicographical ordering matches the ordering by value. ICDE Computing Temporal Aggregates. Nick Kline,Richard T. Snodgrass 1995 Aggregate computation, such as selecting the minimum attribute value of a relation, is expensive, especially in a temporal database. We describe the basic techniques behind computing aggregates in conventional databases and show that these techniques are not efficient when applied to temporal databases. We examine the problem of computing constant intervals (intervals of time for which the aggregate value is constant) used for temporal grouping. We introduce two new algorithms for computing temporal aggregates: the aggregation tree and the k-ordered aggregation tree. An empirical comparison demonstrates that the choice of algorithm depends in part on the amount of memory available, the number of tuples in the underlying relation, and the degree to which the tuples are ordered. This study shows that the simplest strategy is to first sort the underlying relation, then apply the k-ordered aggregation tree algorithm with k=1. ICDE Infobusiness Issues in ROC. Lung-Lung Liu 1995 The infobusiness operation has been popular for many years in the ROC. Management information systems in the government, military, and enterprise were the original applications, and then came the information service requirement from various kinds of users. Computer networks, database systems, and information providers together proposed the draft infobusiness environment. Closed systems are still the ones that major infobusiness operations provide to their customers. Issues in the infobusiness development include: (1) closed systems limited the infobusiness opportunity. (2) Chinese character handling and the inconvenient localized environment blocked the user and the vendor in information service applications. (3) Public computer networks are not popular, hence the add-on value of infobusiness is invisible. (4) Large database handling experience is not available. These issues concern techniques, standards, and even laws. 
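The version-numbering entry above (Keller and Ullman) relies on a lexicographic order that mirrors the version tree. The simplified sketch below uses integer tuples as version numbers to illustrate the ancestor test, the least-upper-bound computation, and the ordering property; the paper's self-delimiting integer encoding is not reproduced here.

```python
# Tuple-based illustration of version numbers whose lexicographic order follows the tree.
def is_ancestor(a, b):
    # a is an ancestor of b iff a's number is a proper prefix of b's.
    return len(a) < len(b) and b[:len(a)] == a

def common_ancestor(a, b):
    # The longest common prefix is the most recent common ancestor (the lub).
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

root      = (1,)
child     = (1, 1)
alt_child = (1, 2)          # an alternative branched from the root
grandkid  = (1, 1, 1)

print(is_ancestor(root, grandkid))           # True
print(common_ancestor(grandkid, alt_child))  # (1,)
# Lexicographic sorting places every version before all of its descendants.
print(sorted([grandkid, alt_child, root, child]))
```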
For example, the open system concept is generally acceptable, but it is usually too vague for the public. Open databases still do not talk smoothly to one another, especially when different operating systems on networks are trying to exchange Chinese information. Another factor is that the total number of standard Chinese characters is still in negotiation internationally although applications have been practised for 20 years on computers. In order to unify the number of Chinese characters, there is discussion about whether laws are necessary to define a formal discipline for creating new Chinese characters. ICDE A Cost-effective Near-line Storage Server for Multimedia System. Siu-Wah Lau,John C. S. Lui,P. C. Wong 1995 We consider a storage server architecture for multimedia information systems. While most other works on multimedia storage servers assume on-line disk storage, we consider a two-tier storage architecture with a robotic tape library as the vast near-line storage and on-line disks as the front-line storage. Magnetic tapes are cheaper, more robust, and have a larger capacity; hence they are more cost effective for large scale storage systems (e.g., video on demand (VOD) systems may store tens of thousands of videos). We study in detail the design issues of the tape subsystem and propose some novel tape scheduling algorithms which give faster response and require less disk buffering. ICDE Pushing Semantics Inside Recursion: A General Framework for Semantic Optimization of Recursive Queries. Laks V. S. Lakshmanan,Rokia Missaoui 1995 We consider a class of linear query programs and integrity constraints and develop methods for (i) computing the residues and (ii) pushing them inside the recursive programs, minimizing redundant computation and run-time overhead. We also discuss applications of our strategy to intelligent query answering. ICDE RBE: Rendering By Example. Ravi Krishnamurthy,Moshé M. Zloof 1995 Rendering is defined to be a customized presentation of data in such a way that allows users to subsequently interact with the presented data. Traditionally such a user interface would be a custom application written using conventional programming languages; in contrast we propose an application-independent, declarative (i.e., what-you-want) language that we call Rendering By Example, RBE, with the capability to specify a wide variety of renderings. RBE is a domain calculus language over user interface widgets. Most previous domain calculus database languages (e.g., QBE, LDL, Datalog) mainly addressed the data processing problem. The main contribution in developing RBE is to model semantics of user interactions in a declarative way. This declarative specification not only allows quick and ad-hoc specification of renderings (i.e., user interfaces) but also provides a framework to understand renderings as an abstract concept, independent of the application. Further, such a linguistic abstraction provides the basis for user-interface research. RBE is part of the ICBE language that is being prototyped in the Picture Programming project at HP Labs. ICDE An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems. Yibei Ling,Wei Sun 1995 The results of a performance study of the representative sampling-based size estimation methods in database management systems are reported in this paper. Major performance measurement includes estimation accuracy, the amount of sample taken, and the coverage. 
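Several entries above (the AFV selectivity work and the size-estimation study introduced just before this point) rest on the same basic idea: estimate a selection's result size from a random sample and scale up. A minimal illustration follows, with an arbitrarily chosen sample fraction; it is not any of the specific estimators evaluated in those papers.

```python
# Naive sampling-based estimator for the size of a selection, purely illustrative.
import random

def estimate_selection_size(relation, predicate, sample_fraction=0.1, seed=0):
    rng = random.Random(seed)
    n = max(1, int(len(relation) * sample_fraction))
    sample = rng.sample(relation, n)
    hits = sum(1 for row in sample if predicate(row))
    # Scale the observed sample selectivity up to the full relation.
    return hits / n * len(relation)

rows = [{"age": a} for a in range(1000)]
est = estimate_selection_size(rows, lambda r: r["age"] < 250, sample_fraction=0.05)
print(est)   # close to the true answer of 250
```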
The impact of skewed data on the performance is also discussed. These results allow a better understanding and assessment of sampling estimation methods and determine the suitability of different methods under different situations. ICDE A Similarity Graph-Based Approach to Declustering Problems and Its Application towards Paralleling Grid Files. Duen-Ren Liu,Shashi Shekhar 1995 We propose a new similarity-based technique for declustering data. The proposed method can adapt to available information about query distributions, data distributions, data sizes and partition-size constraints. The method is based on max-cut partitioning of a similarity graph defined over the given set of data, under constraints on the partition sizes. It maximizes the chances that a pair of data-items that are to be accessed together by queries are allocated to distinct disks. We show that the proposed method can achieve optimal speed-up for a query-set, if there exists any other declustering method which will achieve the optimal speed-up. Experiments in parallelizing grid files show that the proposed method outperforms mapping-function-based methods for interesting query distributions as well for non-uniform data distributions. ICDE A Trace-Based Simulation of Pointer Swizzling Techniques. Mark L. McAuliffe,Marvin H. Solomon 1995 "Persistent object-oriented applications that traverse large object graphs can improve their performance by caching objects in main memory while they are being used. While caching offers large performance benefits, the techniques used to locate these cached objects in memory can still impede the application's performance. We present the results of a trace-based simulation study of pointer swizzling techniques (techniques for reducing the cost of access to cached objects). We used traces derived from actual persistent programs to find a class of swizzling techniques that performs well, yet permits changes to the contents of in-memory object caches over the lifetime of an application. Our study demonstrates the superiority of a class of techniques known as ""indirect swizzling"" for a variety of workloads and system configurations." ICDE The Design and Experimental Evaluation of an Information Discovery Mechanism for Networks of Autonomous Database Systems. Dennis McLeod,Antonio Si 1995 An approach and mechanism to support the dynamic discovery of information units within a collection of autonomous and heterogeneous database systems is described. The mechanism is based upon a core set of database constructs that characterizes object database systems, along with a set of self-adaptive heuristics employing techniques from machine learning. The approach provides an uniform framework for organizing, indexing, searching, and browsing database information units within an environment of multiple, autonomous, interconnected databases. The feasibility of the approach and mechanism is illustrated using a protein/genetics application environment. Metrics for measuring the performance of the discovery system are presented and the effectiveness of the system is thereby evaluated. Performance tradeoffs are examined and analyzed by experiments performed, employing a simulation model. ICDE Disk Read-Write Optimizations and Data Integrity in Transaction Systems Using Write-Ahead Logging. C. Mohan 1995 "We discuss several disk read-write optimizations that are implemented in different transaction systems and disk hardware to improve performance. 
These include: (1) when multiple sectors are written to disk, the sectors may be written out of sequence (SCSI disk interfaces do this). (2) Avoiding initializing pages on disk when a file is extended. (3) Not accessing individual pages during a mass delete operation (e.g., dropping an index from a file which contains multiple indexes). (4) Permitting a previously deallocated page to be reallocated without the need to read the deallocated version of the page from disk during its reallocation. (5) Purging of file pages from the buffer pool during a file erase operation (e.g., a table drop). (6) Avoiding logging for bulk operations like index create. We consider a system which implements the above optimizations and in which a page consists of multiple disk sectors and recovery is based on write-ahead logging using a log sequence number on every page. For such a system, we present a simple method for guaranteeing the detection of the partial disk write of a page. Detecting partial writes is very important not only to ensure data integrity from the users' viewpoint but also to make the transaction system software work correctly. Once a partial write is detected, it is easy to recover such a page using media recovery techniques. Our method imposes minimal CPU and space overheads. It has been implemented in DB2/6000 and ADSM." ICDE A Transaction Transformation Approach to Active Rule Processing. Danilo Montesi,Riccardo Torlone 1995 Describes operational aspects of a novel approach to active rule processing based on a transaction transformation technique. A user-defined transaction, which is viewed as a sequence of atomic database updates forming a semantic unit, is translated by means of active rules into a new transaction that explicitly includes the additional updates due to active rule processing. It follows that the execution of the new transaction in a passive environment corresponds to the execution of the original transaction within the active environment defined by the given rules. Both immediate and deferred execution models are considered. The approach presents two main features. First, it relies on a well known formal basis that allow us to derive solid results on equivalence, confluence and optimization issues. Second, it is easy to implement as it does not require any specific run-time support. ICDE OCAM: A Collaborative System for Multimedia Applications. Hassan Mountassir,S. Serre 1995 Software engineering tasks as design and programming require the concurrent participation of multiple users, possibly geographically dispersed. But traditional software environments have not been designed to facilitate collaborative work. In this paper we present briefly a tool for communication among many persons in which users can elaborate documents with synchronous and asynchronous interaction modes. ICDE Relational Database Compression Using Augmented Vector Quantization. Wee Keong Ng,Chinya V. Ravishankar 1995 Data compression is one way to alleviate the I/O bottleneck problem faced by I/O-intensive applications such as databases. However, this approach is not widely used because of the lack of suitable database compression techniques. In this paper, we design and implement a novel database compression technique based on vector quantization (VQ). VQ is a data compression technique with wide applicability in speech and image coding, but it is not directly suitable for databases because it is lossy. 
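The disk-optimization entry above (Mohan) centers on detecting a partial (torn) write of a multi-sector page. One common way to illustrate the general idea is to stamp every sector of a page with the same small check value at each write and flag pages whose sectors disagree; the sketch below does exactly that and is a simplification, not necessarily the exact mechanism shipped in DB2/6000 or ADSM.

```python
# Detect a torn multi-sector page write via a per-sector check byte; illustrative only.
SECTORS_PER_PAGE = 8

def write_page(write_count, sector_payloads):
    stamp = write_count % 256                     # one byte of overhead per sector
    return [payload + bytes([stamp]) for payload in sector_payloads]

def partial_write_detected(page_sectors):
    stamps = {sector[-1] for sector in page_sectors}
    return len(stamps) > 1                        # mixed stamps => torn write

sectors = [b"x" * 511] * SECTORS_PER_PAGE
page_v2 = write_page(2, sectors)
torn = write_page(3, sectors)[:3] + page_v2[3:]   # only 3 sectors of the new write landed
print(partial_write_detected(page_v2))  # False
print(partial_write_detected(torn))     # True
```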
We show how one may use a lossless version of vector quantization to reduce database space storage requirements and improve disk I/O bandwidth. ICDE A Transparent Object-Oriented Schema Change Approach Using View Evolution. Young-Gook Ra,Elke A. Rundensteiner 1995 When a database is shared by many users, updates to the database schema are almost always prohibited because there is a risk of making existing application programs obsolete when they run against the modified schema. This paper addresses the problem by integrating schema evolution with view facilities. When new requirements necessitate schema updates for a particular user, the user specifies schema changes to the personal view rather than to the shared base schema. Our view evolution approach then computes a new view schema that reflects the semantics of the desired schema change, and replaces the old view with the new one. We present algorithms that implement the set of schema evolution operations typically supported by OODB systems as view definitions. This approach provides the means for schema change without affecting other views (and thus without affecting existing application programs). The persistent data is shared by different views of the schema, i.e., both old as well as newly developed applications can continue to interoperate. In this paper, we present examples that demonstrate our approach. ICDE Design of Multimedia Storage Systems for On-Demand Playback. Yen-Jen Oyang,Meng-Huang Lee,Chun-Hung Wen,Chih-Yuan Cheng 1995 This paper presents a comprehensive procedure to design multimedia storage systems for on-demand playback. The design stresses effective utilization of disk bandwidth with minimal data buffer to minimize overall system costs. The design procedure is most distinctive in the following two aspects: it bases on a tight upper bound of the lumped disk seek time for the Scan disk scheduling algorithm to achieve effective utilization of disk bandwidth; it starts with a general two-level hierarchical disk array structure to derive the optimal configuration for specific requirements. ICDE Object Exchange Across Heterogeneous Information Sources. Yannis Papakonstantinou,Hector Garcia-Molina,Jennifer Widom 1995 We address the problem of providing integrated access to diverse and dynamic information sources. We explain how this problem differs from the traditional database integration problem and we focus on one aspect of the information integration problem, namely information exchange. We define an object-based information exchange model and a corresponding query language that we believe are well suited for integration of diverse information sources. We describe how, the model and language have been used to integrate heterogeneous bibliographic information sources. We also describe two general-purpose libraries we have implemented for object exchange between clients and servers. ICDE Deputy Mechanisms for Object-Oriented Databases. Zhiyong Peng,Yahiko Kambayashi 1995 Concepts of deputy objects and deputy classes for object-oriented databases (OODBs) are introduced. They can be used for unified realization of object views, roles and migration. The previous researches on these concepts were carried out separately, although they are very closely related. Objects appearing in a view can be regarded as playing roles in that view. Object migration is caused by change of roles of an object. Deputy objects can be used for unified treatment of them and generalization of these concepts. 
The schemata of deputy objects are defined by deputy classes. A set of algebraic operations are developed for deputy class derivation. In addition, three procedures for update propagation between deputy objects and source objects have been designed, which can support dynamic classification. The unified realization of object views, roles and migration by deputy mechanisms can achieve the following advantages. (1) Treating view objects as roles of an object allows them to have additional attributes and methods so that the autonomous views suitable for OODBs can be realized. (2) Handling object roles in the same way as object views enables object migration to be easily realized by dynamic classification functions of object views. (3) Generalization of object views, roles and migration makes it possible that various semantic constraints on them can, be defined and enforced uniformly. ICDE Axiomatization of Dynamic Schema Evolution in Objectbases. Randal J. Peters,M. Tamer Özsu 1995 Axiomatization of Dynamic Schema Evolution in Objectbases. ICDE Query Interoperation Among Object-Oriented and Relational Databases. Xiaolei Qian,Louiqa Raschid 1995 We develop an efficient algorithm for the query interoperation among existing heterogeneous object-oriented and relational databases. Our algorithm utilizes a canonical deductive database as a uniform representation of object-oriented schema and data. High-order object queries are transformed to the canonical deductive database in which they are partially evaluated and optimized, before being translated to relational queries. Our algorithm can be incorporated into object-oriented interfaces to relational databases or object-oriented federated databases to support object queries to heterogeneous relational databases. ICDE Buffer Management for Video Database Systems. Doron Rotem,J. Leon Zhao 1995 Future multimedia information systems are likely to manage thousands of videos with various lengths and display requirements. Mismatch of playback and delivery rates of compressed video data requires sophisticated buffer management algorithms to guarantee smooth playback of video data. In this paper, we address some of the many design and operational issues including buffer size requirements, refreshing policies, and support of multiple access points to the same video object. Three different buffer management strategies are proposed and analyzed to minimize the average waiting time while ensuring display without jerkiness. We also evaluate the effectiveness these buffer management strategies with a simulation study. ICDE SEQ: A Model for Sequence Databases. Praveen Seshadri,Miron Livny,Raghu Ramakrishnan 1995 This paper presents the SEQ model which is the basis for a system to manage various kinds of sequence data. The model separates the data from the ordering information, and includes operators based on two distinct abstractions of a sequence. The main contributions of the SEQ model are: (a) it can deal with different types of sequence data, (b) it supports an expressive range of sequence queries, (c) it draws from many of the diverse existing approaches to modeling sequence data. ICDE Generalized Partial Indexes. Praveen Seshadri,Arun N. Swami 1995 This paper demonstrates the use of generalized partial indexes for efficient query processing. We propose that partial indexes be built on those portions of the database that are statistically likely to be the most useful for query processing. 
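The generalized-partial-index proposal introduced in the last entry above can be illustrated with a structure that indexes only a chosen portion of a relation and falls back to a scan elsewhere. The sketch below is illustrative; the strategies in the paper are driven by statistical information, which this example omits.

```python
# A partial index: only keys in a designated range are indexed; other lookups scan.
class PartialIndex:
    def __init__(self, rows, key, in_indexed_range):
        self.rows, self.key, self.in_indexed_range = rows, key, in_indexed_range
        self.index = {}
        for row in rows:
            if in_indexed_range(key(row)):
                self.index.setdefault(key(row), []).append(row)

    def lookup(self, value):
        if self.in_indexed_range(value):
            return self.index.get(value, [])                      # index probe
        return [r for r in self.rows if self.key(r) == value]     # fallback scan

rows = [{"id": i} for i in range(1000)]
idx = PartialIndex(rows, key=lambda r: r["id"], in_indexed_range=lambda k: k >= 900)
print(len(idx.lookup(950)))   # 1, answered from the partial index
print(len(idx.lookup(10)))    # 1, answered by scanning the relation
```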
We identify three classes of statistical information, and two levels at which it may be available. We describe indexing strategies that use this information to significantly improve average query performance. Results from simulation experiments demonstrate that the proposed generalized partial indexing strategies perform well compared to the traditional approach to indexing. ICDE CCAM: A Connectivity-Clustered Access Method for Aggregate Queries on Transportation Networks: A Summary of Results. Shashi Shekhar,Duen-Ren Liu 1995 CCAM is an access method for general networks. It uses connectivity clustering. The nodes of the network are assigned to disk pages via the graph partitioning approach to maximize the CRR, i.e., the chances that a pair of connected nodes are allocated to a common page of the file. CCAM supports the operations of insert, delete, create, and find as well as the new operations, get-A-successor and get-successors, which retrieve one or all successors of a node to facilitate aggregate computations on networks. CCAM includes methods for static clustering, as well as dynamic incremental reclustering, to maintain, high CRR, in the face of updates without incurring high overheads. Experimental analysis indicates that CCAM can outperform many other access methods for network operations. ICDE Ternary Relationship Decomposition Strategies Based on Binary Imposition Rules. Il-Yeol Song,Trevor H. Jones 1995 We review a set of rules identifying which combinations of ternary and binary relationships can be combined simultaneously in semantically related situations. We investigate the effect of these rules on decomposing ternary relationships to simpler, multiple binary relationships. We also discuss the relevance of these decomposition strategies to ER modeling. We show that if at least one 1:1 or 1:M binary constraint can be identified within the construct of the ternary itself, then any ternary relationship can be decomposed to a binary format. From this methodology we construct a heuristic-the Constrained Ternary Decomposition (CTD) rule. ICDE The AQUA Approach to Querying Lists and Trees in Object-Oriented Databases. Bharathi Subramanian,Theodore W. Leung,Scott L. Vandenberg,Stanley B. Zdonik 1995 Relational database systems and most object-oriented database systems provide support for queries. Usually these queries represent retrievals over sets or multisets. Many new applications for databases, such as multimedia systems and digital libraries, need support for queries on complex bulk types such as lists and trees. In this paper we describe an object-oriented query algebra called AQUA (= A Query Algebra) for lists and trees. The operators in the algebra preserve the ordering between the elements of a list or tree, even when the result list or tree contains an arbitrary set of nodes from the original tree. We also present predicate languages for lists and trees which allow order-sensitive queries because they use pattern matching to examine groups of list or tree nodes rather than individual nodes. The ability to decompose predicate patterns enables optimizations that make use of indices. ICDE A New Recursive Subclass of Domain Independent Formulas Based on Subimplication. Joonyeoub Sung,Lawrence J. Henschen 1995 We motivate and define subimplication completion of a relational calculus query and of a general deductive database. Subimplication completion not only avoids getting unexpected answers, but also makes some domain dependent queries and databases domain independent. 
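The CCAM summary above assigns network nodes to disk pages so that connected nodes tend to share a page (a high CRR). Below is a greedy breadth-first sketch of that clustering idea together with a CRR computation; the actual method is based on graph partitioning, so this is only a rough illustration.

```python
# Greedy connectivity-based assignment of graph nodes to fixed-size pages.
from collections import deque

def cluster_to_pages(adj, page_size):
    page_of, pages, unvisited = {}, [], set(adj)
    while unvisited:
        page, queue = [], deque([min(unvisited)])
        while queue and len(page) < page_size:
            node = queue.popleft()
            if node in unvisited:
                unvisited.remove(node)
                page.append(node)
                queue.extend(n for n in adj[node] if n in unvisited)
        pages.append(page)
        for node in page:
            page_of[node] = len(pages) - 1
    return page_of

def crr(adj, page_of):
    # Fraction of edges whose endpoints land on the same page.
    edges = {(a, b) for a in adj for b in adj[a] if a < b}
    same = sum(1 for a, b in edges if page_of[a] == page_of[b])
    return same / len(edges)

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
page_of = cluster_to_pages(adj, page_size=3)
print(page_of, crr(adj, page_of))   # 5 of the 6 edges stay within a page here
```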
We define a new recursive subclass of domain independent formulas, called weakly range-restricted formulas, which is strictly larger than the class of range-restricted formulas. We also define admissible and deductive databases and show that under the subimplication completion they are domain independent and safe. ICDE A Heuristic Information Retrieval Model on a Massively Parallel Processor. Inien Syu,Sheau-Dong Lang,Kien A. Hua 1995 "We adapt a competition-based connectionist model to information retrieval. This model, which has been proposed for diagnostic problem solving, treats documents as ""disorders"" and user information needs as ""manifestations"", and it uses a competitive activation mechanism which converges to a set of disorders that best explain the given manifestations. Our experimental results using four standard document collections demonstrate the efficiency and the retrieval precision of this model, comparable to or better than that of various information retrieval models reported in the literature. We also propose a parallel implementation of the model on a SIMD machine, MasPar's MP-I. Our experimental results demonstrate the potential to achieve significant speedups." ICDE A Common Framework for Classifying and Specifying Deductive Database Updating Problems. Ernest Teniente,Toni Urpí 1995 We propose two interpretations of the event rules which provide a common framework for classifying and specifying deductive database updating problems such as view updating, materialized view maintenance, integrity constraints checking, integrity constraints maintenance, repairing inconsistent databases, integrity constraints satisfiability or condition monitoring. Moreover, these interpretations allow us to identify and to specify some problems that have received little attention up to now like enforcing or preventing condition activation. By considering only a unique set of rules for specifying all these problems, we want to show that it is possible to provide general methods able to deal with all these problems as a whole ICDE The Impact of Data Placement on Memory Management for Multi-Server OODBMS. Shivakumar Venkataraman,Miron Livny,Jeffrey F. Naughton 1995 We demonstrate the close relationship between data placement and memory management for symmetric multi-server OODBMS. We propose and investigate memory management algorithms for two data placement strategies, namely declustering and clustering. Through a detailed simulation, we show that by declustering the data most of the benefits of complex global memory management algorithms are realized by simple algorithms. In contrast we show that when data is clustered, the simple algorithms perform poorly. ICDE Translation of Object-Oriented Queries to Relational Queries. Clement T. Yu,Yi Zhang,Weiyi Meng,Won Kim,Gaoming Wang,Tracy Pham,Son Dao 1995 Proposes a formal approach for translating OODB queries to equivalent relational queries. The translation is accomplished through the use of relational predicate graphs and OODB predicate graphs. One advantage of using such a graph-based approach is that we can achieve bidirectional translation between relational queries and OODB queries. ICDE Efficient Processing of Nested Fuzzy SQL Queries. Qi Yang,Chengwen Liu,Jing Wu,Clement T. Yu,Son Dao,Hiroshi Nakajima 1995 Fuzzy databases have been introduced to deal with uncertain or incomplete information in many applications. The efficiency of processing fuzzy queries in fuzzy databases is a major concern. 
We provide techniques to unnest nested fuzzy queries of two blocks in fuzzy databases. We show both theoretically and experimentally that unnesting improves the performance of nested queries significantly. The results obtained in the paper form the basis for unnesting fuzzy queries of arbitrary blocks in fuzzy databases. ICDE Context-Dependent Interpretations of Linguistic Terms in Fuzzy Relational Databases. Weining Zhang,Clement T. Yu,Bryan Reagan,Hiroshi Nakajima 1995 Approaches are proposed to allow fuzzy terms to be interpreted according to the context within which they are used. Such an interpretation is natural and useful. A query-dependent interpretation is proposed to allow a fuzzy term to be interpreted relative to a partial answer of a query. A scaling process is used to transform a pre-defined meaning of a fuzzy term into on appropriate meaning in the given context. Sufficient conditions are given for a nested fuzzy query with RELATIVE quantifiers to be unnested for an efficient evaluation. An attribute-dependent interpretation is proposed to model the applications in which the meaning of a fuzzy term in an attribute must be interpreted with respect to values in other related attributes. Two necessary and sufficient conditions for a tuple to have a unique attribute-dependent interpretation are provided. We describe an interpretation system that allows queries to be processed based on the attribute-dependent interpretation of the data. Two techniques, grouping and shifting, are proposed to improve the implementation. ICDE A Universal Relation Approach to Federated Database Management. J. Leon Zhao,Arie Segev,Abhirup Chatterjee 1995 We describe a manufacturing environment where, driven by market forces, organizations cooperate as well as compete with one another. We argue that a federated database system (FDBS) is appropriate for such an environment. Contrary to conventional wisdom, complete transparency, assumed desirable and mandatory in distributed database systems, is neither desirable nor feasible in this environment. We propose a new approach that is based on schema coordination rather than integration under which each component database is free to change its data structure, attribute naming, and data semantics. A federated metadata model based on the notion of universal relation is introduced for the FDBS. We also develop the query processing paradigm, and present procedures for query transformation and heterogeneity resolution. ICDE Singapore NII : Building the Electronic Universe. Michael Yap 1995 The National Information Infrastructure (NII) is an infrastructure consisting of efficient transport, information processing and service facilities that combine both computer and communication technologies. The needs of business and the public in general drive the definition of the infrastructure. Its goal is to increase the well-being of people as a whole. To deliver the promise of a more effective way to do business, the NII must strive to bring as many of these services or their equivalent to the end-users. Further, these services must be easily accessible and easy to use in addition to being affordable. The NII attempts to (re)engineer the real-world capabilities over the physical telecommunication network. In addition, the NII provides for a rich set of common computing services and supports the reuse of large components, over and above providing a physical telecommunication network. SIGMOD Conference The Handwritten Trie: Indexing Electronic Ink. Walid G. 
Aref,Daniel Barbará,Padmavathi Vallabhaneni 1995 The emergence of the pen as the main interface device for personal digital assistants and pen-computers has made handwritten text, and more generally ink, a first-class object. As for any other type of data, the need of retrieval is a prevailing one. Retrieval of handwritten text is more difficult than that of conventional data since it is necessary to identify a handwritten word given slightly different variations in its shape. The current way of addressing this is by using handwriting recognition, which is prone to errors and limits the expressiveness of ink. Alternatively, one can retrieve from the database handwritten words that are similar to a query handwritten word using techniques borrowed from pattern and speech recognition. In particular, Hidden Markov Models (HMM) can be used as representatives of the handwritten words in the database. However, using HMM techniques to match the input against every item in the database (sequential searching) is unacceptably slow and does not scale up for large ink databases. In this paper, an indexing technique based on HMMs is proposed. The new index is a variation of the trie data structure that uses HMMs and a new search algorithm to provide approximate matching. Each node in the tree contains handwritten letters, where each letter is represented by an HMM. Branching in the trie is based on the ranking of matches given by the HMMs. The new search algorithm is parametrized so that it provides means for controlling the matching quality of the search process via a time-based budget. The index dramatically improves the search time in a database of handwritten words. Due to the variety of platforms for which this work is aimed, ranging from personal digital assistants to desktop computers, we implemented both main-memory and disk-based systems. The implementations are reported in this paper, along with performance results that show the practicality of the technique under a variety of conditions. SIGMOD Conference The Query By Image Content (QBIC) System. Jonathan Ashley,Myron Flickner,James L. Hafner,Denis Lee,Wayne Niblack,Dragutin Petkovic 1995 The Query By Image Content (QBIC) System. SIGMOD Conference Use of a Component Architecture in Integrating Relational and Non-relational Storage Systems. Robert G. Atkinson 1995 Use of a Component Architecture in Integrating Relational and Non-relational Storage Systems. SIGMOD Conference Applying Update Streams in a Soft Real-Time Database System. Brad Adelberg,Hector Garcia-Molina,Ben Kao 1995 "Many papers have examined how to efficiently export a materialized view but to our knowledge none have studied how to efficiently import one. To import a view, i.e., to install a stream of updates, a real-time database system must process new updates in a timely fashion to keep the database ""fresh,"" but at the same time must process transactions and ensure they meet their time constraints. In this paper, we discuss the various properties of updates and views (including staleness) that affect this tradeoff. We also examine, through simulation, four algorithms for scheduling transactions and installing updates in a soft real-time database." SIGMOD Conference A Database Interface for File Updates. Serge Abiteboul,Sophie Cluet,Tova Milo 1995 "Database systems are concerned with structured data. 
Unfortunately, data is still often available in an unstructured manner (e.g., in files) even when it does have a strong internal structure (e.g., electronic documents or programs). In a previous paper [2], we focussed on the use of high-level query languages to access such files and developed optimization techniques to do so. In this paper, we consider how structured data stored in files can be updated using database update languages. The interest of using database languages to manipulate files is twofold. First, it opens database systems to external data. This concerns data residing in files or data transiting on communication channels and possibly coming from other databases [2]. Secondly, it provides high level query/update facilities to systems that usually rely on very primitive linguistic support. (See [6] for recent works in this direction). Similar motivations appear in [4, 5, 7, 8, 11, 12, 13, 14, 15, 17, 19, 20, 21] In a previous paper, we introduced the notion of structuring schemas as a mean of providing a database view on structured data residing in a file. A structuring schema consists of a grammar together with semantic actions (in a database language). We also showed how queries on files expressed in a high-level query language (O2-SQL [3]) could be evaluated efficiently using variations of standard database optimization techniques. The problem of update was mentioned there but remained largely unexplored. This is the topic of the present paper. We argue that updates on files can be expressed conveniently using high-level database update languages that work on the database view of the file. The key problem is how to propagate an update specified on the database (here a view) to the file (here the physical storage). As a first step, we propose a naive way of update propagation: the database view of the file is materialized; the update is performed on the database; the database is ""unparsed"" to produce an updated file. For this, we develop an unparsing technique. The problems that we meet while developing this technique are related to the well-known view update problem. ( See, for instance [9, 10, 16, 23].) The technique relies on the existence of an inverse mapping from the database to the file. We show that the existence of such an inverse mapping results from the use of restricted structuring schemas. The naive technique presents two major drawbacks. It is inefficient: it entails intense data construction and unparsing, most of which dealing with data not involved in the update. It may result in information loss: information in the file, that is not recorded in the database, may be lost in the process. The major contribution of this paper is a combination of techniques that allows to minimize both the data construction and the unparsing work. First, we briefly show how optimization techniques from [2] can be used to focus on the relevant portion of the database and to avoid constructing the entire database. Then we show that for a class of structuring schemas satisfying a locality condition, it is possible to carefully circumscribe the unparsing. Some of the results in the paper are negative. They should not come as a surprise since we are dealing with complex theoretical foundations: language theory (for parsing and unparsing), and first-order logic (for database languages). However, we do present positive results for particular classes of structuring schemas. We believe that the restrictions imposed on these schemas are very acceptable in practice. 
(For instance, all ""real"" examples of structuring schemas that we examined are local.) The paper is organized as follows. In Section 2, we present the update problem and the structuring schemas; in Section 3, a naive technique for update propagation and the unparsing technique. Section 4 introduces a locality condition, and presents a more efficient technique for propagating updates in local structuring schemas. The last section is a conclusion." SIGMOD Conference Broadcast Disks: Data Management for Asymmetric Communications Environments. Swarup Acharya,Rafael Alonso,Michael J. Franklin,Stanley B. Zdonik 1995 "This paper proposes the use of repetitive broadcast as a way of augmenting the memory hierarchy of clients in an asymmetric communication environment. We describe a new technique called ""Broadcast Disks"" for structuring the broadcast in a way that provides improved performance for non-uniformly accessed data. The Broadcast Disk superimposes multiple disks spinning at different speeds on a single broadcast channel--in effect creating an arbitrarily fine-grained memory hierarchy. In addition to proposing and defining the mechanism, a main result of this work is that exploiting the potential of the broadcast structure requires a re-evaluation of basic cache management policies. We examine several ""pure"" cache management policies and develop and measure implementable approximations to these policies. These results and others are presented in a set of simulation studies that substantiates the basic idea and develops some of the intuitions required to design a particular broadcast program." SIGMOD Conference Efficient Optimistic Concurrency Control Using Loosely Synchronized Clocks. Atul Adya,Robert Gruber,Barbara Liskov,Umesh Maheshwari 1995 This paper describes an efficient optimistic concurrency control scheme for use in distributed database systems in which objects are cached and manipulated at client machines while persistent storage and transactional support are provided by servers. The scheme provides both serializability and external consistency for committed transactions; it uses loosely synchronized clocks to achieve global serialization. It stores only a single version of each object, and avoids maintaining any concurrency control information on a per-object basis; instead, it tracks recent invalidations on a per-client basis, an approach that has low in-memory space overhead and no per-object disk overhead. In addition to its low space overheads, the scheme also performs well. The paper presents a simulation study that compares the scheme to adaptive callback locking, the best concurrency control scheme for client-server object-oriented database systems studied to date. The study shows that our scheme outperforms adaptive callback locking for low to moderate contention workloads, and scales better with the number of clients. For high contention workloads, optimism can result in a high abort rate; the scheme presented here is a first step toward a hybrid scheme that we expect to perform well across the full range of workloads. SIGMOD Conference A Critique of ANSI SQL Isolation Levels. "Hal Berenson,Philip A. Bernstein,Jim Gray,Jim Melton,Elizabeth J. O'Neil,Patrick E. O'Neil" 1995 ANSI SQL-92 [MS, ANSI] defines Isolation Levels in terms of phenomena: Dirty Reads, Non-Repeatable Reads, and Phantoms. 
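The Broadcast Disks abstract above superimposes several "disks" broadcast at different relative frequencies on a single channel. A compact generator for such a broadcast program, following the commonly described chunk-interleaving construction, is sketched below; the disk contents and frequencies are made-up examples, not data from the paper.

```python
# Generate a broadcast program from "disks" of pages with relative frequencies.
from math import lcm

def broadcast_program(disks, freqs):
    # disks: lists of pages, hottest first; freqs: relative broadcast frequencies.
    major = lcm(*freqs)
    chunked = []
    for pages, f in zip(disks, freqs):
        n_chunks = major // f
        size = -(-len(pages) // n_chunks)          # ceiling division
        chunked.append([pages[k * size:(k + 1) * size] for k in range(n_chunks)])
    program = []
    for slot in range(major):                      # one major cycle
        for disk in chunked:
            program.extend(disk[slot % len(disk)])
    return program

hot, warm, cold = ["A"], ["B", "C"], ["D", "E", "F", "G"]
print(broadcast_program([hot, warm, cold], freqs=[4, 2, 1]))
# "A" appears four times per major cycle, "B"/"C" twice, and each cold page once.
```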
This paper shows that these phenomena and the ANSI SQL definitions fail to properly characterize several popular isolation levels, including the standard locking implementations of the levels covered. Ambiguity in the statement of the phenomena is investigated and a more formal statement is arrived at; in addition new phenomena that better characterize isolation types are introduced. Finally, an important multiversion isolation type, called Snapshot Isolation, is defined. SIGMOD Conference Semint: A System Prototype for Semantic Integration in Heterogeneous Databases. Wen-Syan Li,Chris Clifton 1995 Semint: A System Prototype for Semantic Integration in Heterogeneous Databases. SIGMOD Conference Fault Tolerant Design of Multimedia Servers. Steven Berson,Leana Golubchik,Richard R. Muntz 1995 Recent technological advances have made multimedia on-demand servers feasible. Two challenging tasks in such systems are: a) satisfying the real-time requirement for continuous delivery of objects at specified bandwidths and b) efficiently servicing multiple clients simultaneously. To accomplish these tasks and realize economies of scale associated with servicing a large user population, the multimedia server can require a large disk subsystem. Although a single disk is fairly reliable, a large disk farm can have an unacceptably high probability of disk failure. Further, due to the real-time constraint, the reliability and availability requirements of multimedia systems are very stringent. In this paper we investigate techniques for providing a high degree of reliability and availability, at low disk storage, bandwidth, and memory costs for on-demand multimedia servers. SIGMOD Conference Semantic Assumptions and Query Evaluation in Temporal Databases. Claudio Bettini,Xiaoyang Sean Wang,Elisa Bertino,Sushil Jajodia 1995 When querying a temporal database, a user often makes certain semantic assumptions on stored temporal data. This paper formalizes and studies two types of semantic assumptions: point-based and interval-based. The point-based assumptions include those assumptions that use interpolation methods, while the interval-based assumptions include those that involve different temporal types (time granularities). Each assumption is viewed as a way to derive certain implicit data from the explicit data stored in the database. The database system must use all explicit as well as (possibly infinite) implicit data to answer user queries. This paper introduces a new method to facilitate such query evaluations. A user query is translated into a system query such that the answer of this system query over the explicit data is the same as that of the user query over the explicit and the implicit data. The paper gives such a translation procedure and studies the properties (safety in particular) of user queries and system queries. SIGMOD Conference Real World Requirements for Decision Support - Implications for RDBMS. Sanju K. Bansal 1995 Real World Requirements for Decision Support - Implications for RDBMS. SIGMOD Conference Hypergraph Based Reorderings of Outer Join Queries with Complex Predicates. Gautam Bhargava,Piyush Goel,Balakrishna R. Iyer 1995 "Complex queries containing outer joins are, for the most part, executed by commercial DBMS products in an ""as written"" manner. Only a very few reorderings of the operations are considered and the benefits of considering comprehensive reordering schemes are not exploited. 
This is largely due to the fact that there are no readily usable results for reordering such operations for relations with duplicates and/or outer join predicates that are other than ""simple."" Most previous approaches have ignored duplicates and complex predicates; the very few that have considered these aspects have suggested approaches that lead to a possibly exponential number of intermediate joins, many of them redundant. Since traditional query graph models are inadequate for modeling outer join queries with complex predicates, we present the needed hypergraph abstraction and algorithms for reordering such queries with joins and outer joins. As a result, the query optimizer can explore a significantly larger space of execution plans, and choose one with a low cost. Further, these algorithms are easily incorporated into well known and widely used enumeration methods such as dynamic programming." SIGMOD Conference An Overview of DB2 Parallel Edition. Chaitanya K. Baru,Gilles Fecteau,Ambuj Goyal,Hui-I Hsiao,Anant Jhingran,Sriram Padmanabhan,Walter G. Wilson 1995 An Overview of DB2 Parallel Edition. SIGMOD Conference Copy Detection Mechanisms for Digital Documents. Sergey Brin,James Davis,Hector Garcia-Molina 1995 In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity. In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters. SIGMOD Conference The LyriC Language: Querying Constraint Objects. Alexander Brodsky,Yoram Kornatzky 1995 We propose a novel data model and its language for querying object-oriented databases where objects may hold spatial, temporal or constraint data, conceptually represented by linear equality and inequality constraints. The proposed LyriC language is designed to provide a uniform and flexible framework for diverse application realms such as (1) constraint-based design in two-, three-, or higher-dimensional space, (2) large-scale optimization and analysis, based mostly on linear programming techniques, and (3) spatial and geographic databases. LyriC extends flat constraint query languages, especially those for linear constraint databases, to structurally complex objects. The extension is based on the object-oriented paradigm, where constraints are treated as first-class objects that are organized in classes. The query language is an extension of the language XSQL, and is built around the idea of extended path expressions. Path expressions in a query traverse nested structures in one sweep. Constraints are used in a query to filter stored constraints and to create new constraint objects. SIGMOD Conference The REACH Active OODBMS. Alejandro P. Buchmann,Alin Deutsch,Jürgen Zimmermann,M. Higa 1995 The REACH Active OODBMS. 
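The Copy Detection Mechanisms (COPS) abstract above describes registering documents and then checking new documents for complete or partial copies. As a rough, hypothetical sketch of that general idea (not the algorithms of the COPS paper itself), the Python fragment below hashes overlapping word shingles of registered documents into a table and reports, for a query document, the fraction of its shingles already seen; the shingle width, the tokenization, and the data structures are all assumptions made for illustration.

# Hypothetical shingle-based copy detection sketch; not the COPS algorithms.
from collections import defaultdict

SHINGLE_WIDTH = 5  # assumed number of words per shingle

def shingles(text, width=SHINGLE_WIDTH):
    words = text.lower().split()
    return {" ".join(words[i:i + width]) for i in range(max(len(words) - width + 1, 1))}

class CopyDetector:
    def __init__(self):
        self.registry = defaultdict(set)  # shingle -> ids of registered documents

    def register(self, doc_id, text):
        for s in shingles(text):
            self.registry[s].add(doc_id)

    def check(self, text):
        # Return, per registered document, the fraction of the query's shingles it shares.
        query = shingles(text)
        hits = defaultdict(int)
        for s in query:
            for doc_id in self.registry.get(s, ()):
                hits[doc_id] += 1
        return {d: n / len(query) for d, n in hits.items()}

detector = CopyDetector()
detector.register("d1", "the quick brown fox jumps over the lazy dog near the river bank")
print(detector.check("a quick brown fox jumps over the lazy dog near the river"))

A high shared fraction would suggest a complete copy and a moderate one a partial copy; where such thresholds should sit is exactly the kind of parameter setting the abstract says the COPS experiments explore.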
SIGMOD Conference "The Data That You Won't Find in Databases: Tutorial panel on data exchange formats." Peter Buneman,David Maier 1995 "The Data That You Won't Find in Databases: Tutorial panel on data exchange formats." SIGMOD Conference The Algres Testbed of CHIMERA: An Active Object-Oriented Database System. Stefano Ceri,Piero Fraternali,Stefano Paraboschi,Giuseppe Psaila 1995 The Algres Testbed of CHIMERA: An Active Object-Oriented Database System. SIGMOD Conference Join Queries with External Text Sources: Execution and Optimization Techniques. Surajit Chaudhuri,Umeshwar Dayal,Tak W. Yan 1995 Text is a pervasive information type, and many applications require querying over text sources in addition to structured data. This paper studies the problem of query processing in a system that loosely integrates an extensible database system and a text retrieval system. We focus on a class of conjunctive queries that include joins between text and structured data, in addition to selections over these two types of data. We adapt techniques from distributed query processing and introduce a novel class of join methods based on probing that is especially useful for joins with text systems, and we present a cost model for the various alternative query processing methods. Experimental results confirm the utility of these methods. The space of query plans is extended due to the additional techniques, and we describe an optimization algorithm for searching this extended space. The techniques we describe in this paper are applicable to other types of external data managers loosely integrated with a database system. SIGMOD Conference The NAOS System. Christine Collet,Thierry Coupaye 1995 The NAOS System. SIGMOD Conference An Online Video Placement Policy based on Bandwith to Space Ratio (BSR). Asit Dan,Dinkar Sitaram 1995 In a video-on-demand server, resource reservation is required to guarantee continuous delivery. Hence any given storage device (or a striping group treated as a single logical device) can serve only up to a fixed number of client access streams. Each storage device is also limited by the number of video files it can store. For the reasons of availability, incremental growth, and heterogeneity, there may be multiple storage devices in a video server environment. Hence, one or more copies of a particular video may be placed on different storage devices. Since the access rates to different videos are not uniform, there may be load imbalance among the devices. In this paper, we propose a dynamic placement policy (called the Bandwidth to Space Ratio (BSR) Policy) that creates and/or deletes replica of a video, and mixes hot and cold videos so as to make the best use of bandwidth and space of a storage device. The proposed policy is evaluated using a simulation study. SIGMOD Conference Dynamic Resource Brokering for Multi-User Query Execution. Diane L. Davison,Goetz Graefe 1995 "We propose a new framework for resource allocation based on concepts from microeconomics. Specifically, we address the difficult problem of managing resources in a multiple-query environment composed of queries with widely varying resource requirements. The central element of the framework is a resource broker that realizes a profit by ""selling"" resources to competing operators using a performance-based ""currency."" The guiding principle for brokering resources is profit maximization. 
In other words, since the currency is derived from the performance objective, the broker can achieve the best performance by making the scheduling and resource allocation decisions that maximize profit. Moreover, the broker employs dynamic techniques and adapts by changing previous allocation decisions while queries are executing. In a first validation study of the framework, we developed a prototype broker that manages memory and disk bandwidth for a multi-user query workload. The performance objective for the prototype broker is to minimize slowdown with the constraint of fairness. Slowdown measures how much higher the response time is in a multi-user environment than in a single-user environment, and fairness measures how even the degradation in response time is among all queries as the system load increases. Our simulation results show the viability of the broker framework and the effectiveness of our query admission and resource allocation policies for multi-user workloads." SIGMOD Conference Objects and SQL: Strange Relations? Panel. Donald R. Deutsch 1995 Objects and SQL: Strange Relations? Panel. SIGMOD Conference Using the CALANDA Time Series Management System. Werner Dreyer,Angelika Kotz Dittrich,Duri Schmidt 1995 Using the CALANDA Time Series Management System. SIGMOD Conference Reducing Multidatabase Query Response Time by Tree Balancing. Weimin Du,Ming-Chien Shan,Umeshwar Dayal 1995 Execution of multidatabase queries differs from that of traditional queries in that sort merge and hash joins are more often favored, as nested loop join requires repeated accesses to external data sources. As a consequence, left deep join trees obtained by traditional (e.g., System-R style) optimizers for multidatabase queries are often suboptimal, with respect to response time, due to the long delay for a sort merge (or hash) join node to produce its last result after the subordinate join node did. In this paper, we present an optimization strategy that first produces an optimal left deep join tree and then reduces the response time using simple tree transformations. This strategy has the advantages of guaranteed minimum total resource usage, improved response time, and low optimization overhead. We describe a class of basic transformations that is the cornerstone of our approach. Then we present algorithms that effectively apply basic transformations to balance a left deep join tree, and discuss how the technique can be incorporated into existing query optimizers. SIGMOD Conference Research and Products: Are They Relevant To Each Other? (Panel). Herb Edelstein 1995 Research and Products: Are They Relevant To Each Other? (Panel). SIGMOD Conference Keynote Address. Larry Ellison 1995 Keynote Address. SIGMOD Conference Keynote Address. Robert S. Epstein 1995 Keynote Address. SIGMOD Conference Indexing Multimedia Databases (Tutorial). Christos Faloutsos 1995 Indexing Multimedia Databases (Tutorial). SIGMOD Conference FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Christos Faloutsos,King-Ip Lin 1995 "A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [25]. 
Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs) to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [8]); the nearest-neighbor or best-match query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before, and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for. We introduce an older method from pattern recognition, namely, Multi-Dimensional Scaling (MDS) [51]; although unsuitable for indexing, we use it as a yardstick for our method. Then, we propose a much faster algorithm to solve the problem at hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set." SIGMOD Conference The SPIFFI Scalable Video-on-Demand System. Craig S. Freedman,David J. DeWitt 1995 This paper presents a simulation study of a video-on-demand system. We present video server algorithms for real-time disk scheduling, prefetching, and buffer pool management. The performance of these algorithms is compared against the performance of simpler algorithms such as elevator and round-robin disk scheduling and global LRU buffer pool management. Finally, we show that the SPIFFI video-on-demand system scales nearly linearly as the number of disks, videos, and terminals is increased. SIGMOD Conference Towards an Effective Calculus for Object Query Languages. Leonidas Fegaras,David Maier 1995 We define a standard of effectiveness for a database calculus relative to a query language. Effectiveness judges suitability to serve as a processing framework for the query language, and comprises aspects of coverage, manipulability and efficient evaluation. We present the monoid calculus, and argue its effectiveness for object-oriented query languages, exemplified by OQL of ODMG-93. The monoid calculus readily captures such features as multiple collection types, aggregations, arbitrary composition of type constructors and nested query expressions. We also show how to extend the monoid calculus to deal with vectors and arrays in more expressive ways than current query languages do, and illustrate how it can handle identity and updates. SIGMOD Conference A General Solution of the n-dimensional B-tree Problem. Michael Freeston 1995 We present a generic solution to a problem which lies at the heart of the unpredictable worst-case performance characteristics of a wide class of multi-dimensional index designs: those which employ a recursive partitioning of the data space. We then show how this solution can produce modified designs with fully predictable and controllable worst-case characteristics. 
In particular, we show how the recursive partitioning of an n-dimensional dataspace can be represented in such a way that the characteristics of the one-dimensional B-tree are preserved in n dimensions, as far as is topologically possible, i.e., a representation guaranteeing logarithmic access and update time, while also guaranteeing a one-third minimum occupancy of both data and index nodes. SIGMOD Conference "``One Size Fits All'' Database Architectures Do Not Work for DSS." Clark D. French 1995 "``One Size Fits All'' Database Architectures Do Not Work for DSS." SIGMOD Conference OFL: A Functional Execution Model for Object Query Languages. Georges Gardarin,Fernando Machuca,Philippe Pucheral 1995 "We present a functional paradigm for efficiently querying abstract collections of complex objects. Abstract collections are used to model class extents, multivalued attributes as well as indexes or hashing tables. Our paradigm includes a functional language called OFL (Object Functional Language) and a supporting execution model based on graph traversals. OFL is able to support any complex object algebra with recursion as macros. It is an appropriate target language for OQL-like query compilers. The execution model provides various strategies including set-oriented and pipelined traversals. OFL has been implemented on top of an object manager. Measures of a typical query extracted from a geographical benchmark show the value of hybrid strategies integrating pipelined and set-oriented evaluations. They also show the potential of function result memorization, a typical optimization approach known as ""Memoization"" in functional languages." SIGMOD Conference Parallel Database Systems 101. Jim Gray 1995 Parallel Database Systems 101. SIGMOD Conference The SAMOS Active DBMS Prototype. Stella Gatziu,Andreas Geppert,Klaus R. Dittrich 1995 The SAMOS Active DBMS Prototype. SIGMOD Conference Incremental Maintenance of Views with Duplicates. Timothy Griffin,Leonid Libkin 1995 We study the problem of efficient maintenance of materialized views that may contain duplicates. This problem is particularly important when queries against such views involve aggregate functions, which need duplicates to produce correct results. Unlike most work on the view maintenance problem that is based on an algorithmic approach, our approach is algebraic and based on equational reasoning. This approach has a number of advantages: it is robust and easily extendible to new language constructs, it produces output that can be used by query optimizers, and it simplifies correctness proofs. We use a natural extension of the relational algebra operations to bags (multisets) as our basic language. We present an algorithm that propagates changes from base relations to materialized views. This algorithm is based on reasoning about equivalence of bag-valued expressions. We prove that it is correct and preserves a certain notion of minimality that ensures that no unnecessary tuples are computed. Although it is generally only a heuristic that computing changes to the view rather than recomputing the view from scratch is more efficient, we prove results saying that under normal circumstances one should expect the change propagation algorithm to be significantly faster and more space efficient than completely recomputing the view. We also show that our approach interacts nicely with aggregate functions, allowing their correct evaluation on views that change. SIGMOD Conference PTool: A Light Weight Persistent Object Manager. Robert L. 
Grossman,David Hanley,Xiao Qin 1995 PTool: A Light Weight Persistent Object Manager. SIGMOD Conference Informix Online XPS. Robert H. Gerber 1995 Informix Online XPS. SIGMOD Conference DIRECTV and Oracle Rdb: The Challenges of VLDB Transaction Processing. William L. Gettys 1995 DIRECTV and Oracle Rdb: The Challenges of VLDB Transaction Processing. SIGMOD Conference Storage Technology: RAID and Beyond. Garth A. Gibson 1995 Storage Technology: RAID and Beyond. SIGMOD Conference Adapting Materialized Views after Redefinitions. Ashish Gupta,Inderpal Singh Mumick,Kenneth A. Ross 1995 "We consider a variant of the view maintenance problem: How does one keep a materialized view up-to-date when the view definition itself changes? Can one do better than recomputing the view from the base relations? Traditional view maintenance tries to maintain the materialized view in response to modifications to the base relations; we try to ""adapt"" the view in response to changes in the view definition. Such techniques are needed for applications where the user can change queries dynamically and see the changes in the results fast. Data archaeology, data visualization, and dynamic queries are examples of such applications. We consider all possible redefinitions of SQL SELECT-FROM-WHERE-GROUPBY, UNION, and EXCEPT views, and show how these views can be adapted using the old materialization for the cases where it is possible to do so. We identify extra information that can be kept with a materialization to facilitate redefinition. Multiple simultaneous changes to a view can be handled without necessarily materializing intermediate results. We identify guidelines for users and database administrators that can be used to facilitate efficient view adaptation." SIGMOD Conference Things Every Update Replication Customer Should Know. Rob Golding 1995 Things Every Update Replication Customer Should Know. SIGMOD Conference Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. Joachim Hammer,Hector Garcia-Molina,Kelly Ireland,Yannis Papakonstantinou,Jeffrey D. Ullman,Jennifer Widom 1995 Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. SIGMOD Conference The Merge/Purge Problem for Large Databases. Mauricio A. Hernández,Salvatore J. Stolfo 1995 Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate that this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass. SIGMOD Conference Enterprise Transaction Processing on Windows NT. Greg Hope 1995 Enterprise Transaction Processing on Windows NT. 
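The merge/purge abstract above centers on the sorted neighborhood method: sort the records on a discriminating key and compare only records that fall within a small sliding window. The Python sketch below shows that window-scan skeleton under assumed details; the sort key, the name-similarity test, the window size, and the record layout are placeholders rather than the paper's own rules, and a multi-pass variant would rerun it with different keys and combine the matches via transitive closure.

# Minimal sorted-neighborhood sketch; key, matcher, and window size are assumptions.
from difflib import SequenceMatcher

WINDOW = 3  # assumed sliding-window size

def sort_key(record):
    # Placeholder key: last-name prefix plus first initial.
    last, first = record["name"].split(",")
    return (last.strip()[:4].lower(), first.strip()[:1].lower())

def is_match(a, b):
    # Placeholder matching rule: fuzzy name similarity above a fixed threshold.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() > 0.8

def sorted_neighborhood(records, window=WINDOW):
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Compare the current record only against its predecessors inside the window.
        for other in ordered[max(0, i - window + 1):i]:
            if is_match(other, rec):
                pairs.append((other["id"], rec["id"]))
    return pairs

people = [
    {"id": 1, "name": "Hernandez, Mauricio"},
    {"id": 2, "name": "Hernandes, Mauricio"},
    {"id": 3, "name": "Stolfo, Salvatore"},
]
print(sorted_neighborhood(people))  # the two near-duplicate records should pair up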
SIGMOD Conference Enhancing Database Correctness: a Statistical Approach. Wen-Chi Hou,Zhongyang Zhang 1995 In this paper, we introduce a new type of integrity constraint, which we call a statistical constraint, and discuss its applicability to enhancing database correctness. Statistical constraints manifest embedded relationships among current attribute values in the database and are characterized by their probabilistic nature. They can be used to detect potential errors not easily detected by the conventional constraints. Methods for extracting statistical constraints from a relation and enforcement of such constraints are described. Preliminary performance evaluation of enforcing statistical constraints on a real life database is also presented. SIGMOD Conference DataMine - Interactive Rule Discovery System. Tomasz Imielinski,Aashu Virmani 1995 DataMine - Interactive Rule Discovery System. SIGMOD Conference Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Yannis E. Ioannidis,Viswanath Poosala 1995 Many current database systems use histograms to approximate the frequency distribution of values in the attributes of relations and based on them estimate query result sizes and access plan costs. In choosing among the various histograms, one has to balance between two conflicting goals: optimality, so that generated estimates have the least error, and practicality, so that histograms can be constructed and maintained efficiently. In this paper, we present both theoretical and experimental results on several issues related to this trade-off. Our overall conclusion is that the most effective approach is to focus on the class of histograms that accurately maintain the frequencies of a few attribute values and assume the uniform distribution for the rest, and choose for each relation the histogram in that class that is optimal for a self-join query. SIGMOD Conference High Availability of Commercial Applications. Kestutis Ivinskis 1995 "The increased performance capabilities of UNIX server systems have led to their acceptance as the server of choice for medium-sized and large organizations. But performance is just one facet. Another facet is the end user perception of the availability of an information system. Traditional mainframe based IS shops have a long experience in supplying computing services to their commercial end users. The ultimate goal of the end user is to have no downtimes for his work at his PC or workstation terminal. Key issues related to system availability in client/server based information systems remain the same as in the mainframe based world, e.g. system responsiveness, maximum downtime per year and maximum number of system outages per year. But there are also new aspects, which have been introduced into the discussion. In a multi-tiered client/server based information system the OLTP workload is distributed on different servers. Hence one can ask: Why should a failure of one server automatically imply downtime for the whole system? Can't most of the system continue to operate? Redistribution of the workload on the remaining active servers can be used to attack this problem. Workload balancing can be applied for replicated system services. Other techniques have to be applied for non-replicated system services. This paper considers client/server based applications running in a local or wide area network of computers in a distributed system." SIGMOD Conference The ECRC Multi Database System. 
Willem Jonker,Heribert Schütz 1995 The ECRC Multi Database System. SIGMOD Conference VisDB: A System for Visualizing Large Databases. Daniel A. Keim,Hans-Peter Kriegel 1995 The VisDB system developed at the University of Munich is a sophisticated tool for visualizing and analyzing large databases. The key idea of the VisDB system is to support the exploration of large databases by using the phenomenal abilities of the human vision system which is able to analyze visualizations of mid-size to large amounts of data very efficiently. The goal of the VisDB system is to provide visualizations of large portions of the database, allowing properties of the data and structure in the data to become perceptually apparent. SIGMOD Conference Enterprise Objects Framework, A Second Generation Object-Relational Enabler. Charly Kleissner 1995 "Today's information system executives desperately need to improve programmer productivity and reduce software maintenance costs. They are demanding flexibility in frameworks and architectures in order to meet unforeseen changes (see [Yankee 94]). Adaptability is a major requirement of most companies' information systems efforts. Management of change is one of the key computing concepts of the 1990s. Object-oriented tools and development frameworks are starting to deliver the benefits of increased productivity and flexibility. These next-generation products now need to be combined with relational databases to leverage investments and facilitate access to business data. Object-Relational Enablers automate the process of storing complex objects in a relational database management system (see [Aberdeen 94]). The Enterprise Objects Framework product is a second generation product bringing the benefits of object-oriented programming to relational database application development. Enterprise Objects Framework enables developers to construct reusable business objects that combine business logic with persistent storage in industry-standard relational databases. Enterprise objects are first class citizens in the NEXTSTEP and OpenStep developer and user environments. They can be geographically distributed throughout heterogeneous servers within an enterprise using the Portable Distributed Objects product (see [NeXT-DO 94]). In this extended abstract we first describe the enterprise object distribution model and then give a brief synopsis of how relational data is mapped into objects. We then present an outline of the system architecture, explain how objects are mapped to multiple tables, and summarize the transaction semantics as well as the application development life-cycle. We conclude with an outlook on future development." SIGMOD Conference Order-of-Magnitude Advantage of TPC-C Through Massive Parallelism. Charles Levine 1995 TPC Benchmark™ C (TPC-C) is the modern standard for measuring OLTP performance. Running TPC-C, Tandem demonstrated a massively parallel configuration of 112 CPUs which achieved ten times higher performance than any other system previously measured (and today is still better by a factor of five). This result qualifies as the largest industry-standard benchmark ever run. This paper briefly describes how the benchmark was configured and the results which were obtained. SIGMOD Conference Efficient Maintenance of Materialized Mediated Views. James J. Lu,Guido Moerkotte,Joachim Schü,V. S. 
Subrahmanian 1995 "Integrating data and knowledge from multiple heterogeneous sources -- like databases, knowledge bases or specific software packages -- is often required for answering certain queries. Recently, a powerful framework for defining mediated views spanning multiple knowledge bases by a set of constrained rules was proposed [24, 4, 16]. We investigate the materialization of these views by unfolding the view definition and the efficient maintenance of the resulting materialized mediated view in case of updates. In doing so, we consider two kinds of updates: updates to the view and updates to the underlying sources. For each of these two cases, several efficient algorithms maintaining materialized mediated views are given. We improve on previous algorithms like the DRed algorithm [12] and introduce a new fixpoint operator WP which -- in contrast to the standard fixpoint operator TP [9] -- allows us to correctly capture the update's semantics without any recomputation of the materialized view." SIGMOD Conference QBI: Query By Icons. Antonio Massari,Stefano Pavani,Lorenzo Saladini,Panos K. Chrysanthis 1995 QBI is an icon-based query processing and exploration facility for large distributed databases [3]. As opposed to other interactive query interfaces, it combines (1) a pure iconic specification, i.e., no diagrams of any form, only icon manipulation, with (2) intensional browsing or metaquery tools that assist in the formulation of complete queries without involving path specification or access to the actual data in the database. Path expressions are automatically generated by QBI and, irrespective of their length, represented by a single icon, allowing for better use of the screen. It requires no special knowledge of the content of the underlying database, nor an understanding of the details of the database schema. Hence, QBI is domain independent and equally useful to both unsophisticated and expert users. SIGMOD Conference An Overview of the Emerging Third-Generation SQL Standard (Tutorial). Nelson Mendonça Mattos,Jim Melton 1995 An Overview of the Emerging Third-Generation SQL Standard (Tutorial). SIGMOD Conference Recovery Protocols for Shared Memory Database Systems. Lory D. Molesky,Krithi Ramamritham 1995 Significant performance advantages can be gained by implementing a database system on a cache-coherent shared memory multiprocessor. However, problems arise when failures occur. A single node (where a node refers to a processor/memory pair) crash may require a reboot of the entire shared memory system. Fortunately, shared memory multiprocessors that isolate individual node failures are currently being developed. Even with these, because of the side effects of the cache coherency protocol, a transaction executing strictly on a single node may become dependent on the validity of the memory of many nodes, thereby inducing unnecessary transaction aborts. This happens when database objects, such as records, and database support structures, such as index structures and shared lock tables, are stored in shared memory. In this paper, we propose crash recovery protocols for shared memory database systems which avoid the unnecessary transaction aborts induced by the effects of using shared physical memory. Our recovery protocols guarantee that if one or more nodes crash, all the effects of active transactions running on the crashed nodes will be undone, and no effects of transactions running on nodes which did not crash will be undone. 
In order to show the practicality of our protocols, we discuss how existing features of cache-coherent multiprocessors can be utilized to implement these recovery protocols. Specifically, we demonstrate that (1) for many types of database objects and support structures, volatile (in-memory) logging is sufficient to avoid unnecessary transaction aborts, and (2) a very low overhead implementation of this strategy can be achieved with existing multiprocessor features. SIGMOD Conference The Lotus Notes Storage System. Kenneth Moore 1995 Lotus Notes is a commercial product that empowers individuals and organizations to collaborate and share information [1]. Notes enables the easy development of applications such as messaging, document management, workflow, and asynchronous conferencing. Notes applications can be deployed globally, across independent organizations, among a heterogeneous network of loosely coupled computers that range in size from small notebooks to large multi-processor systems. The third major release of Lotus Notes occurred in May 1993. Notes is a client-server product, with clients available on Windows, OS/2, Macintosh, SCO UNIX, HP-UX, AIX, and Solaris. The server is available on Windows, OS/2, Windows NT (for Intel processors), NetWare, SCO UNIX, HP-UX, AIX, and Solaris. SIGMOD Conference Integration Approaches for CIM. Moira C. Norrie 1995 In response to pressures to reduce product lead times, manufacturing companies are increasingly aware of the need for some form of integration along the whole product chain. Engineering tasks must be coordinated and data exchanged between the various specialised tools. An enterprise has two main tracks of information flow, namely technical and managerial, and product data management spans both tracks. On the technical track, applications are highly specialised supporting tasks such as product design (CAD) and the programming of numerically controlled machines (CAM). Generally, the various application systems on the technical track are referred to as CAx systems. CAx systems may not only differ in terms of functionality but also in terms of the amount and type of data managed, the run-time environment and performance characteristics. For complete support of Computer Integrated Manufacturing (CIM), we must be able to integrate existing technical and administrative component application systems. These component systems vary in their data management support and many CAx systems store their data directly in files rather than in a database system. The issues are how to describe the dependencies between these component systems and ensure system-wide data consistency. A particularly difficult problem is that of how to interface existing application systems in such a way that their operation can be monitored and controlled and a global transaction scheme provided. Integration must be achieved in a way that supports system evolution in terms of the introduction and replacement of application systems. This is particularly important given the trend towards the notions of the extended enterprise and virtual factories in which a particular product chain may span several enterprises. Further, emerging legal statutes (especially in relation to environmental factors) are resulting in changes to the requirements for product data management. Enterprise integration must be both flexible and dynamic. 
The best way of achieving this is to integrate component systems by means of a control layer which coordinates tasks based on explicit inter-system dependencies amenable to both direct view and update. Product Data Management systems (also known as Engineering Databases) have been developed for the integration of CAx systems by managing product data centrally. One problem with a centralised system controlling access to data is that its availability is critical to the operation of all component systems. In an effort to overcome this, systems are being developed which replicate the metadata and data required for coordination. Alternatively, coordination approaches have been proposed which aim to maximise component subsystem autonomy and increase CIM system flexibility. These systems place less emphasis on data sharing and more emphasis on task coordination. The integration effort is minimised and only that information strictly essential to coordination is managed centrally. SIGMOD Conference Cover Your Assets. Michael A. Olson 1995 Cover Your Assets. SIGMOD Conference Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees. Dimitris Papadias,Yannis Theodoridis,Timos K. Sellis,Max J. Egenhofer 1995 Topological Relations in the World of Minimum Bounding Rectangles: A Study with R-trees. SIGMOD Conference An Effective Hash Based Algorithm for Mining Association Rules. Jong Soo Park,Ming-Syan Chen,Philip S. Yu 1995 In this paper, we examine the issue of mining association rules among items in a large database of sales transactions. The mining of association rules can be mapped into the problem of discovering large itemsets where a large itemset is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing a candidate set of itemsets first and then, identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally this is done iteratively for each large k-itemset in increasing order of k where a large k-itemset is a large itemset with k items. To determine large itemsets from a huge number of candidate large itemsets in early iterations is usually the dominating factor for the overall data mining performance. To address this issue, we propose an effective hash-based algorithm for the candidate set generation. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is, in orders of magnitude, smaller than that by previous methods, thus resolving the performance bottleneck. Note that the generation of smaller candidate sets enables us to effectively trim the transaction database size at a much earlier stage of the iterations, thereby reducing the computational cost for later iterations significantly. Extensive simulation study is conducted to evaluate performance of the proposed algorithm. SIGMOD Conference Efendi: Federated Database System of Cadlab. Elke Radeke,Ralf Böttger,Bernd Burkert,Yaron Engel,Gerd Kachel,Silvia Kolmschlag,Dietmar Nolte 1995 Efendi: Federated Database System of Cadlab. SIGMOD Conference Leveraging The Information Asset. Janet Perna 1995 Data is a corporate asset, and being able to derive more information from data can provide database users with a competitive advantage. For example, catching on to trends quickly can reduce unwanted store inventory and lower capital outlay for the same profit. 
If you have store sales data by product analyzed on a daily basis, that can make a 2-3% difference in margin -- and in a business where margins might be 4%, this is a significant competitive edge. This paper will cover what technology is needed by customers to leverage their information assets. Real-time access to production point of sale information, database mining for analysis to detect trends immediately, high performance, multi-vendor database connectivity, and cooperation among heterogeneous clients and servers are among the customer needs we are seeing in the marketplace. SIGMOD Conference OODB Indexing by Class-Division. Sridhar Ramaswamy,Paris C. Kanellakis 1995 "Indexing a class hierarchy, in order to efficiently search or update the objects of a class according to a (range of) value(s) of an attribute, impacts OODB performance heavily. For this indexing problem, most systems use the class hierarchy index (CH) technique of [15] implemented using B+-trees. Other techniques, such as those of [14, 18, 31], can lead to improved average-case performance but involve the implementation of new data-structures. As a special form of external dynamic two-dimensional range searching, this OODB indexing problem is solvable within reasonable worst-case bounds [12]. Based on this insight, we have developed a technique, called indexing by class-division (CD), which we believe can be used as a practical alternative to CH. We present an optimized implementation and experimental validation of CD's average-case performance. The main advantages of the CD technique are: (1) CD is an extension of CH that provides a significant speed-up over CH for a wide spectrum of range queries--this speed-up is at least linear in the number of classes queried for uniform data and larger otherwise; and (2) CD queries, updates and concurrent use are implementable using existing B+-tree technology. The basic idea of class-division involves a time-space tradeoff and CD requires some space and update overhead in comparison to CH. In practice, this overhead is a small factor (2 to 3) and, in the worst case, is bounded by the depth of the hierarchy and the logarithm of its size." SIGMOD Conference Nearest Neighbor Queries. Nick Roussopoulos,Stephen Kelley,Frédéric Vincent 1995 A frequently encountered type of query in Geographic Information Systems is to find the k nearest neighbor objects to a given point in space. Processing such queries requires substantially different search algorithms than those for location or range queries. In this paper we present an efficient branch-and-bound R-tree traversal algorithm to find the nearest neighbor object to a point, and then generalize it to finding the k nearest neighbors. We also discuss metrics for an optimistic and a pessimistic search ordering strategy as well as for pruning. Finally, we present the results of several experiments obtained using the implementation of our algorithm and examine the behavior of the metrics and the scalability of the algorithm. SIGMOD Conference Adaptive Parallel Aggregation Algorithms. Ambuj Shatdal,Jeffrey F. Naughton 1995 Aggregation and duplicate removal are common in SQL queries. However, in the parallel query processing literature, aggregate processing has received surprisingly little attention; furthermore, for each of the traditional parallel aggregation algorithms, there is a range of grouping selectivities where the algorithm performs poorly. 
In this work, we propose new algorithms that dynamically adapt, at query evaluation time, in response to observed grouping selectivities. Performance analysis via analytical modeling and an implementation on a workstation-cluster shows that the proposed algorithms are able to perform well for all grouping selectivities. Finally, we study the effect of data skew and show that for certain data sets the proposed algorithms can even outperform the best of traditional approaches. SIGMOD Conference Workflow Automation: Applications, Technology, and Research (Tutorial). Amit P. Sheth 1995 Workflow Automation: Applications, Technology, and Research (Tutorial). SIGMOD Conference InfoHarness: A System for Search and Retrieval of Heterogeneous Information. Leon A. Shklar,Amit P. Sheth,Vipul Kashyap,Satish Thatte 1995 Enormous amounts of heterogeneous information have been accumulated within corporations, government organizations and universities. It is becoming increasingly easier to create new information, but knowledge about the existence, location, and means of retrieval of information has become so confusing as to give rise to the phenomenon of write-only databases. SIGMOD Conference VERSANT Replication: Supporting Fault-Tolerant Object Databases. Yuh-Ming Shyy,H. Stephen Au-Yeung,C. P. Chou 1995 VERSANT Replication: Supporting Fault-Tolerant Object Databases. SIGMOD Conference Temporal Conditions and Integrity Constraints in Active Database Systems. A. Prasad Sistla,Ouri Wolfson 1995 In this paper, we present a unified formalism, based on Past Temporal Logic, for specifying conditions and events in the rules for active database systems. This language permits specification of many time varying properties of database systems. It also permits specification of temporal aggregates. We present an efficient incremental algorithm for detecting conditions specified in this language. The given algorithm, for a subclass of the logic, was implemented on top of Sybase. SIGMOD Conference Object-Oriented, Rapid Application Development in a PC Database Environment. Fox Development Team 1995 Object-Oriented, Rapid Application Development in a PC Database Environment. SIGMOD Conference Data Extraction and Transformation for the Data Warehouse. Cass Squire 1995 Data Extraction and Transformation for the Data Warehouse. SIGMOD Conference Paradise: A Database System for GIS Applications. Paradise Team 1995 Paradise: A Database System for GIS Applications. SIGMOD Conference SHORE: Combining the Best Features of OODBMS and File Systems. Shore Team 1995 SHORE: Combining the Best Features of OODBMS and File Systems. SIGMOD Conference Upsizing from File Server to Client Server Architectures. The Access Team 1995 Upsizing from File Server to Client Server Architectures. SIGMOD Conference Design and Implementation of Advanced Knowledge Processing in the KDMS KRISYS (Demonstration Description). Joachim Thomas,Stefan Deßloch,Nelson Mendonça Mattos 1995 Design and Implementation of Advanced Knowledge Processing in the KDMS KRISYS (Demonstration Description). SIGMOD Conference Pattern Matching and Pattern Discovery in Scientific, Program, and Document Databases. Jason Tsong-Li Wang,Kaizhong Zhang,Dennis Shasha 1995 Over the past several years we have created or borrowed algorithms for combinatorial pattern matching and pattern discovery on sequences [2] and trees. In matching problems, given a pattern, a set of data objects and a distance metric, we find the distance between the pattern and one or more data objects. 
In discovery problems by contrast, given a set of objects, a metric, and a distance, we seek a pattern that matches many of those objects within the given distance. (So, discovery is a lot like data mining.) Our toolkit performs both matching and discovery with current targeted applications in molecular biology and document comparison. SIGMOD Conference Implementing Crash Recovery in QuickStore: A Performance Study. Seth J. White,David J. DeWitt 1995 "Implementing crash recovery in an Object-Oriented Database System (OODBMS) raises several challenging issues for performance that are not present in traditional DBMSs. These performance concerns result both from significant architectural differences between OODBMSs and traditional database systems and differences in OODBMS's target applications. This paper compares the performance of several alternative approaches to implementing crash recovery in an OODBMS based on a client-server architecture. The four basic recovery techniques examined in the paper are termed page differencing, sub-page differencing, whole-page logging, and redo-at-server. All of the recovery techniques were implemented in the context of QuickStore, a memory-mapped store built using the EXODUS Storage Manager, and their performance is compared using the OO7 database benchmark. The results of the performance study show that the techniques based on differencing generally provide superior performance to whole-page logging." SIGMOD Conference Parallel Evaluation of Multi-Join Queries. Annita N. Wilschut,Jan Flokstra,Peter M. G. Apers 1995 A number of execution strategies for parallel evaluation of multi-join queries have been proposed in the literature; their performance was evaluated by simulation. In this paper we give a comparative performance evaluation of four execution strategies by implementing all of them on the same parallel database system, PRISMA/DB. Experiments have been done up to 80 processors. The basic strategy is to first determine an execution schedule with minimum total cost and then parallelize this schedule with one of the four execution strategies. These strategies, coming from the literature, are named: Sequential Parallel, Synchronous Execution, Segmented Right-Deep, and Full Parallel. Based on the experiments clear guidelines are given when to use which strategy. SIGMOD Conference Carnot and InfoSleuth: Database Technology and the World Wide Web. Darrell Woelk,William Bohrer,Nigel Jacobs,KayLiang Ong,Christine Tomlinson,C. Unnikrishnan 1995 Carnot and InfoSleuth: Database Technology and the World Wide Web. SIGMOD Conference View Maintenance in a Warehousing Environment. Yue Zhuge,Hector Garcia-Molina,Joachim Hammer,Jennifer Widom 1995 "A warehouse is a repository of integrated information drawn from remote data sources. Since a warehouse effectively implements materialized views, we must maintain the views as the data sources are updated. This view maintenance problem differs from the traditional one in that the view definition and the base data are now decoupled. We show that this decoupling can result in anomalies if traditional algorithms are applied. We introduce a new algorithm, ECA (for ""Eager Compensating Algorithm""), that eliminates the anomalies. ECA is based on previous incremental view maintenance algorithms, but extra ""compensating"" queries are used to eliminate anomalies. 
We also introduce two streamlined versions of ECA for special cases of views and updates, and we present an initial performance study that compares ECA to a view recomputation algorithm in terms of messages transmitted, data transferred, and I/O costs." VLDB Using Formal Methods to Reason about Semantics-Based Decompositions of Transactions. Paul Ammann,Sushil Jajodia,Indrakshi Ray 1995 Using Formal Methods to Reason about Semantics-Based Decompositions of Transactions. VLDB Efficient Incremental Garbage Collection for Client-Server Object Database Systems. Laurent Amsaleg,Michael J. Franklin,Olivier Gruber 1995 Efficient Incremental Garbage Collection for Client-Server Object Database Systems. VLDB Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. Rakesh Agrawal,King-Ip Lin,Harpreet S. Sawhney,Kyuseok Shim 1995 Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. VLDB Querying Shapes of Histories. Rakesh Agrawal,Giuseppe Psaila,Edward L. Wimmers,Mohamed Zaït 1995 Querying Shapes of Histories. VLDB A Practical and Modular Implementation of Extended Transaction Models. Roger S. Barga,Calton Pu 1995 A Practical and Modular Implementation of Extended Transaction Models. VLDB A Non-Uniform Data Fragmentation Strategy for Parallel Main-Memory Database Systems. Nick Bassiliades,Ioannis P. Vlahavas 1995 A Non-Uniform Data Fragmentation Strategy for Parallel Main-Memory Database Systems. VLDB Document Management as a Database Problem. Rudolf Bayer 1995 Document Management as a Database Problem. VLDB Value-cognizant Speculative Concurrency Control. Azer Bestavros,Spyridon Braoudakis 1995 Value-cognizant Speculative Concurrency Control. VLDB "Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension." Alberto Belussi,Christos Faloutsos 1995 "Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension." VLDB Applying Database Technology in the ADSM Mass Storage System. Luis-Felipe Cabrera,Robert M. Rees,Wayne Hineman 1995 Applying Database Technology in the ADSM Mass Storage System. VLDB A Data Transformation System for Biological Data Sources. Peter Buneman,Susan B. Davidson,Kyle Hart,G. Christian Overton,Limsoon Wong 1995 A Data Transformation System for Biological Data Sources. VLDB Near Neighbor Search in Large Metric Spaces. Sergey Brin 1995 Near Neighbor Search in Large Metric Spaces. VLDB OS Support for VLDBs: Unix Enhancements for the Teradata Data Base. John Catozzi,Sorana Rabinovici 1995 OS Support for VLDBs: Unix Enhancements for the Teradata Data Base. VLDB BigSur: A System For the Management of Earth Science Data. Paul Brown,Michael Stonebraker 1995 BigSur: A System For the Management of Earth Science Data. VLDB Declustering Databases on Heterogeneous Disk Systems. Ling Tony Chen,Doron Rotem,Sridhar Seshadri 1995 Declustering Databases on Heterogeneous Disk Systems. VLDB Retrieval of Composite Multimedia Objects. Surajit Chaudhuri,Shahram Ghandeharizadeh,Cyrus Shahabi 1995 Retrieval of Composite Multimedia Objects. VLDB Database De-Centralization - A Practical Approach. Tor Didriksen,César A. Galindo-Legaria,Eirik Dahle 1995 Database De-Centralization - A Practical Approach. VLDB A Performance Evaluation of OID Mapping Techniques. André Eickler,Carsten Andreas Gerlhof,Donald Kossmann 1995 A Performance Evaluation of OID Mapping Techniques. VLDB The hBP-tree: A Modified hB-tree Supporting Concurrency, Recovery and Node Consolidation. 
Georgios Evangelidis,David B. Lomet,Betty Salzberg 1995 The hBP-tree: A Modified hB-tree Supporting Concurrency, Recovery and Node Consolidation. VLDB Managing a DB2 Parallel Edition Database. Gilles Fecteau 1995 Managing a DB2 Parallel Edition Database. VLDB Schema and Database Evolution in the O2 Object Database System. Fabrizio Ferrandina,Thorsten Meyer,Roberto Zicari,Guy Ferran,Joëlle Madec 1995 Schema and Database Evolution in the O2 Object Database System. VLDB Processing Object-Oriented Queries with Invertible Late Bound Functions. Staffan Flodin,Tore Risch 1995 Processing Object-Oriented Queries with Invertible Late Bound Functions. VLDB A Cost Model for Clustered Object-Oriented Databases. Georges Gardarin,Jean-Robert Gruser,Zhao-Hui Tang 1995 A Cost Model for Clustered Object-Oriented Databases. VLDB Improving Performance in Replicated Databases through Relaxed Coherency. Rainer Gallersdörfer,Matthias Nicola 1995 Improving Performance in Replicated Databases through Relaxed Coherency. VLDB Index Concurrency Control in Firm Real-Time Database Systems. Brajesh Goyal,Jayant R. Haritsa,S. Seshadri,V. Srinivasan 1995 Index Concurrency Control in Firm Real-Time Database Systems. VLDB Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. Luis Gravano,Hector Garcia-Molina 1995 As large numbers of text databases have become available on the Internet, it is getting harder to locate the right sources for given queries. In this paper we present gGlOSS, a generalized Glossary-Of-Servers Server, that keeps statistics on the available databases to estimate which databases are the potentially most useful for a given query. gGlOSS extends our previous work, which focused on databases using the boolean model of document retrieval, to cover databases using the more sophisticated vector-space retrieval model. We evaluate our new techniques using real-user queries and 53 databases. Finally, we further generalize our approach by showing how to build a hierarchy of gGlOSS brokers. The top level of the hierarchy is so small it could be widely replicated, even at end-user workstations. VLDB Aggregate-Query Processing in Data Warehousing Environments. Ashish Gupta,Venky Harinarayan,Dallan Quass 1995 Aggregate-Query Processing in Data Warehousing Environments. VLDB Sampling-Based Estimation of the Number of Distinct Values of an Attribute. Peter J. Haas,Jeffrey F. Naughton,S. Seshadri,Lynne Stokes 1995 Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB OPOSSUM: Desk-Top Schema Management through Customizable Visualization. Eben M. Haber,Yannis E. Ioannidis,Miron Livny 1995 OPOSSUM: Desk-Top Schema Management through Customizable Visualization. VLDB The Oracle Warehouse. Gary Hallmark 1995 The Oracle Warehouse. VLDB Discovery of Multiple-Level Association Rules from Large Databases. Jiawei Han,Yongjian Fu 1995 Discovery of Multiple-Level Association Rules from Large Databases. VLDB Coloring Away Communication in Parallel Query Optimization. Waqar Hasan,Rajeev Motwani 1995 Coloring Away Communication in Parallel Query Optimization. VLDB Generalized Search Trees for Database Systems. Joseph M. Hellerstein,Jeffrey F. Naughton,Avi Pfeffer 1995 Generalized Search Trees for Database Systems. VLDB Benchmarking Spatial Join Operations with Spatial Output. Erik G. Hoel,Hanan Samet 1995 Benchmarking Spatial Join Operations with Spatial Output. VLDB The ClustRa Telecom Database: High Availability, High Throughput, and Real-Time Response. 
Svein-Olaf Hvasshovd,Øystein Torbjørnsen,Svein Erik Bratsberg,Per Holager 1995 The ClustRa Telecom Database: High Availability, High Throughput, and Real-Time Response. VLDB Flexible Relations - Operational Support of Variant Relational Structures. Christian Kalus,Peter Dadam 1995 Flexible Relations - Operational Support of Variant Relational Structures. VLDB W3QS: A Query System for the World-Wide Web. David Konopnicki,Oded Shmueli 1995 W3QS: A Query System for the World-Wide Web. VLDB High-Concurrency Locking in R-Trees. Marcel Kornacker,Douglas Banks 1995 High-Concurrency Locking in R-Trees. VLDB The Double Life of the Transaction Abstraction: Fundamental Principle and Evolving System Concept. Henry F. Korth 1995 The Double Life of the Transaction Abstraction: Fundamental Principle and Evolving System Concept. VLDB Efficient Search of Multi-Dimensional B-Trees. Harry Leslie,Rohit Jain,Dave Birdsall,Hedieh Yaghmai 1995 Efficient Search of Multi-Dimensional B-Trees. VLDB DB2 Common Server: Technology, Progress, & Directions. Bruce G. Lindsay 1995 DB2 Common Server: Technology, Progress, & Directions. VLDB Redo Recovery after System Crashes. David B. Lomet,Mark R. Tuttle 1995 Redo Recovery after System Crashes. VLDB NeuroRule: A Connectionist Approach to Data Mining. Hongjun Lu,Rudy Setiono,Huan Liu 1995 NeuroRule: A Connectionist Approach to Data Mining. VLDB The Fittest Survives: An Adaptive Approach to Query Optimization. Hongjun Lu,Kian-Lee Tan,Son Dao 1995 The Fittest Survives: An Adaptive Approach to Query Optimization. VLDB From VLDB to VMLDB (Very MANY Large Data Bases): Dealing with Large-Scale Semantic Heterogeneity. Stuart E. Madnick 1995 From VLDB to VMLDB (Very MANY Large Data Bases): Dealing with Large-Scale Semantic Heterogeneity. VLDB Managing Intra-operator Parallelism in Parallel Database Systems. Manish Mehta,David J. DeWitt 1995 Managing Intra-operator Parallelism in Parallel Database Systems. VLDB "Providing Database Migration Tools - A Practitioner's Approach." Andreas Meier 1995 "Providing Database Migration Tools - A Practitioner's Approach." VLDB A Scalable Architecture for Autonomous Heterogeneous Database Interactions. Steven Milliner,Athman Bouguettaya,Mike P. Papazoglou 1995 A Scalable Architecture for Autonomous Heterogeneous Database Interactions. VLDB Hot Block Clustering for Disk Arrays with Dynamic Striping. Kazuhiko Mogi,Masaru Kitsuregawa 1995 Hot Block Clustering for Disk Arrays with Dynamic Striping. VLDB L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS. Frank Moser,Achim Kraiss,Wolfgang Klas 1995 L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS. VLDB Accessing a Relational Database through an Object-Oriented Database Interface. Jack A. Orenstein,D. N. Kamber 1995 Accessing a Relational Database through an Object-Oriented Database Interface. VLDB Dynamic Multi-Resource Load Balancing in Parallel Database Systems. Erhard Rahm,Robert Marek 1995 Dynamic Multi-Resource Load Balancing in Parallel Database Systems. VLDB Scientific Journals: Extinction or Explosion? (Panel). Raghu Ramakrishnan,Hector Garcia-Molina,Gerhard Rossbach,Abraham Silberschatz,Gio Wiederhold,Jaco Zijlstra 1995 Scientific Journals: Extinction or Explosion? (Panel). VLDB Databases and Workflow Management: What is it All About? (Panel). Andreas Reuter,Stefano Ceri,Jim Gray,Betty Salzberg,Gerhard Weikum 1995 Databases and Workflow Management: What is it All About? (Panel).
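A note on the gGlOSS entry above (Gravano and Garcia-Molina): the abstract describes ranking text databases for a query using only compact per-database statistics. The Python sketch below illustrates that general idea only; the per-term statistics kept and the scoring rule are simplifications invented for illustration, not the paper's actual estimators.

# Minimal sketch of GlOSS-style database selection: rank databases for a
# query using only a compact per-database summary of term statistics.
# The summary contents and the scoring rule are illustrative, not gGlOSS's
# exact estimators.

from typing import Dict, List, Tuple

# Per-database summary: term -> (document frequency, sum of term weights).
DbStats = Dict[str, Tuple[int, float]]


def score_database(stats: DbStats, query: List[str]) -> float:
    """Estimate how useful a database is for the query from its summary."""
    score = 0.0
    for term in query:
        df, weight_sum = stats.get(term, (0, 0.0))
        # A fuller estimator would also use df (how many documents contain
        # the term); this sketch just sums the stored term weights.
        if df:
            score += weight_sum
    return score


def rank_databases(summaries: Dict[str, DbStats], query: List[str]) -> List[Tuple[str, float]]:
    """Return databases sorted by estimated usefulness, best first."""
    scored = [(name, score_database(stats, query)) for name, stats in summaries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    summaries = {
        "db_medicine": {"cancer": (120, 310.5), "therapy": (80, 150.2)},
        "db_computing": {"database": (400, 900.0), "therapy": (3, 2.1)},
    }
    print(rank_databases(summaries, ["cancer", "therapy"]))

In the broker hierarchy the abstract mentions, a higher-level broker would keep similarly compact summaries of the brokers below it, which is what keeps the top level small enough to replicate widely.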
VLDB Towards a Cooperative Transaction Model - The Cooperative Activity Model. Marek Rusinkiewicz,Wolfgang Klas,Thomas Tesch,Jürgen Wäsch,Peter Muth 1995 Towards a Cooperative Transaction Model - The Cooperative Activity Model. VLDB Query Processing in Tertiary Memory Databases. Sunita Sarawagi 1995 Query Processing in Tertiary Memory Databases. VLDB An Efficient Algorithm for Mining Association Rules in Large Databases. Ashok Savasere,Edward Omiecinski,Shamkant B. Navathe 1995 An Efficient Algorithm for Mining Association Rules in Large Databases. VLDB Metrics for Accessing Heterogeneous Data: Is There Any Hope? (Panel). Leonard J. Seligman,Nicholas J. Belkin,Erich J. Neuhold,Michael Stonebraker,Gio Wiederhold 1995 Metrics for Accessing Heterogeneous Data: Is There Any Hope? (Panel). VLDB Promises and Realities of Active Database Systems. Eric Simon,Angelika Kotz Dittrich 1995 Promises and Realities of Active Database Systems. VLDB Similarity based Retrieval of Pictures Using Indices on Spatial Relationships. A. Prasad Sistla,Clement T. Yu,Chengwen Liu,King-Lup Liu 1995 Similarity based Retrieval of Pictures Using Indices on Spatial Relationships. VLDB Informix-Online XPS: A Dynamically Scalable RDBMS for Open Parallel Platforms. Hannes Spintzik 1995 Informix-Online XPS: A Dynamically Scalable RDBMS for Open Parallel Platforms. VLDB Mining Generalized Association Rules. Ramakrishnan Srikant,Rakesh Agrawal 1995 Mining Generalized Association Rules. VLDB Bypassing Joins in Disjunctive Queries. Michael Steinbrunn,Klaus Peithner,Guido Moerkotte,Alfons Kemper 1995 Bypassing Joins in Disjunctive Queries. VLDB Procedures in Object-Oriented Query Languages. Kazimierz Subieta,Yahiko Kambayashi,Jacek Leszczylowski 1995 Procedures in Object-Oriented Query Languages. VLDB A Product Specification Database for Visual Prototyping. Kazutoshi Sumiya,Kouichi Yasutake,Hirohiko Tanaka,Norio Sanada,Yoshihiko Imai 1995 A Product Specification Database for Visual Prototyping. VLDB Type Classification of Semi-Structured Documents. Markus Tresch,Neal Palmer,Allen Luniewski 1995 Type Classification of Semi-Structured Documents. VLDB Very Large Databases: How Large, How Different? David Vaskevitch 1995 Very Large Databases: How Large, How Different? VLDB DB2 Query Parallelism: Staging and Implementation. Yun Wang 1995 DB2 Query Parallelism: Staging and Implementation. VLDB OODB Bulk Loading Revisited: The Partitioned-List Approach. Janet L. Wiener,Jeffrey F. Naughton 1995 OODB Bulk Loading Revisited: The Partitioned-List Approach. VLDB A Performance Study of Workfile Disk Management for Concurrent Mergesorts in a Multiprocessor Database System. Kun-Lung Wu,Philip S. Yu,Jen-Yao Chung,James Z. Teng 1995 A Performance Study of Workfile Disk Management for Concurrent Mergesorts in a Multiprocessor Database System. VLDB Duplicate Removal in Information System Dissemination. Tak W. Yan,Hector Garcia-Molina 1995 Duplicate Removal in Information System Dissemination. VLDB Eager Aggregation and Lazy Aggregation. Weipeng P. Yan,Per-Åke Larson 1995 Eager Aggregation and Lazy Aggregation. SIGMOD Record "Response to ""A Close Look at the IFO Data Model""." Serge Abiteboul,Richard Hull 1995 "Response to ""A Close Look at the IFO Data Model""." SIGMOD Record "ACM Multimedia '94 Conference Workshop on Multimedia Database Management Systems." P. Bruce Berra,Kingsley C. Nwosu,Bhavani M. 
Thuraisingham 1995 "This paper describes the ACM Multimedia '94 Conference Workshop on Multimedia Database Management Systems held on 21 October 1994 in San Francisco, California. The workshop consisted of four sessions: designing multimedia database management systems, video and continuous media service, multimedia storage and retrieval management, and miscellaneous topics in multimedia data management. The workshop concluded with a discussion session on directions for multimedia database management. Twenty-eight participants from U.S.A., U.K., Germany, Norway, and Egypt attended the workshop." SIGMOD Record Digital Library Services in Mobile Computing. Bharat K. Bhargava,Melliyal Annamalai,Evaggelia Pitoura 1995 Digital libraries bring about the integration, management, and communication of gigabytes of multimedia data in a distributed environment. Digital library systems currently envision users as being static when they access information. But it is expected in the near future that tens of millions of users will have access to a digital library through wireless access. Providing digital library services to users whose location is constantly changing, whose network connections are through a wireless medium, and whose computing power is low necessitates modifications to existing digital library systems. In this paper, we identify the issues that arise when users are mobile, classify queries that are specific to mobile users and introduce an architecture that supports flexible and transparent access to digital libraries for mobile users. The main features of the architecture include a layered data representation, support of adaptability, dual broadcast and on demand querying, caching, and mobile-specific user interfaces. SIGMOD Record Trade Press News. Rafael Alonso 1995 Trade Press News. SIGMOD Record Trade Press News. Rafael Alonso 1995 Trade Press News. SIGMOD Record Managing Video Data in a Mobile Environment. Rafael Alonso,Yuh-Lin Chang,Liviu Iftode,V. S. Mani 1995 "Two key technological trends of the last few years have been the emergence of handheld computational elements and the implementation of practical wireless communication networks. These two changes have made mobile computer systems feasible. While there has been much research interest devoted to mobile computer issues, such systems have not yet been commercially successful. This has been ascribed to the lack of a killer mobile app. We believe that the support of video on mobile systems will indeed make possible many new interesting applications. However, providing mobile video is a non-trivial task, and much work needs to be done before practical systems are widely available. In this short note we address the issue of mobile multimedia from a practitioner's perspective. We note what software and hardware are currently available in the market in support of mobile multimedia, and point out some of their deficiencies. We also discuss some of the communication and data management research issues that need to be tackled in order to address said deficiencies. Exploring these research issues is the focus of our project." SIGMOD Record Temporal Database System Implementations. Michael H. Böhlen 1995 Although research on temporal database systems has been active for about 20 years, implementations have not appeared until recently. This is one reason why current commercial database systems provide only limited temporal functionality. This paper summarizes extant state of the art of temporal database implementations. 
Rather than being very specific about each system we have attempted to provide an indication of the functionality together with pointers to additional information. It is hoped that this leads to more efforts pushing the implementation of temporal database systems in the near future. SIGMOD Record An Annotated Bibliography of Benchmarks for Object Databases. Akmal B. Chaudhri 1995 "This annotated bibliography presents a collection of published papers, technical reports, Master's and PhD Theses that have investigated various aspects of object database performance." SIGMOD Record The Third Manifesto. Hugh Darwen,C. J. Date 1995 "We present a manifesto for the future direction of data and database management systems. The manifesto consists of a series of prescriptions, proscriptions, and ""very strong suggestions.""" SIGMOD Record Design and User Testing of a Multi-Paradigm Query Interface to an Object-Oriented Database. Dac Khoa Doan,Norman W. Paton,Alistair C. Kilgour 1995 This paper reports on experience obtained during the design, implementation and use of a multi-paradigm query interface to an object-oriented database. The specific system which has been developed allows equivalent data retrieval tasks to be expressed using textual, form-based and graph-based notations, and supports automatic translation of queries between these three paradigms. The motivation behind the development of such an interface is presented, as is the software architecture which supports the multi-paradigm functionality. Feedback from initial user trials with a dual-paradigm version of the system indicates that users can use it to perform complex query tasks without difficulty, that given the choice users overwhelmingly prefer the graph- based to the text-based interaction style, and that graphical visualisation of textual queries appears to aid users in query construction. SIGMOD Record Implementation Aspects of an Object-Oriented DBMS. Asuman Dogac,Mehmet Altinel,Cetin Ozkan,Ilker Durusoy 1995 This paper describes the design and implementation of an OODBMS, namely the METU Object-Oriented DBMS (MOOD). MOOD [Dog 94b] is developed on the Exodus Storage Manager (ESM) [ESM 92] and therefore some of the kernel functions like storage management, concurrency control, backup and recovery of data were readily available through ESM. In addition ESM has a client-server architecture and each MOOD process is a client application in ESM. The kernel functions provided by MOOD are the optimization and interpretation of SQL statements, dynamic linking of functions, and catalog management. SQL statements are interpreted whereas functions (which have been previously compiled with C++) within SQL statements are dynamically linked and executed. A query optimizer is implemented by using the Volcano Query Optimizer Generator. A graphical user interface, namely Mood-View [Arp 93a, Arp 93b], is developed using Motif. MoodView displays both the schema information and the query results graphically. Additionally it is possible to update the database schema and to traverse the references in query results graphically.The system is coded in GNU C++ on Sun Sparc 2 workstations. MOOD has a SQL-like object-oriented query language, namely MOODSQL [Ozk 93b, Dog 94c]. MOOD type system is derived from C++, thus eliminating the impedance mismatch between MOOD and C++. The users can also access the MOOD Kernel from their application programs written in C++. 
For this purpose MOOD Kernel defines a class named UserRequest that contains a method for the execution of MOODSQL statements. The MOOD source code is available both for anonymous ftp users from ftp.cs.wisc.edu and for the WWW users from the site http://www.srdc.metu.edu.tr along with its related documents. In MOOD, each object is given a unique Object Identifier (OID) at object creation time by the ESM which is the disk start address of the object returned by the ESM. Object encapsulation is considered in two parts, method encapsulation and attribute encapsulation. These encapsulation properties are similar to the public and private declarations of C++. Methods can be defined in C++ by users to manipulate user defined classes and after compilation, they are dynamically linked and executed during the interpretation of SQL statements. This late binding facility is essential since database environments enforce run-time modification of schema and objects. With our approach, the interpretation of functions is avoided, thus increasing the efficiency of the system. Dynamic linking primitives are implemented by the use of the shared object facility of SunOS [Sun 90]. Overloading is realized by making use of the signature concept of C++. SIGMOD Record METU Interoperable Database System. Asuman Dogac,Cevdet Dengi,Ebru Kilic,Gökhan Özhan,Fatma Ozcan,Sena Nural,Cem Evrendilek,Ugur Halici,Ismailcem Budak Arpinar,Pinar Koksal,N. Kesim,Sema Mancuhan 1995 "METU INteroperable Database System (MIND) is a multidatabase system that aims at achieving interoperability among heterogeneous, federated DBMSs. The MIND architecture is based on the OMG distributed object management model. It is implemented on top of a CORBA compliant ORB, namely, ObjectBroker. MIND provides users with a single ODMG-93 compliant common data model, and a single global query language based on SQL. This makes it possible to incorporate both relational and object oriented databases into the system. Currently Oracle 7, Sybase and METU OODBMS (MOOD) have been incorporated into MIND. The main components of MIND are a global query processor, a global transaction manager, a schema integrator, interfaces to supported database systems and a user graphical interface. In MIND all local databases are encapsulated in a generic database object with a well defined single interface. This approach hides the differences between local databases from the rest of the system. The integration of export schemas is currently performed manually by using an object definition language (ODL) which is based on OMG's interface definition language. The DBA builds the integrated schema as a view over export schemas. The functionalities of ODL allow selection and restructuring of schema elements from existing local schemas. The MIND global query optimizer aims at maximizing the parallel execution of the intersite joins of the global subqueries. Through the MIND global transaction manager, the serializable execution of the global transactions is provided." SIGMOD Record Data and Knowledge Base Research at Hong Kong University of Science and Technology. Pamela Drew,Babak Hamidzadeh,Kamalakar Karlapalem,Alex Chia-Yee Kean,Dik Lun Lee,Qing Li,Frederick H. Lochovsky,Chung-Dak Shum,Beat Wüthrich 1995 "The National Technical University of Athens (NTUA) is the leading Technical University in Greece.
The Computer Science Division of the Electrical and Computer Engineering Department covers several fields of practical, theoretical and technical computer science and is involved in several research projects supported by the EEC, the government and industrial companies. The Knowledge and Data Base Systems (KDBS) Laboratory was established in 1992 at the National Technical University of Athens. It is recognised internationally, evidenced by its participation as a central node in the Esprit Network of Excellence IDOMENEUS. The Information and Data on Open MEdia for NEtworks of USers, project aims to coordinate and improve European efforts in the development of next-generation information environments which will be capable of maintaining and communicating a largely extended class of information in an open set of media. The KDBS Laboratory employs one full-time research engineer and several graduate students. Its infrastructure includes a LAN with several DECstation 5000/200 and 5000/240 workstations, an HP Multimedia Workstation, several PCs and software for database and multimedia applications. The basic research interests of our Laboratory include: Spatial Database Systems, Multimedia Database Systems and Active Database Systems. Apart from the above database areas, interests of the KDBS Laboratory span several areas of Information Systems, such as Software Engineering Databases, Transactional Systems, Image Databases, Conceptual Modeling, Information System Development, Temporal Databases, Advanced Query Processing and Optimization Techniques. The group's efforts on Spatial Database Systems, include the study of new data structures, storage techniques, retrieval mechanisms and user interfaces for large geographic data bases. In particular, we look at specialized, spatial data structures (R-Trees and their variations) which allow for the direct access of the data based on their spatial properties, and not some sort of encoded representation of the objects' coordinates. We study implementation and optimization techniques of spatial data structures and develop models that make performance estimation. Finally, we are investigating techniques for the efficient representation of relationships and reasoning in space. The activities on Multimedia Database Systems, include the study of advanced data models, storage techniques, retrieval mechanisms and user interfaces for large multimedia data bases. The data models under study include the object-oriented model and the relational model with appropriate extensions to support multimedia data. We are also investigating content-based search techniques for image data bases. In a different direction, we are studying issues involved in the development of multimedia front-ends for conventional, relational data base systems. In the area of Active Database Systems, we are developing new mechanisms for implementing triggers in relational databases. Among the issues involved, we address the problem of efficiently finding qualifying rules against updates in large sets of triggers. This problem is especially critical in database system implementations of triggers, where large amounts of data may have to be searched in order to find out if a particular trigger may qualify to run or not. Continuing work that started at the Foundation for Research and Technology (FORTH), Institute of Computer Science, the group is investigating reuse-oriented approaches to information systems application development. 
The approaches are based on a repository that has been implemented at FORTH as a special purpose object store, with emphasis on multimodal and fast retrieval. Issues of relating and describing software artifacts (designs, code, etc.) are among the topics under investigation. A new important research direction of the group is on Data Warehouses, which are seen as collections of materialized views captured over a period of time from a heterogeneous distributed information system. Issues such as consistent updates, data warehouse evolution, view reconciliation and data quality are being investigated. Research in Image Databases deals with the retrieval by image content, that uses techniques from the area of Image Processing. We are currently at early stage in this direction, having collected many segmentation and edge detection algorithms, which will be used and evaluated in images of various contents. Our work on Advanced Query Processing and Optimization Techniques includes dynamic or parametric query optimization techniques. In most database systems, the values of many important runtime parameters of the system, the data, or the query are unknown at query optimization time. Dynamic, or parametric, query optimization attempts to identify several execution plans, each one of which is optimal for a subset of all possible values of the run time parameters. In the next sections we present in detail our research efforts on the three main research areas of the KDBS Laboratory: Spatial, Multimedia and Active Databases." SIGMOD Record From the Guest Editors - Special Section on Data Management Issues in Mobile Computing. Margaret H. Dunham,Abdelsalam Helal 1995 From the Guest Editors - Special Section on Data Management Issues in Mobile Computing. SIGMOD Record Mobile Computing and Databases: Anything New? Margaret H. Dunham,Abdelsalam Helal 1995 Mobile Computing and Databases: Anything New? SIGMOD Record The New Middleware. Rich Finkelstein 1995 USING MIDDLEWARE, CUSTOMERS CAN DEPLOY COST-EFFECTIVE AND HIGHLY FUNCTIONAL CLIENT/SERVER APPLICATIONS — ONCE THEY WORK OUT THE KINKS. SIGMOD Record Mapping Extended Entity Relationship Model to Object Modeling Technique. Joseph Fong 1995 "A methodology of reengineering existing extended Entity-Relationship(EER) model to Object Modeling Technique(OMT) model is described. A set of translation rules from EER model to a generic Object-Oriented(OO) model of OMT methodology is devised. Such reengineering practices not only can provide us with significant insight to the ""interoperability"" between the OO and the traditional semantic modelling techniques, but also can lead us to the development of a practical design methodology for object-oriented databases(OODB)." SIGMOD Record Wireless Client/Server Computing for Personal Information Services and Applications. Ahmed K. Elmagarmid,Jin Jing,Tetsuya Furukawa 1995 We are witnessing a profound change in the global information infrastructure that has the potential to fundamentally impact many facets of our life. An important aspect of the evolving infrastructure is the seamless, ubiquitous wireless connectivity which engenders continuous interactions between people and interconnected computers. A challenging area of future ubiquitous wireless computing is the area of providing mobile users with integrated Personal Information Services and Applications (PISA). In this paper, a wireless client/server computing architecture will be discussed for the delivery of PISA. 
Data management issues such as transactional services and cache consistency will be examined under this architecture. SIGMOD Record Parallelism and its Price: A Case Study of NonStop SQL/MP. Susanne Englert,Ray Glasstone,Waqar Hasan 1995 We describe the use of parallel execution techniques and measure the price of parallel execution in NonStop SQL/MP, a commercial parallel database system from Tandem Computers. Non-Stop SQL uses intra-operator parallelism to parallelize joins, groupings and scans. Parallel execution consists of starting up several processes and communicating data between them. Our measurements show (a) Startup costs are negligible when processes are reused rather than created afresh (b) Communication costs are significant — they may exceed the costs of operators such as scan, grouping or join. We also show two counter-examples to the common intuition that parallel execution reduces response time at the expense of increased work — parallel execution may reduce work or may increase response time depending on communication costs. SIGMOD Record Addressing Techniques Used in Database Object Managers O2 and Orion. André Gamache,Nadjiba Sahraoui 1995 Addressing mechanisms used by the new generation of Data Base Management Systems (DBMS) differ significantly from traditional ones. Such changes are the direct result of new application requirements such as office information systems (OIS) and computer aided design (CAD). In this context, object format requires different representations on disk and in main memory, and this is often valid for interobject references. It is evident that these mechanisms are closely linked to the mode of object-identity implementation, as well as clustering strategies. All of these functions are controlled by the object manager. This article describes these mechanisms through the implementation of two object managers for object-oriented DBMS: O2 and ORION. We show how the performance of these systems depends on their memory management and addressing scheme. The two managers to be discussed merge techniques proposed by both the data base field and the object-oriented programming field. Their mechanisms differ according to the way each handles distribution. ORION-1SX and O2 have a Client/Server architecture, but each one uses a different approach for distributing functionalities. ORION-1SX implements an object-server, whereas O2 uses a page-server approach. An analysis of the two systems shows that they both use a two-level addressing mechanism. Buffer management for objects in memory is different and more complex in ORION. On the other hand, the clustering strategies in O2 have the advantage of being more dynamic and can be specified outside the schema. SIGMOD Record An Annotated Bibliography on Active Databases. Ulrike Jaeger,Johann Christoph Freytag 1995 An Annotated Bibliography on Active Databases. SIGMOD Record Implementing Deletion in B+-Trees. Jan Jannink 1995 This paper describes algorithms for key deletion in B+-trees. There are published algorithms and pseudocode for searching and inserting keys, but deletion, due to its greater complexity and perceived lesser importance, is glossed over completely or left as an exercise to the reader. To remedy this situation, we provide a well documented flowchart, algorithm, and pseudo-code for deletion, their relation to search and insertion algorithms, and a reference to a freely available, complete B+-tree library written in the C programming language.
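A note on Jannink's entry above: the hard part of B+-tree deletion is rebalancing an underfull node. The Python sketch below shows only the leaf-level decision (borrow a key from a sibling that has spares, otherwise merge with it) over a deliberately simplified one-level structure; interior-node underflow and the real node layout, which the paper and its C library handle, are omitted, and all names are invented for the example.

# Simplified sketch of the leaf-level rebalancing B+-tree deletion needs:
# after removing a key, an underfull leaf first tries to borrow from an
# adjacent sibling and otherwise merges with it. The "tree" is a single
# parent over sorted leaves, so underflow never propagates upward; a real
# implementation must also rebalance interior nodes.

MIN_KEYS = 2  # minimum number of keys a leaf may hold (order-dependent)


class OneLevelTree:
    def __init__(self, leaves):
        # leaves: list of sorted key lists; separators[i] routes searches
        # and starts out as the smallest key of leaves[i + 1].
        self.leaves = [sorted(l) for l in leaves]
        self.separators = [l[0] for l in self.leaves[1:]]

    def _leaf_index(self, key):
        i = 0
        while i < len(self.separators) and key >= self.separators[i]:
            i += 1
        return i

    def delete(self, key):
        i = self._leaf_index(key)
        leaf = self.leaves[i]
        if key not in leaf:
            return False
        leaf.remove(key)
        if len(leaf) < MIN_KEYS and len(self.leaves) > 1:
            self._rebalance(i)
        return True

    def _rebalance(self, i):
        left = self.leaves[i - 1] if i > 0 else None
        right = self.leaves[i + 1] if i + 1 < len(self.leaves) else None
        leaf = self.leaves[i]
        if left is not None and len(left) > MIN_KEYS:
            # Borrow the largest key of the left sibling.
            leaf.insert(0, left.pop())
            self.separators[i - 1] = leaf[0]
        elif right is not None and len(right) > MIN_KEYS:
            # Borrow the smallest key of the right sibling.
            leaf.append(right.pop(0))
            self.separators[i] = right[0]
        elif left is not None:
            # Merge this leaf into its left sibling.
            left.extend(leaf)
            del self.leaves[i]
            del self.separators[i - 1]
        else:
            # Merge the right sibling into this leaf.
            leaf.extend(right)
            del self.leaves[i + 1]
            del self.separators[i]


if __name__ == "__main__":
    t = OneLevelTree([[1, 3], [5, 7], [9, 11, 13]])
    for k in (5, 7, 3):
        t.delete(k)
    print(t.leaves, t.separators)

A complete implementation repeats the same borrow-or-merge choice at each interior level whenever a merge leaves the parent underfull, which is exactly the part that published treatments tend to gloss over.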
SIGMOD Record Information Systems Research at RWTH Aachen. Matthias Jarke 1995 With about 8,000 researchers and 40,000 students, RWTH Aachen is the largest technical university in Europe. The science and engineering departments and their industrial collaborators offer a lot of challenges for database research. The chair Informatik V (Information Systems) focuses on the theoretical analysis, prototypical development, and practical evaluation of meta information systems. Meta information systems, also called repositories, document and coordinate the distributed processes of producing, integrating, operating, and evolving database-intensive applications. Our research approaches these problems from a technological and from an application perspective. On the one hand, we pursue theory and system aspects of the integration of deductive and object-oriented technologies. One outcome of this work is a deductive object manager called ConceptBase which has been developed over the past eight years and is currently used by many research groups and industrial teams throughout the world. On the other hand, a wide range of application-driven projects aims at building a sound basis of empirical knowledge about the demands on meta information systems, and about the quality of proposed solutions. They address application domains as diverse as requirements engineering, telecommunications, cooperative engineering, organization-wide quality management, evolution of chemical production processes, and medical knowledge management. They share the vision of supporting wide-area distributed cooperation not just by low-level interoperation technology but by exploiting conceptual product and process modeling. Under the direction of M. Jarke, Informatik V comprises three research groups with a total of twenty senior researchers and doctoral students: distributed information systems (leader: Dr. Manfred Jeusfeld), process information systems (Dr. Klaus Pohl), and knowledge-based systems (Prof. Wolfgang Nejdl). Database-related activities also exist in the Software Engineering and Applied Mathematics groups. SIGMOD Record A Close Look at the IFO Data Model. Magdy S. Hanna 1995 The IFO data model was proposed by Abiteboul and Hull [Abiteboul 87] as a formalized semantic database model. It has been claimed by the authors that the model subsumes the Relational model [Codd 70], the Entity-Relationship model [Chen 76], the Functional Data Model [Kerschberg 76] and virtually all of the structured aspects of the Semantic Data Model [Hammer 81], the INSYDE Model [King 85], and the Extended Semantic Hierarchy Model [Brodie 84]. This paper examines the IFO data model as presented in [Abiteboul 87], compares it to other models, and thus concludes that the IFO data model is actually a subset of the Semantic Data Model proposed by Hammer in [Hammer 81]. The paper also shows that the IFO data model has failed to support concepts that are essential to both the E-R model and the Semantic Data Model which are claimed to be subsumed by the IFO model. Section 2 discusses the three IFO constructs, objects, fragments, and relationships. The mapping of these constructs to constructs in the Semantic Data Model is established as an informal proof of the result that the IFO model is subsumed by the SDM. Section 3 lists constructs supported by the Entity-Relationship model [Chen 76, Teorey 86] as well as constructs supported by SDM [Hammer 81] that the IFO data model fails to support.
SIGMOD Record HODFA: An Architectural Framework for Homogenizing Heterogeneous Legacy Database. Kamalakar Karlapalem,Qing Li,Chung-Dak Shum 1995 One of the main difficulties in supporting global applications over a number of localized databases and migrating legacy information systems to modern computing environment is to cope with the heterogeneities of these systems. In this paper, we present a novel flexible architecture (called HODFA) to dynamically connect such localized heterogeneous databases in forming a homogenized federated database system and to support the process of transforming a collection of heterogeneous information systems onto a homogeneous environment. We further develop an incremental methodology of homogenization in the context of our HODFA framework, which can facilitate different degrees of homogenization in a stepwise manner, so that existing applications will not be affected during the process of homogenization. SIGMOD Record Multigranularity Locking in Multiple Job Classes Transaction Processing System. Shan-hoi Ng,Sheung-lun Hung 1995 The conditions of when to apply fine and coarse granularity to different kinds of transaction are well understood. However, it is not very clear how multiple job classes using different lock granularities affect each other. This study aims at exploring the impact of multigranularity locking on the performance of multiple job classes transaction processing system which is common in multiuser database system. There are two key findings in the study. Firstly, lock granularity adopted by identical job classes should not differ from each other by a factor of more than 20; otherwise, serious data contention may result. Secondly, short job class transactions are generally benefited when its level of granularity is similar to that of the long job class since this will reduce the additional lock overhead and data contention which are induced by multigranularity locking. SIGMOD Record Why Decision Support Fails and How To Fix It. Ralph Kimball,Kevin Strehlo 1995 Why Decision Support Fails and How To Fix It. SIGMOD Record On the Issue of Valid Time(s) in Temporal Databases. Stavros Kokkotos,Efstathios V. Ioannidis,Themis Panayiotopoulos,Constantine D. Spyropoulos 1995 Recent research activities in the area of Temporal Databases have revealed some problems related to the definition of time. In this paper we discuss the problem arising from the definition of valid time and the assumptions about valid time, which exist in current Temporal Database approaches. For this problem we propose a solution, while we identify some consistency problems that may appear in Temporal Databases, and which require further investigation. SIGMOD Record Normalization in OODB Design. Byung Suk Lee 1995 When we design an object-oriented database schema, we need to normalize object classes as we do for relations when designing a relational database schema. However, the normalization process for an object class cannot be the same as that of a relation, because of the distinct characteristics of an object-oriented data model such as complex attributes, collection data types, and the usage of object identifiers in place of relational key attributes. We need only one kind of dependency proposed here -- the object functional dependency -- which specifies the dependency of object attributes with respect to the object identifier. We also propose the object normal form of an object class, for which all determinants of object functional dependencies are object identifiers. 
There is no risk of update anomalies as long as all object classes are in the object normal form. SIGMOD Record An Aspect of Query Optimization in Multidatabase Systems (Extended Abstract). Chiang Lee,Chia-Jung Chen,Hongjun Lu 1995 An Aspect of Query Optimization in Multidatabase Systems (Extended Abstract). SIGMOD Record "Optimizing Jan Jannink's Implementation of B+-tree Deletion." R. Maelbrancke,H. Olivie 1995 In this note we propose optimization strategies for the B+-tree deletion algorithm. The optimizations are focused on even order B+-trees and on the reduction of the number of block accesses. SIGMOD Record A Comparison of Three User Interfaces to Relational Microcomputer Data Bases. Carl Medsker,Margaret Christensen,Il-Yeol Song 1995 PAYOFF IDEA. Different styles of user interfaces can dramatically affect data base capabilities. In an environment comprising many different data bases, the goal is to select one data base management system (DBMS) that provides the best selection of design tools, minimizes development times, and enforces relational rules. This article presents a case study performed at the Hospital of the University of Pennsylvania, in which a test data base was developed for implementation with three DBMSs, each with a distinctly different user and programmer interface. SIGMOD Record A Research Status Report on Adaptation for Mobile Data Access. Brian Noble,Mahadev Satyanarayanan 1995 Mobility demands that systems be adaptive. One approach is to make adaptation transparent to applications, allowing them to remain unchanged. An alternative approach views adaptation as a collaborative partnership between applications and the system. This paper is a status report on our research on both fronts. We report on our considerable experience with application-transparent adaptation in the Coda File System. We also describe our ongoing work on application-aware adaptation in Odyssey. SIGMOD Record Multi-Table Joins Through Bitmapped Join Indices. "Patrick E. O'Neil,Goetz Graefe" 1995 This technical note shows how to combine some well-known techniques to create a method that will efficiently execute common multi-table joins. We concentrate on a commonly occurring type of join known as a star-join, although the method presented will generalize to any type of multi-table join. A star-join consists of a central detail table with large cardinality, such as an orders table (where an order row contains a single purchase) with foreign keys that join to descriptive tables, such as customers, products, and (sales) agents. The method presented in this note uses join indices with compressed bitmap representations, which allow predicates restricting columns of descriptive tables to determine an answer set (or foundset) in the central detail table; the method uses different predicates on different descriptive tables in combination to restrict the detail table through compressed bitmap representations of join indices, and easily completes the join of the fully restricted detail table rows back to the descriptive tables. We outline realistic examples where the combination of these techniques yields substantial performance improvements over alternative, more traditional query evaluation plans. SIGMOD Record A Framework for Providing Consistent and Recoverable Agent-Based Access to Heterogeneous Mobile Databases. Evaggelia Pitoura,Bharat K. Bhargava 1995 A Framework for Providing Consistent and Recoverable Agent-Based Access to Heterogeneous Mobile Databases.
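A note on the O'Neil and Graefe entry above: the technique turns predicates on the descriptive tables into bitmaps over the central detail table, via join indices, and intersects them into a foundset. The Python sketch below is a rough illustration of that flow only; plain Python integers stand in for the compressed bitmaps the note relies on, and the table and column names are invented for the example.

# Rough sketch of a star-join evaluated through bitmap join indices.
# Each join index maps a descriptive-table key to a bitmap over the rows of
# the central detail table; Python ints play the role of compressed bitmaps.

def build_join_index(detail_rows, fk_column):
    """Map each foreign-key value to a bitmap of detail rows referencing it."""
    index = {}
    for rowid, row in enumerate(detail_rows):
        key = row[fk_column]
        index[key] = index.get(key, 0) | (1 << rowid)
    return index


def bitmap_for_predicate(join_index, qualifying_keys):
    """OR together the bitmaps of all descriptive rows satisfying a predicate."""
    bitmap = 0
    for key in qualifying_keys:
        bitmap |= join_index.get(key, 0)
    return bitmap


def foundset(bitmaps):
    """AND the per-dimension bitmaps into the restricted detail-row set."""
    result = ~0
    for b in bitmaps:
        result &= b
    return result


if __name__ == "__main__":
    # Central detail table: one row per order line.
    orders = [
        {"customer": "c1", "product": "p9"},
        {"customer": "c2", "product": "p9"},
        {"customer": "c1", "product": "p3"},
    ]
    by_customer = build_join_index(orders, "customer")
    by_product = build_join_index(orders, "product")

    # Predicates on the descriptive tables yield qualifying key sets,
    # e.g. customers in one region and products in one category.
    b1 = bitmap_for_predicate(by_customer, {"c1"})
    b2 = bitmap_for_predicate(by_product, {"p9"})

    rows = foundset([b1, b2])
    print([i for i in range(len(orders)) if rows & (1 << i)])  # -> [0]

Joining the restricted detail rows back to the descriptive tables then touches only the rows that survive the foundset, which is where the performance gain the note reports comes from.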
SIGMOD Record Political Winds Change Direction Again. Xiaolei Qian 1995 Political Winds Change Direction Again. SIGMOD Record Turmoil at NASA, and Numerous Funding Announcements. Xiaolei Qian 1995 "Since the last issue of this column six months ago, there have been many interesting program announcements, some of which have already passed deadline. We'll go over these announcements anyway, with the hope that they can get the readers better prepared for future funding opportunities. But first, we'll talk about the continuing budget battle at Congress, and the recent turmoil at NASA." SIGMOD Record Opportunities at ARPA, NSF, and Elsewhere. Xiaolei Qian 1995 We first report the relatively minor development on the federal budget. We then touch upon announcements from ARPA, NSF, Defense Nuclear Agency, Rome Laboratory, US Special Operations Command, and Office of National Drug Control Policy. We also report a recent ARPA reorganization. SIGMOD Record Condition Handling in SQL Persistent Stored Modules. Jeff Richey 1995 "The national and international standards committees responsible for Database Language SQL have proposed a candidate extension for SQL Persistent Stored Modules (SQL/PSM). The purpose of this extension is to provide a computationally complete language for the declaration and invocation of SQL stored modules and routines. Typically, such routines are stored in a database Server and executed from an application Client in a Client/Server environment. The proposed SQL/PSM consists of syntax and semantics for variable and cursor declarations, function and procedure (routines) invocations, condition handling, and control statements for looping and branching. An SQL routine is block structured, with each block consisting of local variable and condition handler declarations, a list of SQL statements, and local condition handler execution. Condition handling is a major new feature of SQL/PSM (henceforth referred to as PSM), although the style and comprehensiveness of the specification is still an issue in further progression of the standard. The specification currently under ballot includes conditions for exceptions, warnings, and other completions such as success or no data, and handlers for Continue, Exit, Redo, and Undo. Condition handling allows the user to separate condition handling code from the main flow of a routine, thereby eliminating the need to write numerous short and redundant code fragments to handle each unique condition. In some database products, one cannot even resolve the condition in the Server and must instead resort to the Client application program for resolution. Such approaches are often tedious, error-prone, and inflexible. Condition handling in the SQL module avoids these expensive alternatives, instead allowing the procedure to resolve its own conditions and then resume processing. Condition handling allows one to centralize the handling of conditions and gives users control over two major areas: run-time recovery from failures, and effects of conditions on transactions. Run-time recovery from failures has the following characteristics: • allows a user to handle any run-time condition, either by exiting gracefully or by attempting recovery; • provides a recovery mechanism that includes the ability to resolve a condition and then resume action at the statement that caused the condition to be raised (if it was resolved); • provides the ability to define what ""code"" will handle each condition. After a condition has been resolved, what is the state of the transaction?
Condition handling must ensure that the SQL-data, schemas, and SQL-variables are all maintained in an appropriate stable state and can be committed or rolled back. Additionally, the transaction must comply with the ACID test rules. Thus, the benefits of condition handling include: • allows reduction of error recovery code; • creates a model for trapping and resolving conditions; • provides the ability to resolve the condition and, if possible, to continue on; • avoids the cost of requiring the SQL-client to resolve the condition; • provides for greater data and path consistency in handling conditions; • separates one condition from another" SIGMOD Record Florida International University High Performance Database Research Center. Naphtali Rishe,Wei Sun,David Barton,Yi Deng,Cyril U. Orji,Michael Alexopoulos,Leonard Loureiro,Carlos Ordonez,Mario Sanchez,Artyom Shaposhnikov 1995 Florida International University High Performance Database Research Center. SIGMOD Record Data Management Research at The MITRE Corporation. Arnon Rosenthal,Leonard J. Seligman,Catherine D. McCollum,Barbara T. Blaustein,Bhavani M. Thuraisingham,Edward Lafferty 1995 "The MITRE Corporation provides technical assistance, system engineering, and acquisition support to large organizations, especially U.S. Government agencies. We help our customers to plan complex systems based on emerging technologies, and to implement systems based on commercial-off-the-shelf products. In MITRE's research program, instead of emphasizing concerns of DBMS or CASE vendors, our research emphasizes the issues of organizations that need to use such products. For example, we favor areas where we can build over commercial products, rather than changing their internals. Data management at MITRE goes beyond research, to include technology transition, system engineering, product evaluation, prototypes, tutorials, advice on customers' strategic directions, and participation in standards efforts. We use prototyping to illustrate potential improvements in customer systems, to understand vendors' capabilities, or both. There are close connections with efforts in object management, real-time systems, reengineering, artificial intelligence, and security. This paper emphasizes the research efforts, grouped into five major themes: information integration, security and privacy, active and responsive systems, metrics, and digital libraries. For each theme, we list the major questions being explored, and identify projects and contacts for further information." SIGMOD Record The Database Group at University of Hagen (FernUniversitaet). Gunter Schlageter,Thomas Berkel,Eberhard Heuel,Silke Mittrach,Andreas Scherer,Wolfgang Wilkes 1995 The Database Group at University of Hagen (FernUniversitaet). SIGMOD Record "Editor's (Farewell) Notes." Arie Segev 1995 "Editor's (Farewell) Notes." SIGMOD Record Report on The 1995 International Workshop on Temporal Databases. Arie Segev,Christian S. Jensen,Richard T. Snodgrass 1995 This paper provides an overview of the 1995 International Workshop on Temporal Databases. It summarizes the technical papers and related discussions, and three panels: “Wither TSQL3?”, “Temporal Data Management in Financial Applications,” and “Temporal Data Management Infrastructure & Beyond.” SIGMOD Record The Database Group at National Technical University of Athens (NTUA). Timos K. Sellis,Yannis Vassiliou 1995 The Database Group at National Technical University of Athens (NTUA). SIGMOD Record Replication: DB2, Oracle, or Sybase?
Doug Stacey 1995 "Is replication salvation or the devil in disguise? Here's what three implementations tell us" SIGMOD Record An Annotated Bibliography on Real-Time Database Systems. Özgür Ulusoy 1995 An Annotated Bibliography on Real-Time Database Systems. SIGMOD Record SQL/CLI - A New Binding Style for SQL. Murali Venkatrao,Michael Pizzo 1995 SQL/CLI - A New Binding Style for SQL. SIGMOD Record "Editor's Notes." Jennifer Widom 1995 "Editor's Notes." SIGMOD Record "Editor's Notes." Jennifer Widom 1995 "Editor's Notes." SIGMOD Record View Maintenance in Mobile Computing. Ouri Wolfson,A. Prasad Sistla,Son Dao,Kailash Narayanan,Ramya Raj 1995 View Maintenance in Mobile Computing. SIGMOD Record "An Introduction to Remy's Fast Polymorphic Record Projection." Limsoon Wong 1995 Traditionally, a record projection is compiled when all fields of the record are known in advance. The need to know all fields in advance leads to very clumsy programs, especially for querying external data sources. In a paper that had not been widely circulated in the database community, Remy presented in programming language context a constant-time implementation of the record projection operation that does not have such a requirement. This paper introduces his technique and suggests an improvement to his technique in the context of database queries. SIGMOD Record Calls for Papers and Announcements. 1995 Calls for Papers and Announcements. SIGMOD Record Calls for Papers and Announcements. 1995 Calls for Papers and Announcements. SIGMOD Record Information Finding in a Digital Library: The Stanford Perspective. Tak W. Yan,Hector Garcia-Molina 1995 In a digital library one of the most challenging problems is finding relevant information. Information finding is the research focus of the Stanford component of the ARPA-sponsored CS-TR Project, and the work has continued as one of the main thrusts in the Stanford Integrated Digital Library project [14]. In this paper we discuss some of the emerging issues in information finding, such as text-database discovery, efficient information dissemination, and copy detection and removal. We also outline our approaches to these issues. SIGMOD Record Application of OODB and SGML Techniques in Text Database: An Electronic Dictionary System. Jian Zhang 1995 "An electronic dictionary system (EDS) is developed with object-oriented database techniques based on ObjectStore. The EDS is composed of two parts: the Database Building Program (DBP), and the Database Querying Program (DQP). DBP reads in a dictionary encoded in SGML tags, and builds a database composed of a collection of trees which holds dictionary entries, and several lists which contain items of various lexical categories. With text exchangeability introduced by the SGML, DBP is able to accommodate dictionaries of different languages with different structures, after easy modification of a configuration file. The tree model, the Category Lists, and an optimization procedure enables DQP to quickly accomplish complicated queries, including context requirements, via simple SQL-like syntax and straightforward search methods. Results show that compared with relational database, DQP enjoys much higher speed and flexibility. With EDS this paper demonstrates how to apply OODBMS's to systems that handle text information with strong yet varied intrinsic hierarchies." ICDE VISUAL: A Graphical Icon-Based Query Language. Nevzat Hurkan Balkir,Eser Sükan,Gultekin Özsoyoglu,Z. Meral Özsoyoglu 1996 VISUAL: A Graphical Icon-Based Query Language. 
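A note on Wong's entry above: it introduces Rémy's constant-time record projection, which does not require all of a record's fields to be known in advance. The Python sketch below conveys only the general flavor of such schemes and is not Rémy's construction: records of one type share a small layout table mapping field labels to slot positions, so projecting a field is a pair of lookups regardless of which other fields happen to be present. All names are invented for the example.

# Illustrative sketch (not Remy's actual scheme): record projection through
# a per-record-type layout table shared by all records of that type.

class Layout:
    def __init__(self, labels):
        # Slot index for every label, fixed once per record type.
        self.slot = {label: i for i, label in enumerate(labels)}


class Record:
    def __init__(self, layout, values):
        self.layout = layout
        self.values = list(values)  # stored in slot order

    def project(self, label):
        # Two lookups: label -> slot, then slot -> value; the caller never
        # needs to know the record's full field set.
        return self.values[self.layout.slot[label]]


if __name__ == "__main__":
    person = Layout(["name", "age", "email"])
    r = Record(person, ["Ada", 36, "ada@example.org"])
    print(r.project("age"))  # -> 36

Rémy's technique achieves its constant bound by a different construction; the SIGMOD Record note is the place to look for the details and for the improvement it suggests in the context of database queries.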
ICDE Tioga-2: A Direct Manipulation Database Visualization Environment. Alexander Aiken,Jolly Chen,Michael Stonebraker,Allison Woodruff 1996 This paper reports on user experience with Tioga, a DBMS-centric visualization tool developed at Berkeley. Based on this experience, we have designed Tioga-2 as a direct manipulation system that is more powerful and much easier to program. A detailed design of the revised system is presented, together with an extensive example of its application. ICDE Advanced Transaction Models in Workflow Contexts. Gustavo Alonso,Divyakant Agrawal,Amr El Abbadi,Mohan Kamath,Roger Günthör,C. Mohan 1996 In recent years, numerous transaction models have been proposed to address the problems posed by advanced database applications, but only a few of these models are being used in commercial products. In this paper, we make the case that such models may be too centered around databases to be useful in real environments. Advanced applications raise a variety of issues that are not addressed at all by transaction models. These same issues, however, are the basis for existing workflow systems, which are having considerable success as commercial products in spite of not having a solid theoretical foundation. We explore some of these issues and show that, in many aspects, workflow models are a superset of transaction models and have the added advantage of incorporating a variety of ideas that to this date have remained outside the scope of traditional transaction processing. ICDE Prefetching from Broadcast Disks. Swarup Acharya,Michael J. Franklin,Stanley B. Zdonik 1996 "Broadcast Disks have been proposed as a means to efficiently deliver data to clients in 'asymmetric' environments where the available bandwidth from the server to the clients greatly exceeds the bandwidth in the opposite direction. A previous study investigated the use of cost-based caching to improve performance when clients access the broadcast in a demand-driven manner [achas 95]. Such demand-driven access, however, does not fully exploit the dissemination-based nature of the broadcast, which is particularly conducive to client prefetching. With a Broadcast Disk, pages continually flow past the clients so that, in contrast to traditional environments, prefetching can be performed without placing additional load on shared resources. We argue for the use of a simple prefetch heuristic called PT and show that PT balances the cache residency time of a data item with its bandwidth allocation. Because of this tradeoff, PT is very tolerant of variations in the broadcast program. We describe an implementable approximation for PT and examine its sensitivity to access probability estimation errors. The results show that the technique is effective even when the probability estimation is substantially different from the actual values." ICDE A Proposed Method for Creating VCR Functions using MPEG Streams. David B. Andersen 1996 The development of video-on-demand (VOD) systems for movie delivery requires that the user be able to perform VCR functions over a broadband network system. These functions include Play, Pause, Fast Forward, and Fast Rewind. No standard method exists between content developers, server manufacturers and client applications to provide these functions. This paper proposes a standard method for implementing these functions using MPEG streams and discusses some of the important tradeoffs.
The encoding and distribution of content has become one of the most important issues facing video information providers. Today, in the case of movies, every service provider must encode the material for the specific equipment being deployed in the network. Therefore, the ease of use and speed of the algorithms employed to encode the material are extremely important. In the future, the creator of the content may encode the material once and distribute it to the service providers in compressed form, but this is not the case today due to the lack of standards. ICDE Dynamic Optimization of Index Scans Restricted by Booleans. Gennady Antoshenkov 1996 Dynamic Optimization of Index Scans Restricted by Booleans. ICDE Order Preserving Compression. Gennady Antoshenkov,David B. Lomet,James Murray 1996 "Order-preserving compression can improve sorting and searching performance, and hence the performance of database systems. We describe a new parsing (tokenization) technique that can be applied to variable-length ""keys"", producing substantial compression. It can both compress and decompress data, permitting variable lengths for dictionary entries and compressed forms. The key notion is to partition the space of strings into ranges, encoding the common prefix of each range. We illustrate our method with padding character compression for multi-field keys, demonstrating the dramatic gains possible. A specific version of the method has been implemented in Digital's Rdb relational database system to enable effective multi-field compression." ICDE The Gold Text Indexing Engine. Daniel Barbará,Sharad Mehrotra,Padmavathi Vallabhaneni 1996 The proliferation of electronic communication including computer mail, faxes, voice mail, and net news has led to a variety of disjoint applications and usage paradigms that force users to deal with multiple different user interfaces and access related information arriving over the different communication media separately. To enable users to cope with the overload of information arriving over heterogeneous communication media, we have developed the Gold document handling system that allows users to access all of these forms of communication at once, or to intermix them. The Gold system provides users with an integrated way to send and receive messages using different media, efficiently store the messages, retrieve the messages based on their contents, and to access a variety of other sources of useful information. At the center of the Gold document handling system is the Gold Text Indexing Engine (GTIE) that provides a full text index over the documents. The paper describes our implementation of GTIE and the concurrency control protocols to ensure consistency of the index in the presence of concurrent operations. ICDE OLE DB: A Component DBMS Architecture. José A. Blakeley 1996 "The article describes an effort at Microsoft whose primary goal is to enable applications to have uniform access to data stored in diverse DBMS and non-DBMS information containers. Applications continue to take advantage of the benefits of database technology such as declarative queries, transactional access, and security without having to transfer data from its place of origin to a DBMS. Our approach consists of defining an open, extensible collection of interfaces that factor and encapsulate orthogonal, independently reusable portions of DBMS functionality.
These interfaces define the boundaries of DBMS components such as record containers and query processors that enable uniform, transactional access to data among such components. The proposed interfaces extend Microsoft's OLE Component Object Model (COM) with database functionality, hence these interfaces are collectively referred to as OLE DB. The OLE DB functional areas include data access and updates (rowsets), query processing, catalog information, notifications, transactions, security, and distribution. The article presents an overview of the OLE DB approach and its areas of componentization." ICDE ODMG Update. Dirk Bartels 1996 The Object Database Management Group (ODMG) is a consortium of the leading Object Database (ODBMS) vendors. The consortium was formed in 1992 with the objective to define a standard for the emerging ODBMS industry. Within 18 months, the first release of the standard, the so-called ODMG-93, was published in October 1993. The following abstract gives a comprehensive overview of the standard and the extensions that have been made since the initial publication. The overview includes the ODMG Object Model, the ODMG Object Definition Language (ODL), the ODMG Object Query Language (OQL), the ODMG C++ binding and the ODMG Smalltalk binding. ICDE "Title, General Chairs' Message, Program Chairs' Message, In Memoriam, Committees, Referees, Author Index." 1996 "Title, General Chairs' Message, Program Chairs' Message, In Memoriam, Committees, Referees, Author Index." ICDE Speculative Data Dissemination and Service to Reduce Server Load, Network Traffic and Service Time in Distributed Information Systems. Azer Bestavros 1996 We present two server-initiated protocols to improve the performance of distributed information systems such as the WWW. Our first protocol is a hierarchical data dissemination mechanism that allows information to propagate from its producers to servers that are closer to its consumers. This dissemination reduces network traffic and balances load amongst servers by exploiting geographic and temporal locality of reference properties exhibited in client access patterns. Our second protocol relies on speculative service, whereby a request for a document is serviced by sending, in addition to the document requested, a number of other documents that the server speculates will be requested in the near future. This speculation reduces service time by exploiting the spatial locality of reference property. We present results of trace-driven simulations that quantify the attainable performance gains for both protocols. ICDE Efficient Processing of Outer Joins and Aggregate Functions. Gautam Bhargava,Piyush Goel,Balakrishna R. Iyer 1996 Removal of redundant outer joins is essential for the reassociation of outer joins with other binary operations. In this paper we present a set of comprehensive algorithms that employ the properties of strong predicates along with the properties of aggregation, intersection, union, and except operations to remove redundant outer joins from a query. For the purpose of query simplification, we generate additional projections by determining the keys. Our algorithm for generating keys is based on a novel concept of weak bindings that is essential for queries containing outer joins. Our algorithm for converting outer joins to joins is based on a novel concept of join-reducibility. ICDE Parallel Processing of Spatial Joins Using R-trees.
Thomas Brinkhoff,Hans-Peter Kriegel,Bernhard Seeger 1996 In this paper, we show that spatial joins are very well suited to being processed on a parallel hardware platform. The parallel system is equipped with a so-called shared virtual memory which is well-suited for the design and implementation of parallel spatial join algorithms. We start with an algorithm that consists of three phases: task creation, task assignment and parallel task execution. In order to reduce CPU- and I/O-cost, the three phases are processed in a fashion that preserves spatial locality. Dynamic load balancing is achieved by splitting tasks into smaller ones and reassigning some of the smaller tasks to idle processors. In an experimental performance comparison, we identify the advantages and disadvantages of several variants of our algorithm. The most efficient one shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large. ICDE Parallel Pointer-Based Join Algorithms in Memory-mapped Environments. Peter A. Buhr,Anil K. Goel,Naomi Nishimura,Prabhakar Ragde 1996 Three pointer-based parallel join algorithms are presented and analyzed for environments in which secondary storage is made transparent to the programmer through memory mapping. Buhr, Goel, and Wai have shown that data structures such as B-Trees, R-Trees and graph data structures can be implemented as efficiently and effectively in this environment as in a traditional environment using explicit I/O. Here we show how higher-order algorithms, in particular parallel join algorithms, behave in a memory mapped environment. A quantitative analytical model has been developed to conduct performance analysis of the parallel join algorithms. The model has been validated by experiments. ICDE A Toolkit for Constraint Management in Heterogeneous Information Systems. Sudarshan S. Chawathe,Hector Garcia-Molina,Jennifer Widom 1996 We present a framework and a toolkit to monitor and enforce distributed integrity constraints in loosely coupled heterogeneous information systems. Our framework enables and formalizes weakened notions of consistency, which are essential in such environments. Our framework is used to describe (1) interfaces provided by a database for the data items involved in inter-site constraints; (2) strategies for monitoring and enforcing such constraints; and (3) guarantees regarding the level of consistency the system can provide. Our toolkit uses this framework to provide a set of configurable modules that are used to monitor and enforce constraints spanning loosely coupled heterogeneous information systems. ICDE An Executable Graphical Representation of Mediatory Information Systems. Jacques Calmet,Dirk Debertin,Sebastian Jekutsch,Joachim Schü 1996 In this paper we present an approach towards a unified modeling and query-processing tool for mediatory information systems. Based upon Coloured Petri nets we are able to model the integration of parametric data (external, uncertain and temporal information) and to visualize the dataflow in mediatory information systems. ICDE Query Answering Using Discovered Rules. I-Min A. Chen 1996 Query Answering Using Discovered Rules. ICDE Transaction Coordination for the New Millennium: SQL Server Meets OLE Transactions. David Campbell 1996 Transaction Coordination for the New Millennium: SQL Server Meets OLE Transactions. ICDE Secure Mediated Databases. K. Selçuk Candan,Sushil Jajodia,V. S. 
Subrahmanian 1996 With the evolution of the information superhighway, there is now an immense amount of information available in a wide variety of databases. Furthermore, users often have the ability to access legacy software packages developed by external sources. However, sometimes both the information provided by a data source and one or more of the functions available through a software package may be sensitive; in such cases, organizations require that access by users be controlled. HERMES (HEterogeneous Reasoning and MEdiator System) is a platform that has been developed at the University of Maryland within which mediators may be designed and implemented. HERMES has already been used for a number of applications. In this paper, we provide a formal model of security in mediated systems. We then develop techniques that are sound and complete and respect security constraints of packages/databases participating in the mediated system. The security constraints described in this paper have been implemented, and we describe the existing implementation. ICDE A Transactional Nested Process Management System. Qiming Chen,Umeshwar Dayal 1996 Providing flexible transaction semantics and incorporating activities, data and agents are the key issues in workflow system development. Unfortunately, most of the commercial workflow systems lack the advanced features of transaction models, and an individual transaction model with specific emphasis lacks sufficient coverage for business process management. This report presents our solutions to the above problems in developing the Open Process Management System (OPMS) at HP Labs. OPMS is based on nested activity modeling with the following extensions and constraints: in-process open nesting for extending closed/open nesting to accommodate applications that require improved process-wide concurrency without sacrificing top-level atomicity; confined open as a constraint on open and in-process open activities for avoiding the semantic inconsistencies in activity triggering and compensation; and two-phase remedy as a generalized hierarchical approach for handling failures. ICDE Database Research: Lead, Follow, or Get Out of the Way? - Panel Abstract. Surajit Chaudhuri,Ashok K. Chandra,Umeshwar Dayal,Jim Gray,Michael Stonebraker,Gio Wiederhold,Moshe Y. Vardi 1996 Database Research: Lead, Follow, or Get Out of the Way? - Panel Abstract. ICDE Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. David Wai-Lok Cheung,Jiawei Han,Vincent T. Y. Ng,C. Y. Wong 1996 An incremental updating technique is developed for maintenance of the association rules discovered by database mining. There have been many studies on efficient discovery of association rules in large databases. However, it is nontrivial to maintain such discovered rules in large databases because a database may allow frequent or occasional updates and such updates may not only invalidate some existing strong association rules but also turn some weak rules into strong ones. In this study, an incremental updating technique is proposed for efficient maintenance of discovered association rules when new transaction data are added to a transaction database. ICDE Deferred Updates and Data Placement in Distributed Databases. Parvathi Chundi,Daniel J. Rosenkrantz,S. S. Ravi 1996 Commercial distributed database systems generally support an optional protocol that provides loose consistency of replicas, allowing replicas to be inconsistent for some time. 
In such a protocol, each replicated data item is assigned a primary copy site. Typically, a transaction updates only the primary copies of data items, with updates to other copies deferred until after the transaction commits. After a transaction commits, its updates to primary copies are sent transactionally to the other sites containing secondary copies. We investigate the transaction model underlying the above protocol. We show that global serializability in such a system is a property of the placement of primary and secondary copies of replicated data items. We present a polynomial time algorithm to assign primary sites to data items so that the resulting topology ensures serializability. ICDE Database Extensions for Complex Domains Samuel DeFazio,Jagannathan Srinivasan 1996 Future versions of the Oracle Server will provide an open and extensible framework for supporting complex data domains including, but not limited to, text, image, spatial, video, and OLAP. This framework encompasses features for defining, storing, updating, indexing, and retrieving complex forms of data with full transaction semantics. The underpinning for these features is an extended Oracle Server that is an object-relational database management system (ORDBMS). ICDE Reusing (Shrink Wrap) Schemas by Modifying Concept Schemas. Lois M. L. Delcambre,Jimmy Langston 1996 "A shrink wrap schema is a well-crafted, complete, global schema that represents an application. A concept schema is a subset of the shrink wrap schema that addresses one particular point of view in an application. We define schema modification operations to customize each concept schema, to match the designer's perception of the application. We maintain the integrated, customized user schema. We enforce consistency checks to provide feedback to the designer about interactions among the concept schemas. We embody these mechanisms in an interactive system that aids in shrink wrap schema-based design. The shrink wrap schema approach promotes reuse of past design efforts; prior approaches to schema reuse do not attempt to reuse an entire schema nor do they focus on local customization. The focus of this paper is on the definition of concept schemas and their corresponding modification operations." ICDE SONET Configuration Management with OpenPM. Weimin Du,Ming-Chien Shan,Chris Whitney 1996 SONET (Synchronous Optical NETwork) has been proposed as the backbone of future information superhighway infrastructure. SONET network management, however, is a complex process that involves many heterogeneous systems and applications, as well as human interactions. In this paper, we describe a prototype system developed at Hewlett-Packard (HP) that provides a service for configuring large SONET networks. The prototype differs from the existing systems in that it employs the HP OpenPM (Open Process Management) workflow system to define, execute and monitor network management processes. Using OpenPM (a middleware service that enables the automation of activities supporting complex enterprise business processes in a distributed heterogeneous computing environment) as a reliable and efficient workflow execution engine, this prototype supports efficient distributed network management and easy integration of legacy applications. The paper describes how an example network configuration management process is modeled, executed and monitored using OpenPM. ICDE Authorization and Access Control in IRO-DB. Wolfgang Eßmayr,Fritz Kastner,Günther Pernul,Stefan Preishuber,A. 
Min Tjoa 1996 Authorization and Access Control in IRO-DB. ICDE A Log-Structured Organization for Tertiary Storage. Daniel Alexander Ford,Jussi Myllymaki 1996 We present the design of a log-structured tertiary storage system (LTS). The advantage of this approach is that it allows the system to hide the details of juke-box robotics and media characteristics behind a uniform, random access, block-oriented interface. It also allows the system to avoid media mount operations for writes, giving write performance similar to that of secondary storage. ICDE Relaxed Index Consistency for a Client-Server Database. Vibby Gottemukkala,Edward Omiecinski,Umakishore Ramachandran 1996 Client-Server systems cache data in client buffers to deliver good performance. Several efficient protocols have been proposed to maintain the coherence of the cached data. However, none of the protocols distinguish between index pages and data pages. We propose a new coherence protocol, called Relaxed Index Consistency, that exploits the inherent differences in the coherence and concurrency-control (C&CC) requirements for index and data pages. The key idea is to incur a small increase in computation time at the clients to gain a significant reduction in the number of messages exchanged between the clients and the servers. The protocol uses the concurrency control on data pages to maintain coherence of index pages. A performance-conscious implementation of the protocol that makes judicious use of version numbers is proposed. We show, through both qualitative and quantitative analysis, the performance benefits of making the distinction between index pages and data pages for the purposes of C&CC. Our simulation studies show that the Relaxed Index Consistency protocol improves system throughput by as much as 15% to 88%, based on the workload. ICDE The Microsoft Relational Engine. Goetz Graefe 1996 The Microsoft Relational Engine. ICDE Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. Jim Gray,Adam Bosworth,Andrew Layman,Hamid Pirahesh 1996 Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE A Uniform Indexing Scheme for Object-Oriented Databases. Ehud Gudes 1996 The performance of Object-oriented databases (OODB) is a critical factor hindering their current use. Several indexing schemes have been proposed in the literature for enhancing OODB performance and they are briefly reviewed here. In this paper a new and uniform indexing scheme is proposed. This scheme is based on a single B-tree and combines both the hierarchical and nested indexing schemes (Bertino; Kim). The uniformity of this scheme enables compact and optimized code dealing with a large range of queries on the one hand, and flexibility in adding and removing indexed paths on the other hand. The performance of the scheme is about the same as existing schemes for single-class, exact match or range queries, and much better for multi-class and other complex queries and update. ICDE "Hypermedia Database ``Himotoki'' and Its Applications." Yoshinori Hara,Kyoji Hirata,Hajime Takano,Shigehito Kawasaki 1996 "This paper describes the design concept of a hypermedia database ""Himotoki"" and its navigational capabilities. A hypermedia database is a system that integrates hypermedia operations with database models. 
Advantages are: (a) its structured design can improve authoring and browsing capabilities for large hypermedia applications, (b) nodes and links can be automatically generated under a certain condition, (c) a navigational data interface for the DBMS is obtained, etc. For providing cost-effective operations as hypermedia databases, we introduce set-to-set linking and the navigational functions, i.e., media-based navigation, schema navigation, and moving hot-spot navigation. These functional capabilities make media contents well-organized so as to improve the human-machine interactive interface. Implemented applications such as ""Electronic Aquatic Life"" and ""Hypermedia Museum"" demonstrate the usefulness of Himotoki's navigational functions and the customizability of its architectural design." ICDE Knowledge Discovery from Telecommunication Network Alarm Databases. Kimmo Hätönen,Mika Klemettinen,Heikki Mannila,Pirjo Ronkainen,Hannu Toivonen 1996 A telecommunication network produces large amounts of alarm data daily. The data contains hidden valuable knowledge about the behavior of the network. This knowledge can be used in filtering redundant alarms, locating problems in the network, and possibly in predicting severe faults. We describe the TASA (Telecommunication Network Alarm Sequence Analyzer) system for discovering and browsing knowledge from large alarm databases. The system is built on the basis of viewing knowledge discovery as an interactive and iterative process, containing data collection, pattern discovery, rule postprocessing, etc. The system uses a novel framework for locating frequently occurring episodes from sequential data. The TASA system offers a variety of selection and ordering criteria for episodes, and supports iterative retrieval from the discovered knowledge. This means that a large part of the iterative nature of the KDD process can be replaced by iteration in the rule postprocessing stage. The user interface is based on dynamically generated HTML. The system is in experimental use, and the results are encouraging: some of the discovered knowledge is being integrated into the alarm handling software of telecommunication operators. ICDE Improving the Performance of Multi-Dimensional Access Structures Based on k-d-Trees. Andreas Henrich 1996 In recent years, various k-d-tree based multi-dimensional access structures have been proposed. All these structures share an average bucket utilization of at most ln 2 (about 69.3 %). In this paper we present two algorithms which perform local redistributions of objects to improve the storage utilization of these access structures. We show that under fair conditions a good improvement algorithm can save up to 20 % of space and up to 15 % of query processing time. On the other hand we also show that a local redistribution scheme designed without care can improve the storage utilization and at the same time worsen the performance of range queries drastically. Furthermore we show the dependencies between split strategies and local redistribution schemes and the general limitations which can be derived from these dependencies. ICDE Mining Knowledge Rules from Databases: A Rough Set Approach. Xiaohua Hu,Nick Cercone 1996 In this paper, the principle and experimental results of an attribute-oriented rough set approach for knowledge discovery in databases are described. Our method integrates database operations, rough set theory and machine learning techniques. 
In our method, we consider the learning procedure to consist of two phases: data generalization and data reduction. In the data generalization phase, attribute-oriented induction is performed attribute by attribute using attribute removal and concept ascension; attributes undesirable for the discovery task are removed and the primitive data is generalized to the desired level, so that a set of tuples may be generalized to the same generalized tuple. This procedure substantially reduces the computational complexity of the database learning process. Subsequently, in the data reduction phase, the rough set method is applied to the generalized relation to find a minimal attribute set relevant to the learning task. The generalized relation is reduced further by removing those attributes which are irrelevant and/or unimportant to the learning task. Finally, the tuples in the reduced relation are transformed into different knowledge rules based on different knowledge discovery algorithms. Based upon these principles, a prototype knowledge discovery system DBROUGH has been constructed. In DBROUGH, a variety of knowledge discovery algorithms are incorporated, and different kinds of knowledge rules, such as characteristic rules, classification rules, decision rules, and maximal generalized rules, can be discovered efficiently and effectively from large databases. ICDE HiTi Graph Model of Topographical Roadmaps in Navigation Systems. Sungwon Jung,Sakti Pramanik 1996 In navigation systems, a primary task is to compute the minimum cost route from the current location to the destination. One of the major problems for navigation systems is that a significant amount of computation time is required to find a minimum cost path when the topographical road map is large. Since navigation systems are real time systems, it is critical that the path be computed while satisfying a time constraint. In this paper, we propose a new graph model named HiTi (Hierarchical mulTi graph model) for efficiently computing an optimal minimum cost path. Based on the HiTi graph model, we propose a new single pair minimum cost path algorithm. We empirically show that our proposed algorithm performs far better than the traditional A* algorithm. Further, we empirically analyze our algorithm by varying both the edge cost distribution and the number of hierarchical levels of HiTi graphs. ICDE A Next Generation Industry Multimedia Database System. Hiroshi Ishikawa,Koki Kato,Miyuki Ono,Naomi Yoshikawa,Kazumi Kubota,Akiko Kondo 1996 New multimedia applications have emerged on top of information infrastructures, such as on-demand services, digital libraries and museums, online shopping, and document management, which require new databases. That is, next-generation database systems must enable users to efficiently and flexibly develop and execute such advanced multimedia applications. We focus on the development of a database system which enables flexible and efficient acquisition, storage, access and retrieval, and distribution and presentation of large amounts of heterogeneous media data. We take an approach based on an object-oriented database, which is more suitable for the description of media structures and operations than a traditional relational database. We also extend the object-oriented approach by providing temporal and spatial operators, and control of distributed computing and QOS (quality of service). In this paper, we describe a multimedia data model and its efficient implementation. 
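The rough-set abstract above (Hu and Cercone) describes attribute-oriented generalization: attributes that cannot be generalized are removed, the remaining values are climbed up a concept hierarchy, and identical generalized tuples are merged. The following is a minimal sketch of that generalization step only, not the authors' DBROUGH system; the concept hierarchies, attribute names, and sample tuples are invented for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchies: each maps a primitive value to its parent
# concept one level up ("concept ascension"). Values absent from a hierarchy
# are left unchanged.
CONCEPT_HIERARCHY = {
    "city": {"Vancouver": "British Columbia", "Victoria": "British Columbia",
             "Toronto": "Ontario", "Ottawa": "Ontario"},
    "gpa":  {"3.9": "excellent", "3.8": "excellent", "3.1": "good", "2.9": "average"},
}

def generalize(tuples, attributes, drop=("name",), levels=1):
    """Attribute-oriented generalization: drop attributes that cannot be
    generalized (attribute removal), climb each remaining attribute's concept
    hierarchy `levels` times (concept ascension), and merge identical
    generalized tuples, keeping a count for each one."""
    keep = [a for a in attributes if a not in drop]
    counts = Counter()
    for t in tuples:
        generalized = []
        for a in keep:
            v = t[a]
            for _ in range(levels):
                v = CONCEPT_HIERARCHY.get(a, {}).get(v, v)
            generalized.append(v)
        counts[tuple(generalized)] += 1
    return keep, counts

students = [
    {"name": "Ann", "city": "Vancouver", "gpa": "3.9"},
    {"name": "Bob", "city": "Victoria",  "gpa": "3.8"},
    {"name": "Cal", "city": "Toronto",   "gpa": "3.1"},
]
attrs, table = generalize(students, ["name", "city", "gpa"])
for tup, count in table.items():
    print(dict(zip(attrs, tup)), "count =", count)
# The first two students collapse into one generalized tuple
# ('British Columbia', 'excellent') with count 2.
```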
ICDE Refined Triggering Graphs: A Logic-Based Approach to Termination Analysis in an Active Object-Oriented Database. Anton P. Karadimce,Susan Darling Urban 1996 We present the notion of refined triggering graphs (RTG) for analyzing termination of active rules in object-oriented databases (OODBs). The RTG method consists of mapping the possibility that one active rule can trigger another to the satisfiability of a well-defined logic formula called a triggering formula. The unsatisfiability of the triggering formula is then an indication that the rule triggering possibility is nil. We identify three increasingly more powerful types of triggering formulae and give pointers to the corresponding satisfiability procedures. ICDE Electronic Catalogs - Panel. Arthur M. Keller,Don Brown,Anna-Lena Neches,Sherif Danish,Daniel Barbará 1996 Electronic Catalogs - Panel. ICDE Towards Eliminating Random I/O in Hash Joins. Ming-Ling Lo,Chinya V. Ravishankar 1996 The widening performance gap between CPU and disk is significant for hash join performance. Most current hash join methods try to reduce the volume of data transferred between memory and disk. In this paper, we try to reduce hash-join times by reducing random I/O. We study how current algorithms incur random I/O, and propose a new hash join method, Seq+, that converts much of the random I/O to sequential I/O. Seq+ uses a new organization for hash buckets on disk, and larger input and output buffer sizes. We introduce the technique of batch writes to reduce the bucket-write cost, and the concepts of write- and read-groups of hash buckets to reduce the bucket-read cost. We derive a cost model for our method, and present formulas for choosing various algorithm parameters, including input and output buffer sizes. Our performance study shows that the new hash join method performs many times better than current algorithms under various environments. Since our cost functions under-estimate the cost of current algorithms and over-estimate the cost of Seq+, the actual performance gain of Seq+ is likely to be even greater. ICDE DB2 LOBs: The Teenage Years. Tobin J. Lehman,Patrick Gainer 1996 Previous versions of DB2 Common Server had large objects (LOBs) that were neither large nor functional. Their size was limited to 32,700 bytes and, until recently when support for SUBSTR and CONCAT was added, there was no function available on these objects at all. DB2 LOBs were infants. However, with the latest release of DB2 Common Server, Version 2.1, LOBs have matured considerably, supporting significantly larger sizes and many new language features. To give the reader a feeling for the extent of this new language support, we compare our new SQL LOB language features with that of three other major Relational database competitors: Sybase, Informix and Oracle. Users will find the new DB2 LOBS easy to load and store, easy to search, and easy to integrate into the DB2 user-defined functions (UDFs) and user-defined types (UDTs). In addition, when used in serial mode, the performance of LOB I/O rivals that of file systems and, when used in parallel mode, is a clear winner. DB2 LOBs have now entered the teenage years. ICDE Using Object-Oriented Principles to Optimize Update Propagation to Materialized Views. Harumi A. Kuno,Elke A. 
Rundensteiner 1996 "View materialization is known to be a valuable technique for performance optimization in relational databases, and much work has been done addressing the problem of consistently maintaining relational views under update operations. However, little progress has been made thus far regarding the topic of view materialization in object-oriented databases (OODBs). In this paper, we demonstrate that there are several significant differences between the relational and object-oriented paradigms that can be exploited when addressing the object-oriented view materialization problem. We can use the subsumption relationships between classes to identify branches of classes to which we do not need to propagate updates. Similarly, we can use encapsulated interfaces combined with the fact that any unique database property is inherited from a single location to provide a ``registration/notification'' service for optimizing incremental view updates. We have successfully implemented all proposed techniques in the MultiView system, which provides updatable materialized classes and virtual schemata on top of the GemStone OODBMS. We also report results from the experimental studies we have run on the MultiView system measuring the impact of various optimization strategies incorporated into our materialization update algorithms." ICDE HierarchyScan: A Hierarchical Similarity Search Algorithm for Databases of Long Sequences. Chung-Sheng Li,Philip S. Yu,Vittorio Castelli 1996 We present a hierarchical algorithm, HierarchyScan, that efficiently locates one-dimensional subsequences within a collection of sequences of arbitrary length. The subsequences identified by HierarchyScan match a given template pattern in a scale- and phase-independent fashion. The idea is to perform correlation between the stored sequences and the template in the transformed domain hierarchically. Only those subsequences whose maximum correlation value is higher than a predefined threshold will be selected. The performance of this approach is compared to the sequential scanning and an order-of-magnitude speedup is observed. ICDE The Ode Active Database: Trigger Semantics and Implementation. Daniel F. Lieuwen,Narain H. Gehani,Robert M. Arlein 1996 Triggers are the basic ingredient of active databases. Ode triggers are event-action pairs. An event can be a composite event (i.e., an event composed from other events). Composite events are detected by translating the event specifications into finite state machines. In this paper, we describe the integration and implementation of composite event based triggers into the Ode object database. We focus on implementation details such as the basic trigger events supported, the efficient posting of these events, the handling of transaction-related events, and the integration of triggers into a real database. We also describe the run-time facilities used to support trigger processing and describe some experiences we gained while implementing triggers. We illustrate Ode trigger facilities with a credit card example. ICDE A Distributed Query Processing Strategy Using Placement Dependency. Chengwen Liu,Hao Chen,Warren Krueger 1996 We present an algorithm to make use of placement dependency information to process distributed queries. 
Our algorithm first partitions the referenced relations of a given query into a number of non-exclusive subsets such that the fragmented relations within a subset have placement dependency and the join operation(s) associated with the relations in the subset can be locally processed without data transfer. Each subset is associated with a set of sites and can be used to generate an execution plan for the given query by keeping the fragmented relations in the subset fragmented at the sites where they are situated while replicating the other referenced relations at each of the processing sites. Among the alternatives, our algorithm picks the plan that gives the minimum response time. Our experimental results show that our algorithm improves response time significantly. ICDE Are We Moving Toward an Information SuperHighway or a Tower of Babel? The Challenge of Large-Scale Semantic Heterogeneity. Stuart E. Madnick 1996 "The popularity and growth of the ""Information Super Highway"" have dramatically increased the number of information sources available for use. Unfortunately, there are significant challenges to be overcome. One particular problem is context interchange, whereby each source of information and potential receiver of that information may operate with a different context, leading to large-scale semantic heterogeneity. A context is the collection of implicit assumptions about the context definition (i.e., meaning) and context characteristics (i.e., quality) of the information. This paper describes various forms of context challenges and examples of potential context mediation services, such as data semantics acquisition, data quality attributes, and evolving semantics and quality, that can mitigate the problem." ICDE ActionWorkflow in Use: Clark County Department of Business License. Raul Medina-Mora,Kelly W. Cartron 1996 In this paper we present the basic concepts of ActionWorkflow and a study of a successful implementation in Clark County Department of Business License. The Image/Workflow System reengineers a labyrinthine licensing system into simplistic processes that are more customer oriented, yield superior productivity, establish a work-in-progress tracking mechanism, and archive the resulting licensing processes permanently on an unalterable optical storage system. ICDE SQL3 Update. Jim Melton 1996 The third major generation of the standard for the SQL database language, known as SQL3, is nearing completion. Many significant new features have been added to SQL, both in areas traditionally addressed by SQL and in pursuit of adding object technology to the language. The standard has been partitioned into a number of distinct parts, each of which may progress at its own rate. Publication of SQL3 as a replacement for the current version of the standard, SQL-92, is expected no sooner than 1998. ICDE Performance Analysis of Several Algorithms for Processing Joins between Textual Attributes. Weiyi Meng,Clement T. Yu,Wei Wang,Naphtali Rishe 1996 Three algorithms for processing joins on attributes of textual type are presented and analyzed in this paper. Since such joins often involve document collections of very large size, it is very important to find efficient algorithms to process them. The three algorithms differ on whether the documents themselves or the inverted files on the documents are used to process the join. 
Our analysis and the simulation results indicate that the relative performance of these algorithms depends on the input document collections, system characteristics and the input query. For each algorithm, the type of input document collections with which the algorithm is likely to perform well is identified. ICDE A Groupware Benchmark Based on Lotus Notes. Kenneth Moore,Michelle Peterson 1996 In this paper, we propose a new benchmark for groupware systems. It incorporates elements of previous messaging, text retrieval, and database benchmarks. The benchmark is based on groupware functions found in Lotus Notes, but should be adaptable by any groupware system. ICDE Automating the Assembly of Presentations from Multimedia Databases. Gultekin Özsoyoglu,Veli Hakkoymaz,Joel Kraft 1996 A multimedia presentation refers to the presentation of multimedia data using output devices such as monitors for text and video, and speakers for audio. Each presentation consists of multimedia segments which are obtained from a multimedia data model. In this paper, we propose to express semantic coherency of a multimedia presentation in terms of presentation inclusion and exclusion constraints that are incorporated into the multimedia data model. Thus, when a user specifies a set of segments for a presentation, the DBMS adds segments into and/or deletes segments from the set in order to satisfy the inclusion and exclusion constraints. To automate the assembly of a presentation with concurrent presentation streams, we also propose presentation organization constraints that are incorporated into the multimedia data model, independent of any presentation. We give two algorithms for automated presentation assembly and discuss their complexity. We discuss the satisfiability of inclusion and exclusion constraints when negation is allowed. And, we briefly describe a prototype system that is being developed for automated presentation assembly. ICDE Auditory Browsing for Acquisition of Information in Cyberspace. Naoto Oki,Kunio Teramoto,Ken-ichi Okada,Yutaka Matsushita 1996 Today, various novel telecommunication systems have been proposed. The idea of a virtual communication space shared by multiple distributed users via networks, so-called cyberspace, has received considerable attention. In cyberspace, there are a tremendous number of users, on-line services, and sources of information, and users have ample opportunity to encounter all of these. We think the most serious problem in cyberspace is the great risk that a user might miss a chance to get relevant or important information. In this paper, we propose a strategy for auditory browsing to address this problem, using a spatial sound interface. We implemented VCP (Virtual Cocktail Party), an experimental system for achieving efficient and flexible telecommunication and data retrieval, which takes advantage of human auditory capability. This system can support a number of physically separated users in a single shared sound cyberspace and consists of distributed terminals with a spatial sound interface. ICDE Client-Based Logging for High Performance Distributed Architectures. Euthimios Panagos,Alexandros Biliris,H. V. Jagadish,Rajeev Rastogi 1996 In this paper, we propose logging and recovery algorithms for distributed architectures that use local disk space to provide transactional facilities locally. Each node has its own log file where all log records for updates to locally cached pages are written. 
Transaction rollback and node crash recovery are handled exclusively by each node and log files are not merged at any time. Our algorithms do not require any form of time synchronization between nodes and nodes can take checkpoints independently of each other. Finally, our algorithms make possible a new paradigm for distributed transaction management that has the potential to exploit all available resources and improve scalability and performance. ICDE MedMaker: A Mediation System Based on Declarative Specifications. Yannis Papakonstantinou,Hector Garcia-Molina,Jeffrey D. Ullman 1996 Mediators are used for integration of heterogeneous information sources. We present a system for declaratively specifying mediators. It is targeted for integration of sources with unstructured or semi-structured data and/or sources with changing schemas. We illustrate the main features of the Mediator Specification Language (MSL), show how they facilitate integration, and describe the implementation of the system that interprets the MSL specifications. ICDE A Directory Service for a Federation of CIM Databases with Migrating Objects. Ajit K. Patankar,Arie Segev,J. George Shanthikumar 1996 We propose a novel directory scheme for a large federation of databases where object migration is in response to manufacturing events. In our directory scheme, objects report their location to a directory server instead of the traditional method of the directory servers polling sites in the network. The directory is distributed among multiple servers to avoid bottleneck during query processing. A distributed Linear Hashing algorithm is proposed for efficiently determining an appropriate server for an object. Finally, a stochastic dynamic programming model is proposed for minimizing the number of database transactions. ICDE Towards the Reverse Engineering of Denormalized Relational Databases. Jean-Marc Petit,Farouk Toumani,Jean-François Boulicaut,Jacques Kouloumdjian 1996 This paper describes a method to cope with denormalized relational schemas in a database reverse engineering process. We propose two main steps to improve the understanding of data semantics. Firstly we extract inclusion dependencies by analyzing the equi-join queries embedded in application programs and by querying the database extension. Secondly we show how to discover only functional dependencies which influence the way attributes should be restructured. The method is interactive since an expert user has to validate the presumptions on the elicited dependencies. Moreover, a restructuring phase leads to a relational schema in third normal form provided with key dependencies and referential integrity constraints. Finally, we sketch how an Entity-Relationship schema can be derived from such information. ICDE Query Folding. Xiaolei Qian 1996 Query folding refers to the activity of determining if and how a query can be answered using a given set of resources, which might be materialized views, cached results of previous queries, or queries answerable by other databases. We investigate query folding in the context where queries and resources are conjunctive queries. We develop an exponential-time algorithm that finds all complete or partial foldings, and a polynomial-time algorithm for the subclass of acyclic conjunctive queries. Our results can be applied to query optimization in centralized databases, to query processing in distributed databases, and to query answering in federated databases. 
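The query folding abstract above (Qian) asks whether a conjunctive query can be answered from a given set of resources. A standard building block for this kind of reasoning is the containment-mapping test for conjunctive queries. The sketch below is a minimal brute-force illustration of that test, not the paper's folding algorithm; the query encoding, the variable-naming convention, and the example predicates r and s are our own assumptions.

```python
from itertools import product

def is_var(term):
    # Convention for this sketch: variables are strings starting with a capital letter.
    return isinstance(term, str) and term[:1].isupper()

def containment_mapping(q_from, q_to):
    """Brute-force search for a containment mapping from q_from into q_to.
    A query is (head, body): head is a tuple of terms, body a list of
    (predicate, (terms...)) atoms. If a mapping exists, every answer of
    q_to is also an answer of q_from (q_to is contained in q_from)."""
    head_from, body_from = q_from
    head_to, body_to = q_to
    vars_from = sorted({t for _, args in body_from for t in args if is_var(t)}
                       | {t for t in head_from if is_var(t)})
    targets = sorted({t for _, args in body_to for t in args} | set(head_to), key=str)
    atoms_to = {(p, tuple(args)) for p, args in body_to}
    for combo in product(targets, repeat=len(vars_from)):
        mapping = dict(zip(vars_from, combo))
        subst = lambda t: mapping.get(t, t) if is_var(t) else t
        if tuple(subst(t) for t in head_from) != tuple(head_to):
            continue
        if all((p, tuple(subst(t) for t in args)) in atoms_to for p, args in body_from):
            return mapping
    return None

# Query q(X) :- r(X, Y), s(Y, "z")  and a view v(A) :- r(A, B), s(B, "z").
query = (("X",), [("r", ("X", "Y")), ("s", ("Y", "z"))])
view  = (("A",), [("r", ("A", "B")), ("s", ("B", "z"))])
print(containment_mapping(view, query))   # {'A': 'X', 'B': 'Y'}
```

A mapping from one query into another shows that the second is contained in the first; folding algorithms use such evidence to decide when a resource's answers can stand in for part of a query.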
ICDE A Hybrid Object Clustering Strategy for Large Knowledge-Based Systems. Arun Ramanujapuram,Jim E. Greer 1996 Object bases underlying knowledge-based applications tend to be complex and require management. This research aims at improving the performance of object bases underlying a class of large knowledge-based systems that utilize object-oriented technology to engineer the knowledge base. In this paper, a hybrid clustering strategy that beneficially combines semantic clustering and iterative graph-partitioning techniques has been developed and evaluated for use in knowledge bases storing information in the form of object graphs. It is demonstrated via experimentation that such a technique is useful and feasible in realistic object bases. A semantic specification mechanism similar to placement trees has been developed for specifying the clustering. The workload and the nature of object graphs in knowledge bases differ significantly from those present in conventional object-oriented databases. Therefore, the evaluation has been performed by building a new benchmark called the Granularity Benchmark. A segmented storage scheme for the knowledge base using large object storage mechanisms of existing storage managers is also examined. ICDE Evaluation and Optimization of the LIVING IN A LATTICE Rule Language. Holger Riedel,Andreas Heuer 1996 We introduce an evaluation technique for a declarative OODB query language. The query language is rule-based and can be evaluated and optimized using an appropriate object algebra. We introduce a new framework which uses concepts of the object-oriented data model to define adequate accesses to the database. Additionally, the problems concerning the evaluation of recursive queries are discussed. ICDE Complex Query Decorrelation. Praveen Seshadri,Hamid Pirahesh,T. Y. Cliff Leung 1996 Complex queries used in decision support applications use multiple correlated subqueries and table expressions, possibly across several levels of nesting. It is usually inefficient to directly execute a correlated query; consequently, algorithms have been proposed to decorrelate the query, i.e., to eliminate the correlation by rewriting the query. This paper explains the issues involved in decorrelation, and surveys existing algorithms. It presents an efficient and flexible algorithm called magic decorrelation which is superior to existing algorithms both in terms of the generality of application, and the efficiency of the rewritten query. The algorithm is described in the context of its implementation in the Starburst Extensible Database System, and its performance is compared with other decorrelation techniques. The paper also explains why magic decorrelation is not merely applicable, but crucial in a parallel database system. ICDE Approximate Queries and Representations for Large Data Sequences. Hagit Shatkay,Stanley B. Zdonik 1996 Many new database application domains such as experimental sciences and medicine are characterized by large sequences as their main form of data. Using approximate representation can significantly reduce the required storage and search space. A good choice of representation can support a broad new class of approximate queries, needed in these domains. These queries are concerned with application-dependent features of the data as opposed to the actual sampled points. We introduce a new notion of generalized approximate queries and a general divide and conquer approach that supports them. 
This approach uses families of real-valued functions as an approximate representation. We present an algorithm for realizing our technique, and the results of applying it to medical cardiology data. ICDE DSDT: Durable Scripts Containing Database Transactions. Betty Salzberg,Dimitri Tombroff 1996 "DSDT is a proposed method for creating durable scripts which contain short ACID transactions as components. Workflow scripts are an example. The context of the script is made durable by writing a log record whenever an event occurs which cannot be replayed. Log checkpoints are used to minimize recovery time. DSDT can be written in stand-alone mode communicating with DBMSs by transactional remote procedure calls and maintaining its own logging system or it can be made part of a DBMS by modifying the DBMS transaction manager source code. DSDT provides a panic button (signal-exit) and the ability to specify what action should be taken on restart after system failure. The programmer can also specify actions such as ""compensation"" transactions to be taken after another signal (signal-cancel) arrives. DSDT enables most extended transaction models to be expressed in scripts modulo the guarantees of compensation. Recovery after system failure is shown to be correct." ICDE Workflow and Data Management in InConcert. Sunil K. Sarin 1996 InConcert is an object-oriented client-server workflow management system. An overview is provided of the functionality of InConcert and how it is implemented on an underlying relational database management system. Data management issues in supporting distributed workflow are briefly reviewed. ICDE "What's in a WWW Link? - Panel." Amit P. Sheth,Robert Meersman,Erich J. Neuhold,Calton Pu,V. S. Subrahmanian 1996 "What's in a WWW Link? - Panel." ICDE A Graph-Theoretic Approach to Indexing in Object-Oriented Databases. Boris Shidlovsky,Elisa Bertino 1996 A graph-theoretic approach to the path indexing problem is proposed. We represent the indexing relationships supported by indices allocated in the classes in the path in the form of a directed graph. All the previous approaches directly fit into the scheme and form a hierarchy of complexity with respect to the time required for selection of the optimal index configuration. Based on the general scheme, we develop a new approach to the path indexing problem exploiting the notion of visibility graph. We introduce a generalized nested-inherited index, give algorithms for retrieval and update operations and compare the behavior of the new structure with previous approaches. ICDE Data Replication in Mariposa. Jeff Sidell,Paul M. Aoki,Adam Sah,Carl Staelin,Michael Stonebraker,Andrew Yu 1996 The Mariposa distributed data manager uses an economic model for managing the allocation of both storage objects and queries to servers. We present extensions to the economic model which support replica management, as well as our mechanisms for propagating updates among replicas. We show how our replica control mechanism can be used to provide consistent, although potentially stale, views of data across many machines without expensive per-transaction synchronization. We present a rule-based conflict resolution mechanism, which can be used to enhance traditional time-stamp serialization. We discuss the effects of our replica system on query processing for both read-only and read-write queries. We further demonstrate how the replication model and mechanisms naturally support name service in Mariposa. 
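A few entries above, the approximate-representation abstract (Shatkay and Zdonik) describes representing long sequences by families of real-valued functions found with a divide-and-conquer strategy. The sketch below illustrates the general flavor of such an approach with recursive piecewise-linear fitting; the least-squares fit, the split-in-half rule, and the error tolerance are our own simplifying assumptions, not the paper's algorithm.

```python
def fit_line(xs, ys):
    """Least-squares line fit; returns (slope, intercept, max absolute error)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs) or 1.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    intercept = mean_y - slope * mean_x
    err = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return slope, intercept, err

def segment(xs, ys, tol):
    """Divide and conquer: keep one line per segment while the maximum
    deviation stays within tol, otherwise split the segment in half.
    Returns a list of (start_x, end_x, slope, intercept) pieces."""
    slope, intercept, err = fit_line(xs, ys)
    if err <= tol or len(xs) <= 2:
        return [(xs[0], xs[-1], slope, intercept)]
    mid = len(xs) // 2
    return (segment(xs[:mid + 1], ys[:mid + 1], tol) +
            segment(xs[mid:], ys[mid:], tol))

# Toy "sequence": a ramp followed by a plateau, approximated within 0.5 units.
xs = list(range(20))
ys = [x if x < 10 else 10 for x in xs]
for piece in segment(xs, ys, tol=0.5):
    print(piece)
```

Queries about features such as slopes or plateau lengths can then be posed against the handful of fitted pieces instead of the raw samples, which is the kind of feature-oriented approximate querying the abstract motivates.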
ICDE Synthesizing Distributed Constrained Events from Transactional Workflow. Munindar P. Singh 1996 Workflows are the semantically appropriate composite activities in heterogeneous computing environments. Such environments typically comprise a great diversity of locally autonomous databases, applications, and interfaces. Much good research has focused on the semantics of workflows, and how to capture them in different extended transaction models. Here we address the complementary issues pertaining to how workflows may be declaratively specified, and how distributed constraints may be derived from those specifications to enable local control, thus obviating a centralized scheduler. Previous approaches to this problem were limited and often lacked a formal semantics. ICDE Using Partial Differencing for Efficient Monitoring of Deferred Complex Rule Conditions. Martin Sköld,Tore Risch 1996 Presents a difference calculus for determining changes to rule conditions in an active DBMS. The calculus has been used for implementing an algorithm to efficiently monitor rules with complex conditions. The calculus is based on partial differencing of queries derived from rule conditions. For each rule condition, several partially differentiated queries are generated that each considers changes to a single base relation or view that the condition depends on. The calculus considers both insertions and deletions. The algorithm is optimized for deferred rule condition monitoring in transactions with few updates. The calculus allows us to optimize both space and time. Space optimization is achieved since the calculus and the algorithm do not presuppose materialization of monitored conditions to find their previous states. This is achieved by using a breadth-first, bottom-up propagation algorithm and by calculating previous states by doing a logical rollback. Time optimization is achieved through incremental evaluation techniques. The algorithm has been implemented and a performance study is presented at the end of the paper. ICDE Consistency and Performance of Concurrent Interactive Database Applications. Konstantinos Stathatos,Stephen Kelley,Nick Roussopoulos,John S. Baras 1996 In many modern database applications, there is an emerging need for interactive environments where users directly manipulate the contents of the database. Graphical user interfaces (GUIs) display images of the database which must reflect a consistent up-to-date state of the data with minimum perceivable delay to the user. Moreover, the possibility of several applications concurrently displaying different views of the same database increases the overall system complexity. In this paper, we show how design, performance and concurrency issues can be addressed by adapting existing database techniques. We propose the use of suitable display schemas whose instances compose active views of the database, an extended client caching scheme which is expected to yield significant performance benefits, and a locking mechanism that maintains consistency between the GUIs and the database. ICDE High Availability in Clustered Multimedia Servers. Renu Tewari,Daniel M. Dias,Rajat Mukherjee,Harrick M. Vin 1996 Clustered multimedia servers, consisting of interconnected nodes and disks, have been proposed for large scale servers that are capable of supporting multiple concurrent streams which access the video objects stored in the server. As the number of disks and nodes in the cluster increases, so does the probability of a failure. 
With data striped across all disks in a cluster, the failure of a single disk or node results in the disruption of many or all streams in the system. Guaranteeing high availability in such a cluster becomes a primary requirement to ensure continuous service. In this paper, we study mirroring and software RAID schemes with different placement strategies that guarantee high availability in the event of disk and node failures, while satisfying the real-time requirements of the streams. We examine various declustering techniques for spreading the redundant information across disks and nodes and show that random declustering has good real-time performance. Finally, we compare the overall cost per stream for different system configurations. We derive the parameter space where mirroring and software RAID apply, and determine optimal parity group sizes. ICDE Delta-Sets for Optimized Reactive Adaptive Playout Management in Distributed Multimedia Database Systems. Heiko Thimm,Wolfgang Klas 1996 A novel database system service called the playout management service, which performs multimedia presentations, was proposed recently. In distributed multimedia database systems without end-to-end guarantees, such a playout management service faces the potential problem that system performance can become insufficient when realizing a stored presentation. This problem can be overcome by making the playout management service reactive such that it balances the data amount to be fetched from a remote multimedia database with the system performance available. For the users, this means that a running presentation is adapted by the playout management service. In this paper, we propose the concept of a delta-set to adapt the execution of arbitrary multimedia presentations in an optimized way. We show a heuristic scheme to identify the most adequate delta-set with respect to (1) the actual system state, (2) the user preferences, and (3) the specific properties of multimedia presentations. ICDE PICSDesk: A Case Study on Business Process Re-engineering. Manolis M. Tsangaris,Madhur Kohli,Shamim A. Naqvi,Richard Nunziata,Yatin P. Saraiya 1996 Presents first-hand experiences from an actual re-engineering project. Business re-engineering is a process affecting not only the software system involved, but the underlying business model as well. Indeed, it is the changed business model along with the new technologies that determine the design of the new system. This paper is a walkthrough of the design of PICSDESK, a prototype incorporating some modern technologies to support the old business model of its predecessor and its evolution to a new business model. PICSDESK may be thought of as an example of a new breed of inventory control systems. ICDE Applying a Flexible OODBMS-IRS-Coupling for Structured Document Handling. Marc Volz,Karl Aberer,Klemens Böhm 1996 In document management systems it is desirable to provide content-based access to documents going beyond regular expression search in addition to access based on structural characteristics or associated attributes. We present a new approach for coupling OODBMSs (Object Oriented Database Management Systems) and IRSs (Information Retrieval Systems) that provides enhanced flexibility and functionality as compared to coupling approaches reported in the literature. 
Our approach allows one to decide freely to which document collections (used as retrieval contexts) document objects belong, which text contents they provide for retrieval, and how they derive their associated retrieval values, either directly from the retrieval machine or from the values of related objects. In particular, we show how in this approach different strategies can be applied to hierarchically structured documents, possibly avoiding redundancy and IRS or OODBMS peculiarities. Content-based and structural queries can be freely combined within the OODBMS query language. ICDE The Mentor Project: Steps Toward Enterprise-Wide Workflow Management. Dirk Wodtke,Jeanine Weißenfels,Gerhard Weikum,Angelika Kotz Dittrich 1996 Enterprise-wide workflow management, where workflows may span multiple organizational units, requires particular consideration of scalability, heterogeneity, and availability issues. The Mentor project, which is introduced in this paper, aims to reconcile a rigorous workflow specification method with a distributed middleware architecture as a step towards enterprise-wide solutions. The project uses the formalism of state and activity charts and a commercial tool, Statemate, for workflow specification. A first prototype of Mentor has been built which allows executing specifications in a distributed manner. A major contribution of this paper is the method for transforming a centralized state chart specification into a form that is amenable to distributed execution and that incorporates the necessary synchronization between different processing entities. Fault tolerance issues are addressed by coupling Mentor with the Tuxedo TP monitor. ICDE Representing Retroactive and Proactive Versions in Bi-Temporal Databases. Jongho Won,Ramez Elmasri 1996 Bi-Temporal databases allow users to record retroactive (past) and proactive (future planned) versions of an entity, and to retrieve the appropriate versions for bi-temporal queries that involve both valid-time and transaction-time. Currently used timestamp representations are mainly for either valid-time or transaction-time databases. In this paper, we first categorize the types of problems that can occur in existing models. These are (1) ambiguity, (2) priority specification, and (3) lost information. We then propose a 2TDB model that allows both retroactive and proactive versions, overcomes the identified problems, and permits the correction of recorded facts. ICDE Energy-Efficient Caching for Wireless Mobile Computing. Kun-Lung Wu,Philip S. Yu,Ming-Syan Chen 1996 Caching can reduce the bandwidth requirement in a mobile computing environment. However, due to battery power limitations, a wireless mobile computer may often be forced to operate in a doze or even totally disconnected mode. As a result, the mobile computer may miss some cache invalidation reports broadcast by a server, forcing it to discard the entire cache contents after waking up. In this paper, we present an energy-efficient cache invalidation method, called GCORE, that allows a mobile computer to operate in a disconnected mode to save battery while still retaining most of the caching benefits after a reconnection. We present an efficient implementation of GCORE and conduct simulations to evaluate its caching effectiveness. The results show that GCORE can substantially improve mobile caching by reducing the communication bandwidth (or energy consumption) for query processing. ICDE Similarity Indexing with the SS-tree. David A. 
White,Ramesh Jain 1996 "Efficient indexing of high dimensional feature vectors is important to allow visual information systems and a number other applications to scale up to large databases. In this paper, we define this problem as ""similarity indexing"" and describe the fundamental types of ""similarity queries"" that we believe should be supported. We also propose a new dynamic structure for similarity indexing called the similarity search tree or SS-tree. In nearly every test we performed on high dimensional data, we found that this structure performed better than the R*-tree. Our tests also show that the SS-tree is much better suited for approximate queries than the R*-tree." ICDE Search and Ranking Algorithms for Locating Resources on the World Wide Web. Budi Yuwono,Dik Lun Lee 1996 Applying information retrieval techniques to the World Wide Web (WWW) environment is a challenge, mostly because of its hypertext/hypermedia nature and the richness of the meta-information it provides. We present four keyword-based search and ranking algorithms for locating relevant WWW pages with respect to user queries. The first algorithm, Boolean Spreading Activation, extends the notion of word occurrence in the Boolean retrieval model by propagating the occurrence of a query word in a page to other pages linked to it. The second algorithm, Most-cited, uses the number of citing hyperlinks between potentially relevant WWW pages to increase the relevance scores of the referenced pages over the referencing pages. The third algorithm, TFxIDF vector space model, is based on word distribution statistics. The last algorithm, Vector Spreading Activation, combines TFxIDF with the spreading activation model. We conducted an experiment to evaluate the retrieval effectiveness of these algorithms. From the results of the experiment, we draw conclusions regarding the nature of the WWW environment with respect to document ranking strategies. ICDE Transaction Management for a Distributed Object Storage System WAKASHI - Design, Implementation and Performance. Ge Yu,Kunihiko Kaneko,Guangyi Bai,Akifumi Makinouchi 1996 "This paper presents the transaction management in a high performance distributed object storage system WAKASHI. Unlike other systems that use centralized client/server architecture and offer conventional buffer management for distributed persistent object management, WAKASHI is based on symmetric peer-peer architecture and employs memory-mapping and distributed shared virtual memory techniques. Several novel techniques on transaction management for WAKASHI are developed. First, a multi-threaded transaction manager offers ``multi-threaded connection'' so that data control and transaction operations can be performed in parallel manner. Secondly, a concurrency control mechanism supports transparent page-level locks to reduce the complexity of user programs and locking overhead. Thirdly, a ``compact commit'' method is proposed to minimize the communication cost by reducing the amount of data and the number of connections. Fourthly, a redo-only recovery method is implemented by ``shadowed cache'' method to minimize the logging cost, and to allow fast recovery and system restart. Moreover, the system offers ``hierarchical'' control to support nested transactions. A performance evaluation by the OO7 benchmark is presented, as well." SIGMOD Conference A Super Scalar Sort Algorithm for RISC Processors. Ramesh C. 
Agarwal 1996 "The compare and branch sequences required in a traditional sort algorithm cannot efficiently exploit multiple execution units present in currently available high performance RISC processors. This is because of the long latency of the compare instructions and the sequential algorithm used in sorting. With the increased level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms which eliminate almost all the compares, provide functional parallelism which can be exploited by multiple execution units, significantly reduce the number of passes through keys, and improve data locality. These new algorithms outperform traditional sort algorithms by a large factor. For the Datamation disk to disk sort benchmark (one million 100-byte records), at SIGMOD'94, Chris Nyberg et al. presented several new performance records using DEC alpha processor based systems. We have implemented the Datamation sort benchmark using our new sort algorithm on a desktop IBM RS/6000 model 39H (66.6 MHz) with 8 IBM SSA 7133 disk drives (total cost $73K). The total elapsed time for the 100 MB sort was 5.1 seconds (vs the old uni-processor record of 9.1 seconds). We have also established a new price performance record (0.2¢ vs the old record of 0.9¢, as the cost of the sort). The entire sort processing was overlapped with I/O. During the read phase, we achieved a sustained BW of 47 MB/sec and during the write phase, we achieved a sustained BW of 39 MB/sec. Key extraction and sorting of one million 10-byte keys took only 0.6 second of CPU time. The rest of the CPU time was used in moving records, servicing I/O, and other overheads. Algorithmic details leading to this level of performance are described in this paper. A detailed analysis of the CPU time spent during various phases of the sort algorithm and I/O is also provided." SIGMOD Conference Two Techniques for On-Line Index Modification in Shared Nothing Parallel Databases. Kiran J. Achyutuni,Edward Omiecinski,Shamkant B. Navathe 1996 Whenever data is moved across nodes in the parallel database system, the indexes need to be modified too. Index modification overhead can be quite severe because there can be a large number of indexes on a relation. In this paper, we study two alternatives to index modification, namely OAT (One-At-a-Time page movement) and BULK (bulk page movement). OAT and BULK are two extremes on the spectrum of the granularity of data movement. OAT and BULK differ in two respects: first, OAT uses very little additional disk space (at most one extra page), whereas BULK uses a large amount of disk space. Second, BULK uses sequential prefetch I/O to optimize on the number of I/Os during index modification, while OAT does not. Using an experimental testbed, we show that BULK is an order of magnitude faster than OAT. In terms of the impact on transaction performance during reorganization, BULK and OAT perform differently: when the number of indexes to be modified is either one or two, OAT has a lesser impact on the transaction performance degradation. However, when the number of indexes is greater than two, both techniques have the same impact on transaction performance. SIGMOD Conference Query Caching and Optimization in Distributed Mediator Systems. Sibel Adali,K. Selçuk Candan,Yannis Papakonstantinou,V. S. Subrahmanian 1996 "Query processing and optimization in mediator systems that access distributed non-proprietary sources pose many novel problems. 
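The super-scalar sort entry above argues for eliminating compare-and-branch sequences from the sort inner loop. The sketch below is a textbook least-significant-digit radix sort on fixed-width byte keys, shown only to illustrate what compare-free sorting looks like; it is not the algorithm from that paper, and the sample keys are made up.

```python
# Illustrative only: a compare-free LSD radix sort on fixed-width byte keys.
def radix_sort(records, key_len):
    """Sort (key, payload) pairs whose keys are byte strings of length key_len."""
    for pos in reversed(range(key_len)):          # least-significant byte first
        counts = [0] * 256
        for key, _ in records:                    # histogram pass: no key compares
            counts[key[pos]] += 1
        offsets, total = [0] * 256, 0
        for b in range(256):                      # prefix sums give bucket starts
            offsets[b], total = total, total + counts[b]
        out = [None] * len(records)
        for rec in records:                       # stable scatter pass
            b = rec[0][pos]
            out[offsets[b]] = rec
            offsets[b] += 1
        records = out
    return records

if __name__ == "__main__":
    data = [(b"cab", 1), (b"abc", 2), (b"bca", 3), (b"abb", 4)]
    print(radix_sort(data, 3))   # keys come out in byte order without any compares
```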
Cost-based query optimization is hard because the mediator does not have access to source statistics information and furthermore it may not be easy to model the source's performance. At the same time, querying remote sources may be very expensive because of high connection overhead, long computation time, financial charges, and temporary unavailability. We propose a cost-based optimization technique that caches statistics of actual calls to the sources and consequently estimates the cost of the possible execution plans based on the statistics cache. We investigate issues pertaining to the design of the statistics cache and experimentally analyze various tradeoffs. We also present a query result caching mechanism that allows us to effectively use results of prior queries when the source is not readily available. We employ the novel invariants mechanism, which shows how semantic information about data sources may be used to discover cached query results of interest." SIGMOD Conference Data Mining Techniques. Jiawei Han 1996 Data mining, or knowledge discovery in databases, has been popularly recognized as an important research issue with broad applications. We provide a comprehensive survey, from a database perspective, of the data mining techniques developed recently. Several major kinds of data mining methods, including generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization, and meta-rule guided mining, will be reviewed. Techniques for mining knowledge in different kinds of databases, including relational, transaction, object-oriented, spatial, and active databases, as well as global information systems, will be examined. Potential data mining applications and some research issues will also be discussed. SIGMOD Conference Repository System Engineering. Philip A. Bernstein 1996 Repository System Engineering. SIGMOD Conference BeSS: Storage Support for Interactive Visualization Systems. "Alexandros Biliris,Thomas A. Funkhouser,William O'Connell,Euthimios Panagos" 1996 BeSS: Storage Support for Interactive Visualization Systems. SIGMOD Conference Data Access for the Masses through OLE DB. José A. Blakeley 1996 "This paper presents an overview of OLE DB, a set of interfaces being developed at Microsoft whose goal is to enable applications to have uniform access to data stored in DBMS and non-DBMS information containers. Applications will be able to take advantage of the benefits of database technology without having to transfer data from its place of origin to a DBMS. Our approach consists of defining an open, extensible collection of interfaces that factor and encapsulate orthogonal, reusable portions of DBMS functionality. These interfaces define the boundaries of DBMS components such as record containers, query processors, and transaction coordinators that enable uniform, transactional access to data among such components. The proposed interfaces extend Microsoft's OLE/COM object services framework with database functionality, hence these interfaces are collectively referred to as OLE DB. The OLE DB functional areas include data access and updates (rowsets), query processing, schema information, notifications, transactions, security, and access to remote data. In a sense, OLE DB represents an effort to bring database technology to the masses. This paper presents an overview of the OLE DB approach and its areas of componentization." SIGMOD Conference An Open Storage System for Abstract Objects. 
Stephen Blott,Lukas Relly,Hans-Jörg Schek 1996 An Open Storage System for Abstract Objects. SIGMOD Conference HyperStorM - Administering Structured Documents Using Object-Oriented Database Technology. Klemens Böhm,Karl Aberer 1996 HyperStorM - Administering Structured Documents Using Object-Oriented Database Technology. SIGMOD Conference Goal-Oriented Buffer Management Revisited. Kurt P. Brown,Michael J. Carey,Miron Livny 1996 In this paper we revisit the problem of achieving multi-class workload response time goals by automatically adjusting the buffer memory allocations of each workload class. We discuss the virtues and limitations of previous work with respect to a set of criteria we lay out for judging the success of any goal-oriented resource allocation algorithm. We then introduce the concept of hit rate concavity and develop a new goal-oriented buffer allocation algorithm, called Class Fencing, that is based on this concept. Exploiting the notion of hit rate concavity results in an algorithm that not only is as accurate and stable as our previous work, but also more responsive, more robust, and simpler to implement. SIGMOD Conference A Query Language and Optimization Techniques for Unstructured Data. Peter Buneman,Susan B. Davidson,Gerd G. Hillebrand,Dan Suciu 1996 "A new kind of data model has recently emerged in which the database is not constrained by a conventional schema. Systems like ACeDB, which has become very popular with biologists, and the recent Tsimmis proposal for data integration organize data in tree-like structures whose components can be used equally well to represent sets and tuples. Such structures allow great flexibility in data representation. What query language is appropriate for such structures? Here we propose a simple language UnQL for querying data organized as a rooted, edge-labeled graph. In this model, relational data may be represented as fixed-depth trees, and on such trees UnQL is equivalent to the relational algebra. The novelty of UnQL consists in its programming constructs for arbitrarily deep data and for cyclic structures. While strictly more powerful than query languages with path expressions like XSQL, UnQL can still be efficiently evaluated. We describe new optimization techniques for the deep or ""vertical"" dimension of UnQL queries. Furthermore, we show that known optimization techniques for operators on flat relations apply to the ""horizontal"" dimension of UnQL." SIGMOD Conference Optimizing Queries over Multimedia Repositories. Surajit Chaudhuri,Luis Gravano 1996 Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A selection on these attributes will typically produce not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). Also, multimedia repositories may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of queries over multimedia repositories. A key issue is the choice of the indexes used to search the repository. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we solve the problem efficiently when the predicates in the query are independent. 
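The multimedia-repository entry above contrasts filtering with ranking: each atomic predicate returns a grade of match, and a selection combines grades and returns the best objects. The toy sketch below illustrates only that query model; the combining function (min), the attributes, and the data are assumptions for illustration, and none of the paper's index-selection or cost-based optimization is reproduced here.

```python
# Toy "filtering plus ranking" selection: combine per-predicate grades with min
# and return the top-k objects by combined grade.
def top_k(objects, grade_fns, k):
    scored = []
    for obj in objects:
        grades = [fn(obj) for fn in grade_fns]   # one grade in [0,1] per predicate
        scored.append((min(grades), obj))        # conservative combination
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

if __name__ == "__main__":
    photos = [{"id": 1, "redness": 0.9, "caption_match": 0.4},
              {"id": 2, "redness": 0.5, "caption_match": 0.8},
              {"id": 3, "redness": 0.2, "caption_match": 0.9}]
    print(top_k(photos,
                [lambda o: o["redness"], lambda o: o["caption_match"]],
                k=2))
```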
We also show that the problem of optimizing queries that ask for a few top-ranked objects can be viewed, in many cases, as that of evaluating selection conditions. Thus, both problems can be viewed together as an extended filtering problem. SIGMOD Conference Change Detection in Hierarchically Structured Information. Sudarshan S. Chawathe,Anand Rajaraman,Hector Garcia-Molina,Jennifer Widom 1996 "Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hierarchically structured data. Since in many cases changes must be computed from old and new versions of the data, we define the hierarchical change detection problem as the problem of finding a ""minimum-cost edit script"" that transforms one data tree to another, and we present efficient algorithms for computing such an edit script. Our algorithms make use of some key domain characteristics to achieve substantially better performance than previous, general-purpose algorithms. We study the performance of our algorithms both analytically and empirically, and we describe the application of our techniques to hierarchically structured documents." SIGMOD Conference Rule Languages and Internal Algebras for Rule-Based Optimizers. Mitch Cherniack,Stanley B. Zdonik 1996 "Rule-based optimizers and optimizer generators use rules to specify query transformations. Rules act directly on query representations, which typically are based on query algebras. But most algebras complicate rule formulation, and rules over these algebras must often resort to calling externally defined bodies of code. Code makes rules difficult to formulate, prove correct and reason about, and therefore compromises the effectiveness of rule-based systems. In this paper we present KOLA: a combinator-based algebra designed to simplify rule formulation. KOLA is not a user language, and KOLA's variable-free queries are difficult for humans to read. But KOLA is an effective internal algebra because its combinator style makes queries manipulable and structurally revealing. As a result, rules over KOLA queries are easily expressed without the need for supplemental code. We illustrate this point, first by showing some transformations that, despite their simplicity, require head and body routines when expressed over algebras that include variables. We show that these transformations are expressible without supplemental routines in KOLA. We then show complex transformations of a class of nested queries expressed over KOLA. Nested query optimization, while having been studied before, has seriously challenged the rule-based paradigm." SIGMOD Conference Prospector: A Content-Based Multimedia Server for Massively Parallel Architectures. "S. Choo,William O'Connell,G. Linderman,H. Chen,K. Ganapathi,Alexandros Biliris,Euthimios Panagos,David Schrader" 1996 The Prospector Multimedia Object Manager prototype is a general-purpose content analysis multimedia server designed for massively parallel processor environments. Prospector defines and manipulates user defined functions which are invoked in parallel to analyze/manipulate the contents of multimedia objects. Several computationally intensive applications of this technology based on large persistent datasets include: fingerprint matching, signature verification, face recognition, and speech recognition/translation [OIS96]. 
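The hierarchical change-detection entry above casts diffing as finding a minimum-cost edit script between two trees. The deliberately naive sketch below only shows what an edit script looks like: it matches children by position and emits insert/delete/update operations. The paper's contribution is computing a minimum-cost script with much better node matching, which is not attempted here; the tree encoding is an assumption for illustration.

```python
# Naive edit script between two labeled, ordered trees (not a minimum-cost diff).
def diff(old, new, path=()):
    """Trees are (label, [children]) tuples; returns a list of edit operations."""
    ops = []
    if old is None:
        return [("insert", path, new)]
    if new is None:
        return [("delete", path, old)]
    if old[0] != new[0]:
        ops.append(("update", path, old[0], new[0]))
    for i in range(max(len(old[1]), len(new[1]))):
        o = old[1][i] if i < len(old[1]) else None
        n = new[1][i] if i < len(new[1]) else None
        ops.extend(diff(o, n, path + (i,)))      # recurse position by position
    return ops

if __name__ == "__main__":
    t1 = ("doc", [("sec", [("para", [])]), ("sec", [])])
    t2 = ("doc", [("sec", [("title", [])]), ("appendix", [])])
    for op in diff(t1, t2):
        print(op)
```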
SIGMOD Conference Evaluating Queries with Generalized Path Expressions. Vassilis Christophides,Sophie Cluet,Guido Moerkotte 1996 In the past few years, query languages featuring generalized path expressions have been proposed. These languages allow the interrogation of both data and structure. They are powerful and essential for a number of applications. However, until now, their evaluation has relied on a rather naive and inefficient algorithm. In this paper, we extend an object algebra with two new operators and present some interesting rewriting techniques for queries featuring generalized path expressions. We also show how a query optimizer can integrate the new techniques. SIGMOD Conference Algorithms for Deferred View Maintenance. Latha S. Colby,Timothy Griffin,Leonid Libkin,Inderpal Singh Mumick,Howard Trickey 1996 Materialized views and view maintenance are important for data warehouses, retailing, banking, and billing applications. We consider two related view maintenance problems: 1) how to maintain views after the base tables have already been modified, and 2) how to minimize the time for which the view is inaccessible during maintenance. Typically, a view is maintained immediately, as a part of the transaction that updates the base tables. Immediate maintenance imposes a significant overhead on update transactions that cannot be tolerated in many applications. In contrast, deferred maintenance allows a view to become inconsistent with its definition. A refresh operation is used to reestablish consistency. We present new algorithms to incrementally refresh a view during deferred maintenance. Our algorithms avoid a state bug that has artificially limited techniques previously used for deferred maintenance. Incremental deferred view maintenance requires auxiliary tables that contain information recorded since the last view refresh. We present three scenarios for the use of auxiliary tables and show how these impact per-transaction overhead and view refresh time. Each scenario is described by an invariant that is required to hold in all database states. We then show that, with the proper choice of auxiliary tables, it is possible to lower both per-transaction overhead and view refresh time. SIGMOD Conference Semi-automatic, Self-adaptive Control of Garbage Collection Rates in Object Databases. Jonathan E. Cook,Artur Klauser,Alexander L. Wolf,Benjamin G. Zorn 1996 "A fundamental problem in automating object database storage reclamation is determining how often to perform garbage collection. We show that the choice of collection rate can have a significant impact on application performance and that the ""best"" rate depends on the dynamic behavior of the application, tempered by the particular performance goals of the user. We describe two semi-automatic, self-adaptive policies for controlling collection rate that we have developed to address the problem. Using trace-driven simulations, we evaluate the performance of the policies on a test database application that demonstrates two distinct reclustering behaviors. Our results show that the policies are effective at achieving user-specified levels of I/O operations and database garbage percentage. We also investigate the sensitivity of the policies over a range of object connectivities. The evaluation demonstrates that semi-automatic, self-adaptive policies are a practical means for flexibly controlling garbage collection rate." SIGMOD Conference METU Interoperable Database System. 
Asuman Dogac,Ugur Halici,Ebru Kilic,Gökhan Özhan,Fatma Ozcan,Sena Nural,Cevdet Dengi,Sema Mancuhan,Ismailcem Budak Arpinar,Pinar Koksal,Cem Evrendilek 1996 METU Interoperable Database System. SIGMOD Conference Structures for Manipulating Proposed Updates in Object-Oriented Databases. Michael Doherty,Richard Hull,Mohammed Rupawalla 1996 "Support for virtual states and deltas between them is useful for a variety of database applications, including hypothetical database access, version management, simulation, and active databases. The Heraclitus paradigm elevates delta values to be ""first-class citizens"" in database programming languages, so that they can be explicitly created, accessed and manipulated. A fundamental issue concerns the trade-off between the ""accuracy"" or ""robustness"" of a form of delta representation, and the ease of access and manipulation of that form. At one end of the spectrum, code-blocks could be used to represent delta values, resulting in a more accurate capture of the intended meaning of a proposed update, at the cost of more expensive access and manipulation. In the context of object-oriented databases, another point on the spectrum is ""attribute-granularity"" deltas which store the net changes to each modified attribute value of modified objects. This paper introduces a comprehensive framework for specifying a broad array of forms for representing deltas for complex value types (tuple, set, bag, list, o-set and dictionary). In general, the granularity of such deltas can be arbitrarily deep within the complex value structure. Applications of this framework in connection with hypothetical access to, and ""merging"" of, proposed updates are discussed." SIGMOD Conference The Ins and Outs (and Everthing in Between) of Data Warehousing. Phillip M. Fernandez,Donovan A. Schneider 1996 The Ins and Outs (and Everthing in Between) of Data Warehousing. SIGMOD Conference Performance Tradeoffs for Client-Server Query Processing. Michael J. Franklin,Björn Þór Jónsson,Donald Kossmann 1996 "The construction of high-performance database systems that combine the best aspects of the relational and object-oriented approaches requires the design of client-server architectures that can fully exploit client and server resources in a flexible manner. The two predominant paradigms for client-server query execution are data-shipping and query-shipping. We first define these policies in terms of the restrictions they place on operator site selection during query optimization. We then investigate the performance tradeoffs between them for bulk query processing. While each strategy has advantages, neither one on its own is efficient across a wide range of circumstances. We describe and evaluate a more flexible policy called hybrid-shipping, which can execute queries at clients, servers, or any combination of the two. Hybrid-shipping is shown to at least match the best of the two ""pure"" policies, and in some situations, to perform better than both. The implementation of hybrid-shipping raises a number of difficult problems for query optimization. We describe an initial investigation into the use of a 2-step query optimization strategy as a way of addressing these issues." SIGMOD Conference Data Mining Using Two-Dimensional Optimized Accociation Rules: Scheme, Algorithms, and Visualization. Takeshi Fukuda,Yasuhiko Morimoto,Shinichi Morishita,Takeshi Tokuyama 1996 "We discuss data mining based on association rules for two numeric attributes and one Boolean attribute. 
For example, in a database of bank customers, ""Age"" and ""Balance"" are two numeric attributes, and ""CardLoan"" is a Boolean attribute. Taking the pair (Age, Balance) as a point in two-dimensional space, we consider an association rule of the form ((Age, Balance) ∈ P) ⇒ (CardLoan = Yes), which implies that bank customers whose ages and balances fall in a planar region P tend to use card loan with a high probability. We consider two classes of regions, rectangles and admissible (i.e. connected and x-monotone) regions. For each class, we propose efficient algorithms for computing the regions that give optimal association rules for gain, support, and confidence, respectively. We have implemented the algorithms for admissible regions, and constructed a system for visualizing the rules." SIGMOD Conference SONAR: System for Optimized Numeric AssociationRules. Takeshi Fukuda,Yasuhiko Morimoto,Shinichi Morishita,Takeshi Tokuyama 1996 SONAR: System for Optimized Numeric AssociationRules. SIGMOD Conference Is GUI Programming a Database Research Problem? Nita Goyal,Charles Hoch,Ravi Krishnamurthy,Brian Meckler,Michael Suckow 1996 Programming nontrivial GUI applications is currently an arduous task. Just as the use of a declarative language simplified the programming of database applications, we ask whether we can do the same for GUI programming? Can we then import a large body of knowledge from database research? We answer these questions by describing our experience in building nontrivial GUI applications initially using C++ programming and subsequently using Logic++, a higher order Horn clause logic language on complex objects with object-oriented features. We abstract a GUI application as a set of event handlers. Each event handler can be conceptualized as a transition from the old screen/program state to a new screen/program state. We use a data centric view of the screen/program state (i.e., every entity on the screen corresponds to a proxy datum in the program) and express each event handler as a query dependent update, albeit a complicated one. To express such complicated updates we use Logic++. The proxy data are expressed as derived views that are materialized on the screen. Therefore, the system must be active in maintaining these materialized views. Consequently, each event handler is conceptually an update followed by a fixpoint computation of the proxy data. Based on our experience in building the GUI system, we observe that many database techniques such as view maintenance, active DB, concurrency control, recovery, optimization as well as language concepts such as higher order logic are useful in the context of GUI programming. SIGMOD Conference Spatial Hash-Joins. Ming-Ling Lo,Chinya V. Ravishankar 1996 We examine how to apply the hash-join paradigm to spatial joins, and define a new framework for spatial hash-joins. Our spatial partition functions have two components: a set of bucket extents and an assignment function, which may map a data item into multiple buckets. Furthermore, the partition functions for the two input datasets may be different. We have designed and tested a spatial hash-join method based on this framework. The partition function for the inner dataset is initialized by sampling the dataset, and evolves as data are inserted. The partition function for the outer dataset is immutable, but may replicate a data item from the outer dataset into multiple buckets. The method mirrors relational hash-joins in other aspects. Our method needs no pre-computed indices. 
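The spatial hash-join entry above describes partition functions built from bucket extents plus an assignment function that may map one rectangle into several buckets. The sketch below uses a fixed regular grid as the bucket extents and a build/probe loop with duplicate elimination; the grid granularity, the overlap test on MBRs, and the data are assumptions for illustration and do not reflect the paper's sampled, evolving partition function.

```python
# Minimal grid-partitioned spatial join on minimum bounding rectangles (MBRs).
def cells(rect, grid):
    (x1, y1, x2, y2) = rect
    for cx in range(int(x1 // grid), int(x2 // grid) + 1):
        for cy in range(int(y1 // grid), int(y2 // grid) + 1):
            yield (cx, cy)

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_hash_join(outer, inner, grid=10.0):
    buckets = {}
    for i, r in enumerate(inner):                 # build phase on the inner input
        for c in cells(r, grid):
            buckets.setdefault(c, []).append(i)
    result = set()
    for j, s in enumerate(outer):                 # probe phase; outer items may be
        for c in cells(s, grid):                  # replicated across buckets
            for i in buckets.get(c, ()):
                if overlaps(s, inner[i]):
                    result.add((j, i))            # set() removes duplicate pairs
    return result

if __name__ == "__main__":
    outer = [(0, 0, 12, 5), (30, 30, 35, 35)]
    inner = [(11, 4, 20, 20), (100, 100, 110, 110)]
    print(spatial_hash_join(outer, inner))        # {(0, 0)}
```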
It is therefore applicable to a wide range of spatial joins. Our experiments show that our method outperforms current spatial join algorithms based on tree matching by a wide margin. Further, its performance is superior even when the tree-based methods have pre-computed indices. This makes the spatial hash-join method highly competitive both when the input datasets are dynamically generated and when the datasets have pre-computed indices. SIGMOD Conference Bifocal Sampling for Skew-Resistant Join Size Estimation. Sumit Ganguly,Phillip B. Gibbons,Yossi Matias,Abraham Silberschatz 1996 This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√n lg n). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective. SIGMOD Conference Multi-dimensional Resource Scheduling for Parallel Queries. Minos N. Garofalakis,Yannis E. Ioannidis 1996 Scheduling query execution plans is an important component of query optimization in parallel database systems. The problem is particularly complex in a shared-nothing execution environment, where each system node represents a collection of time-shareable resources (e.g., CPU(s), disk(s), etc.) and communicates with other nodes only by message-passing. Significant research effort has concentrated on only a subset of the various forms of intra-query parallelism so that scheduling and synchronization is simplified. In addition, most previous work has focused its attention on one-dimensional models of parallel query scheduling, effectively ignoring the potential benefits of resource sharing. In this paper, we develop an approach that is more general in both directions, capturing all forms of intra-query parallelism and exploiting sharing of multi-dimensional resource nodes among concurrent plan operators. This allows scheduling a set of independent query tasks (i.e., operator pipelines) to be seen as an instance of the multi-dimensional bin-design problem. Using a novel quantification of coarse grain parallelism, we present a list scheduling heuristic algorithm that is provably near-optimal in the class of coarse grain parallel executions (with a worst-case performance ratio that depends on the number of resources per node and the granularity parameter). We then extend this algorithm to handle the operator precedence constraints in a bushy query plan by splitting the execution of the plan into synchronized phases. 
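The bifocal-sampling entry above estimates the equi-join size, which is the quantity sum over values v of f_R(v) * f_S(v), after classifying values as sparse or dense by frequency. The sketch below computes that quantity exactly from full frequency counts and shows the frequency-based classification; the paper's point is doing this from small samples with provable bounds, which is not reproduced here, and the threshold is an arbitrary assumption.

```python
# Exact equi-join size from value frequencies, plus a sparse/dense split.
from collections import Counter

def join_size(r_values, s_values):
    fr, fs = Counter(r_values), Counter(s_values)
    return sum(fr[v] * fs[v] for v in fr if v in fs)   # sum of frequency products

def classify(values, threshold=3):
    freq = Counter(values)
    dense = {v for v, c in freq.items() if c >= threshold}
    return dense, set(freq) - dense                     # (dense, sparse) values

if __name__ == "__main__":
    r = [1, 1, 1, 2, 3, 3]
    s = [1, 1, 2, 2, 2, 4]
    print(join_size(r, s))            # 3*2 + 1*3 = 9 joining tuple pairs
    print(classify(r), classify(s))
```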
Preliminary performance results confirm the effectiveness of our scheduling algorithm compared both to previous approaches and the optimal solution. Finally, we present a technique that allows us to relax the coarse granularity restriction and obtain a list scheduling method that is provably near-optimal in the space of all possible parallel schedules. SIGMOD Conference The Dangers of Replication and a Solution. "Jim Gray,Pat Helland,Patrick E. O'Neil,Dennis Shasha" 1996 Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traffic gives a thousand-fold increase in deadlocks or reconciliations. Master copy replication (primary copy) schemes reduce this problem. A simple analytic model demonstrates these results. A new two-tier replication algorithm is proposed that allows mobile (disconnected) applications to propose tentative update transactions that are later applied to a master copy. Commutative update transactions avoid the instability of other replication schemes. SIGMOD Conference SQL Query Optimization: Reordering for a General Class of Queries. Piyush Goel,Balakrishna R. Iyer 1996 The strength of commercial query optimizers like DB2 comes from their ability to select an optimal order by generating all equivalent reorderings of binary operators. However, there are no known methods to generate all equivalent reorderings for a SQL query containing joins, outer joins, and groupby aggregations. Consequently, some of the reorderings with significantly lower cost may be missed. Using a hypergraph model and a set of novel identities, we propose a method to reorder a SQL query containing joins, outer joins, and groupby aggregations. While these operators are sufficient to capture the SQL semantics, it is during their reordering that we identify a powerful primitive needed for a DBMS. We report our findings of a simple, yet fundamental operator, generalized selection, and demonstrate its power to solve the problem of reordering of SQL queries containing joins, outer joins, and groupby aggregations. SIGMOD Conference DBMiner: Interactive Mining of Multiple-Level Knowledge in Relational Databases. Jiawei Han,Yongjian Fu,Wei Wang,Jenny Chiang,Osmar R. Zaïane,Krzysztof Koperski 1996 Based on our years of research, a data mining system, DB-Miner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance. SIGMOD Conference Implementing Data Cubes Efficiently. Venky Harinarayan,Anand Rajaraman,Jeffrey D. Ullman 1996 Decision support applications involve complex queries on very large databases. Since response times should be small, query optimization is critical. Users typically view the data as multidimensional data cubes. Each cell of the data cube is a view consisting of an aggregation of interest, like total sales. The values of many of these cells are dependent on the values of other cells in the data cube. 
A common and powerful query optimization technique is to materialize some or all of these cells rather than compute them from raw data each time. Commercial systems differ mainly in their approach to materializing the data cube. In this paper, we investigate the issue of which cells (views) to materialize when it is too expensive to materialize all views. A lattice framework is used to express dependencies among views. We present greedy algorithms that work off this lattice and determine a good set of views to materialize. The greedy algorithm performs within a small constant factor of optimal under a variety of models. We then consider the most common case of the hypercube lattice and examine the choice of materialized views for hypercubes in detail, giving some good tradeoffs between the space used and the average time to answer a query. SIGMOD Conference Query Execution Techniques for Caching Expensive Methods. Joseph M. Hellerstein,Jeffrey F. Naughton 1996 "Object-Relational and Object-Oriented DBMSs allow users to invoke time-consuming (""expensive"") methods in their queries. When queries containing these expensive methods are run on data with duplicate values, time is wasted redundantly computing methods on the same value. This problem has been studied in the context of programming languages, where ""memoization"" is the standard solution. In the database literature, sorting has been proposed to deal with this problem. We compare these approaches along with a third solution, a variant of unary hybrid hashing which we call Hybrid Cache. We demonstrate that Hybrid Cache always dominates memoization, and significantly outperforms sorting in many instances. This provides new insights into the tradeoff between hashing and sorting for unary operations. Additionally, our Hybrid Cache algorithm includes some new optimization for unary hybrid hashing, which can be used for other applications such as grouping and duplicate elimination. We conclude with a discussion of techniques for caching multiple expensive methods in a single query, and raise some new optimization problems in choosing caching techniques." SIGMOD Conference Random I/O Scheduling in Online Tertiary Storage Systems. Bruce Hillyer,Abraham Silberschatz 1996 New database applications that require the storage and retrieval of many terabytes of data are reaching the limits for disk-based storage systems, in terms of both cost and scalability. These limits provide a strong incentive for the development of databases that augment disk storage with technologies better suited to large volumes of data. In particular, the seamless incorporation of tape storage into database systems would be of great value. Tape storage is two orders of magnitude more efficient than disk in terms of cost per terabyte and physical volume per terabyte; however, a key problem is that the random access latency of tape is three to four orders of magnitude slower than disk. Thus, to incorporate a tape bulk store in an online storage system, the problem of tape access latency must be solved. One approach to reducing the latency is careful I/O scheduling. The focus of this paper is on efficient random I/O scheduling for tape drives that use a serpentine track layout, such as the Quantum DLT and the IBM 3480 and 3590. For serpentine tape, I/O scheduling is problematic because of the complex relationships between logical block numbers, their physical positions on tape, and the time required for tape positioning between these physical positions. 
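The data-cube entry above selects which views of the lattice to materialize with a greedy algorithm: at each step, pick the view whose materialization most reduces the total cost of answering all groupings, where a grouping's cost is the size of its smallest materialized ancestor. The sketch below is a compact rendering of that greedy idea; the lattice, view sizes, and the number of views to pick are toy assumptions, and the paper's benefit analysis and hypercube-specific results are not reproduced.

```python
# Greedy selection of k views to materialize over a data-cube lattice.
def greedy_views(sizes, ancestors, top, k):
    """sizes: view -> row count; ancestors: view -> views it can be answered from
    (including itself); top: the base cuboid, always materialized."""
    materialized = {top}

    def cost(view):
        return min(sizes[a] for a in ancestors[view] if a in materialized)

    for _ in range(k):
        best, best_gain = None, 0
        for v in sizes:
            if v in materialized:
                continue
            gain = sum(max(cost(w) - sizes[v], 0)      # saving for each grouping w
                       for w in sizes if v in ancestors[w])
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        materialized.add(best)
    return materialized

if __name__ == "__main__":
    sizes = {"psc": 6_000_000, "ps": 800_000, "pc": 6_000_000,
             "sc": 6_000_000, "p": 200_000, "s": 10_000, "c": 100}
    ancestors = {"psc": {"psc"}, "ps": {"ps", "psc"}, "pc": {"pc", "psc"},
                 "sc": {"sc", "psc"}, "p": {"p", "ps", "pc", "psc"},
                 "s": {"s", "ps", "sc", "psc"}, "c": {"c", "pc", "sc", "psc"}}
    print(greedy_views(sizes, ancestors, top="psc", k=2))
```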
The results in this paper show that our scheduling schemes provide a significant improvement in the latency of random access to serpentine tape. SIGMOD Conference A Framework for Supporting Data Integration Using the Materialized and Virtual Approaches. Richard Hull,Gang Zhou 1996 "This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called Squirrel integration mediators. These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. In the Squirrel mediators, a relation in the integrated view can be supported as (a) fully materialized, (b) fully virtual, or (c) partially materialized (i.e., with some attributes materialized and other attributes virtual). In general, (partially) materialized relations of the integrated view are maintained by incremental updates from the source databases. Squirrel mediators provide two approaches for doing this: (1) materialize all needed auxiliary data, so that data sources do not have to be queried when processing the incremental updates; or (2) leave some or all of the auxiliary data virtual, and query selected source databases when processing incremental updates. The paper presents formal notions of consistency and ""freshness"" for integrated views defined over multiple autonomous source databases. It is shown that Squirrel mediators satisfy these properties." SIGMOD Conference CapBasED-AMS: A Capability-based and Event-driven Activity Management System. Patrick C. K. Hung,Helen P. Yeung,Kamalakar Karlapalem 1996 CapBasED-AMS: A Capability-based and Event-driven Activity Management System. SIGMOD Conference Databases and Visualization. Daniel A. Keim 1996 Databases and Visualization. SIGMOD Conference Estimating Alphanumeric Selectivity in the Presence of Wildcards. P. Krishnan,Jeffrey Scott Vitter,Balakrishna R. Iyer 1996 The success of commercial query optimizers and database management systems (object-oriented or relational) depends on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL like predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating alphanumeric selectivity. In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate. We evaluate our methods empirically in the context of the TPC-D benchmark. 
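The alphanumeric-selectivity entry above precomputes a suffix-tree summary of string columns and uses it to cost predicates such as name LIKE '%ish%'. The sketch below is a much cruder stand-in for the same idea: it records, for substrings up to a small length bound, how many rows contain them, and answers estimates by lookup. The length bound and fallback value are assumptions; the paper's pruned suffix-tree structure and its three estimation families are not reproduced.

```python
# Crude substring-statistics summary for estimating LIKE '%pattern%' selectivity.
from collections import defaultdict

def build_stats(column, max_len=3):
    counts = defaultdict(int)
    for value in column:
        seen = set()
        for i in range(len(value)):
            for j in range(i + 1, min(i + max_len, len(value)) + 1):
                seen.add(value[i:j])
        for sub in seen:                  # count each row at most once per substring
            counts[sub] += 1
    return counts, len(column)

def estimate_like(counts, n, pattern, default=0.5):
    """Selectivity of LIKE '%pattern%' for patterns no longer than max_len."""
    if pattern in counts:
        return counts[pattern] / n
    return default                        # unseen substring: fall back to a guess

if __name__ == "__main__":
    stats, n = build_stats(["krishnan", "vitter", "iyer", "fisher"])
    print(estimate_like(stats, n, "ish"))   # 0.5: two of the four rows contain "ish"
```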
We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise. SIGMOD Conference DBSim: A Simulation Tool for Predicting Database Performance. Mike Lefler,Mark Stokrp,Craig Wong 1996 DBSim: A Simulation Tool for Predicting Database Performance. SIGMOD Conference A Query Language for Multidimensional Arrays: Design, Implementation, and Optimization Techniques. Leonid Libkin,Rona Machlin,Limsoon Wong 1996 While much recent research has focussed on extending databases beyond the traditional relational model, relatively little has been done to develop database tools for querying data organized in (multidimensional) arrays. The scientific computing community has made little use of available database technology. Instead, multidimensional scientific data is typically stored in local files conforming to various data exchange formats and queried via specialized access libraries tied in to general purpose programming languages. To allow such data to be queried using known database techniques, we design and implement a query language for multidimensional arrays. Our main design decision is to treat arrays as functions from index sets to values rather than as collection types. This leads to clean syntax and semantics as well as simple but powerful optimization rules. We present a calculus for arrays that extends standard calculi for complex objects. We derive a higher-level comprehension style query language based on this calculus and describe its implementation, including a data driver for the NetCDF data exchange format. Next, we explore some optimization rules obtained from the equational laws of our core calculus. Finally, we study the expressiveness of our calculus and prove that it essentially corresponds to adding ranking to a query language for complex objects. SIGMOD Conference Safe and Efficient Sharing of Persistent Objects in Thor. Barbara Liskov,Atul Adya,Miguel Castro,Mark Day,Sanjay Ghemawat,Robert Gruber,Umesh Maheshwari,Andrew C. Myers,Liuba Shrira 1996 Safe and Efficient Sharing of Persistent Objects in Thor. SIGMOD Conference Towards Effective and Efficient Free Space Management. Mark L. McAuliffe,Michael J. Carey,Marvin H. Solomon 1996 "An important problem faced by many database management systems is the ""online object placement problem""--the problem of choosing a disk page to hold a newly allocated object. In the absence of clustering criteria, the goal is to maximize storage utilization. For main-memory based systems, simple heuristics exist that provide reasonable space utilization in the worst case and excellent utilization in typical cases. However, the storage management problem for databases includes significant additional challenges, such as minimizing I/O traffic, coping with crash recovery, and gracefully integrating space management with locking and logging. We survey several object placement algorithms, including techniques that can be found in commercial and research database systems. We then present a new object placement algorithm that we have designed for use in Shore, an object-oriented database system under development at the University of Wisconsin--Madison. Finally, we present results from a series of experiments involving actual Shore implementations of some of these algorithms. 
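The free-space-management entry above frames the online object placement problem: choose a disk page for a newly allocated object while keeping utilization high. The sketch below is only a first-fit baseline over a free-space map; page size, the growth policy, and the data are assumptions, and it deliberately ignores the I/O, locking, logging, and recovery concerns that the entry says real placement must handle.

```python
# First-fit object placement over a free-space map (page id -> free bytes).
PAGE_SIZE = 8192

def place(free_space, obj_size):
    """Returns the page id chosen for an object of obj_size bytes."""
    for page, free in free_space.items():         # first page with enough room
        if free >= obj_size:
            free_space[page] = free - obj_size
            return page
    new_page = max(free_space, default=-1) + 1     # otherwise grow the file
    free_space[new_page] = PAGE_SIZE - obj_size
    return new_page

if __name__ == "__main__":
    fsm = {0: 100, 1: 4000, 2: 8000}
    print(place(fsm, 2500), fsm)   # fits on page 1
    print(place(fsm, 7000), fsm)   # fits on page 2
    print(place(fsm, 5000), fsm)   # no page has room: a new page is allocated
```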
Our results show that while current object placement algorithms have serious performance deficiencies, including excessive CPU or main memory overhead, I/O traffic, or poor disk utilization, our new algorithm consistently provides excellent performance in all of these areas." SIGMOD Conference Hot Mirroring: A Study to Hide Parity Upgrade Penalty and Degradations During Rebuilds for RAID5. Kazuhiko Mogi,Masaru Kitsuregawa 1996 Hot Mirroring: A Study to Hide Parity Upgrade Penalty and Degradations During Rebuilds for RAID5. SIGMOD Conference State of the Art in Workflow Management Research and Products. C. Mohan 1996 In the last few years, workflow management has become a hot topic in the research community and, especially, in the commercial arena. Workflow management is multidisciplinary in nature encompassing many aspects of computing: database management, distributed client-server systems, transaction management, mobile computing, business process reengineering, integration of legacy and new applications, and heterogeneity of hardware and software. Many academic and industrial research projects are underway. Numerous successful products have been released. Standardization efforts are in progress under the auspices of the Workflow Management Coalition. As has happened in the RDBMS area with respect to some topics, in the workflow area also, some of the important real-life problems faced by customers and product developers are not being tackled by researchers. This tutorial will survey the state of the art in workflow management research and products. SIGMOD Conference Maintaining Database Consistency in Presence of Value Dependencies in Multidatabase Systems. Claire Morpain,Michèle Cart,Jean Ferrié,Jean-François Pons 1996 The emergence of new criteria specifically adapted to multidatabase systems, in response to constraints imposed by global serializability, leads to restrictive hypotheses in order to ensure correctness of executions. This is the case with the two-level serializability presented in [6], which ensures strongly correct executions if transaction programs are Local Database Preserving (LDP). The main drawback of the LDP hypothesis is that it relies on rigorous programming. The principal objective of this paper has been to remove this drawback while preserving the strong correctness of 2LSR executions. We propose defining precisely the notion of value dependencies, and managing them so as not to impose the LDP property. SIGMOD Conference Accessing Relational Databases from the World Wide Web. Tam Nguyen,V. Srinivasan 1996 With the growing popularity of the internet and the World Wide Web (Web), there is a fast-growing demand for access to database management systems (DBMS) from the Web. We describe here techniques that we invented to bridge the gap between HTML, the standard markup language of the Web, and SQL, the standard query language used to access relational DBMS. We propose a flexible general purpose variable substitution mechanism that provides cross-language variable substitution between HTML input and SQL query strings as well as between SQL result rows and HTML output, thus enabling the application developer to use the full capabilities of HTML for creation of query forms and reports, and SQL for queries and updates. The cross-language variable substitution mechanism has been used in the design and implementation of a system called DB2 WWW Connection that enables quick and easy construction of applications that access relational DBMS data from the Web. 
An end user of these DB2 WWW applications sees only the forms for his or her requests and resulting reports. A user fills out the forms, points and clicks to navigate the forms and to access the database as determined by the application. SIGMOD Conference A Content-Based Multimedia Server for Massively Parallel Architectures. "William O'Connell,Ion Tim Ieong,David Schrader,C. Watson,Grace Au,Alexandros Biliris,S. Choo,P. Colin,G. Linderman,Euthimios Panagos,J. Wang,T. Walters" 1996 A Content-Based Multimedia Server for Massively Parallel Architectures. SIGMOD Conference Fault-tolerant Architectures for Continuous Media Servers. Banu Özden,Rajeev Rastogi,Prashant J. Shenoy,Abraham Silberschatz 1996 Continuous media servers that provide support for the storage and retrieval of continuous media data (e.g., video, audio) at guaranteed rates are becoming increasingly important. Such servers, typically, rely on several disks to service a large number of clients, and are thus highly susceptible to disk failures. We have developed two fault-tolerant approaches that rely on admission control in order to meet rate guarantees for continuous media requests. The schemes enable data to be retrieved from disks at the required rate even if a certain disk were to fail. For both approaches, we present data placement strategies and admission control algorithms. We also present design techniques for maximizing the number of clients that can be supported by a continuous media server. Finally, through extensive simulations, we demonstrate the effectiveness of our schemes. SIGMOD Conference Improved Histograms for Selectivity Estimation of Range Predicates. Viswanath Poosala,Yannis E. Ioannidis,Peter J. Haas,Eugene J. Shekita 1996 Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conference LORE: A Lightweight Object REpository for Semistructured Data. Dallan Quass,Jennifer Widom,Roy Goldman,Kevin Haas,Qingshan Luo,Jason McHugh,Svetlozar Nestorov,Anand Rajaraman,Hugo Rivero,Serge Abiteboul,Jeffrey D. Ullman,Janet L. Wiener 1996 LORE: A Lightweight Object REpository for Semistructured Data. SIGMOD Conference Partition Based Spatial-Merge Join. Jignesh M. Patel,David J. DeWitt 1996 This paper describes PBSM (Partition Based Spatial-Merge), a new algorithm for performing the spatial join operation. This algorithm is especially effective when neither of the inputs to the join has an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamically redistributed. The PBSM algorithm partitions the inputs into manageable chunks, and joins them using a computational geometry based plane-sweeping technique. This paper also presents a performance study comparing the traditional indexed nested loops join algorithm, a spatial join algorithm based on joining spatial indices, and the PBSM algorithm. These comparisons are based on complete implementations of these algorithms in Paradise, a database system for handling GIS applications. Using real data sets, the performance study examines the behavior of these spatial join algorithms in a variety of situations, including the cases when both, one, or none of the inputs to the join have a suitable index. The study also examines the effect of clustering the join inputs on the performance of these join algorithms. The performance comparisons demonstrate the feasibility and applicability of the PBSM join algorithm. 
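The PBSM entry above joins each pair of partitions with a plane sweep over rectangle MBRs. The sketch below shows that per-partition step only: sort both inputs by their lower x-coordinate, sweep left to right, and test y-overlap among rectangles whose x-extents still intersect. The partitioning phase, handling of replicated rectangles, and refinement on exact geometry are omitted, and the sample data is made up.

```python
# Plane-sweep intersection of two sets of MBRs (xlo, ylo, xhi, yhi).
def sweep_join(rs, ss):
    """Returns (i, j) index pairs whose rectangles rs[i] and ss[j] overlap."""
    events = sorted([(r[0], "R", i) for i, r in enumerate(rs)] +
                    [(s[0], "S", j) for j, s in enumerate(ss)])
    active_r, active_s, out = [], [], []
    for xlo, side, idx in events:
        if side == "R":
            rect = rs[idx]
            active_s[:] = [j for j in active_s if ss[j][2] >= xlo]  # drop expired
            for j in active_s:
                if rect[1] <= ss[j][3] and ss[j][1] <= rect[3]:     # y-overlap
                    out.append((idx, j))
            active_r.append(idx)
        else:
            rect = ss[idx]
            active_r[:] = [i for i in active_r if rs[i][2] >= xlo]
            for i in active_r:
                if rect[1] <= rs[i][3] and rs[i][1] <= rect[3]:
                    out.append((i, idx))
            active_s.append(idx)
    return out

if __name__ == "__main__":
    roads = [(0, 0, 4, 1), (6, 6, 9, 9)]
    lakes = [(3, 0, 5, 5), (7, 8, 8, 12)]
    print(sweep_join(roads, lakes))   # [(0, 0), (1, 1)]
```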
SIGMOD Conference Thinksheet: A Tool for Tailoring Complex Documents. Peter Piatko,Roman Yangarber,Dao-I Lin,Dennis Shasha 1996 Thinksheet: A Tool for Tailoring Complex Documents. SIGMOD Conference Providing Better Support for a Class of Decision Support Queries. Sudhir Rao,Antonio Badia,Dirk Van Gucht 1996 Relational database systems do not effectively support complex queries containing quantifiers (quantified queries) that are increasingly becoming important in decision support applications. Generalized quantifiers provide an effective way of expressing such queries naturally. In this paper, we consider the problem of processing quantified queries within the generalized quantifier framework. We demonstrate that current relational systems are ill-equipped, both at the language and at the query processing level, to deal with such queries. We also provide insights into the intrinsic difficulties associated with processing such queries. We then describe the implementation of a quantified query processor, Q2P, that is based on multidimensional and boolean matrix structures. We provide results of performance experiments run on Q2P that demonstrate superior performance on quantified queries. Our results indicate that it is feasible to augment relational systems with query subsystems like Q2P for significant performance benefits for quantified queries in decision support applications. SIGMOD Conference IDEA: Interactive Data Exploration and Analysis. Peter G. Selfridge,Divesh Srivastava,Lynn O. Wilson 1996 The analysis of business data is often an ill-defined task characterized by large amounts of noisy data. Because of this, business data analysis must combine two kinds of intertwined tasks: exploration and analysis. Exploration is the process of finding the appropriate subset of data to analyze, and analysis is the process of measuring the data to provide the business answer. While there are many tools available both for exploration and for analysis, a single tool or set of tools may not provide full support for these intertwined tasks. We report here on a project that set out to understand a specific business data analysis problem and build an environment to support it. The results of this understanding are, first of all, a detailed list of requirements of this task; second, a set of capabilities that meet these requirements; and third, an implemented client-server solution that addresses many of these requirements and identifies others for future work. Our solution incorporates several novel perspectives on data analysis and combines a history mechanism with a graphical, re-usable representation of the analysis and exploration process. Our approach emphasizes using the database itself to represent as many of these functions as possible. SIGMOD Conference Materialized View Maintenance and Integrity Constraint Checking: Trading Space for Time. Kenneth A. Ross,Divesh Srivastava,S. Sudarshan 1996 We investigate the problem of incremental maintenance of an SQL view in the face of database updates, and show that it is possible to reduce the total time cost of view maintenance by materializing (and maintaining) additional views. We formulate the problem of determining the optimal set of additional views to materialize as an optimization problem over the space of possible view sets (which includes the empty set). 
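The view-maintenance entry above (and the deferred-maintenance entry earlier in this listing) rests on patching a materialized view with deltas instead of recomputing it. The sketch below shows only the basic delta rule for a two-way join view V = R join S under insertions: delta(V) = (delta(R) join S) union (R join delta(S)) union (delta(R) join delta(S)). Deletions, duplicates, and the choice of auxiliary views that the entry actually studies are not handled; the relation shapes are assumptions for illustration.

```python
# Incremental maintenance of a join view under insert-only deltas.
def join(r, s):
    """r: (key, a) pairs, s: (key, b) pairs; returns (key, a, b) tuples."""
    return [(k, a, b) for (k, a) in r for (k2, b) in s if k == k2]

def maintain(view, r, s, delta_r, delta_s):
    view += join(delta_r, s) + join(r, delta_s) + join(delta_r, delta_s)
    r += delta_r          # fold the deltas into the base tables afterwards
    s += delta_s
    return view

if __name__ == "__main__":
    r = [(1, "r1")]
    s = [(1, "s1"), (2, "s2")]
    view = join(r, s)                      # initial materialization
    maintain(view, r, s, delta_r=[(2, "r2")], delta_s=[(1, "s3")])
    print(sorted(view))                    # matches a full recomputation of R join S
```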
The optimization problem is harder than query optimization since it has to deal with multiple view sets, updates of multiple relations, and multiple ways of maintaining each view set for each updated relation. We develop a memoing solution for the problem; the solution can be implemented using the expression DAG representation used in rule-based optimizers such as Volcano. We demonstrate that global optimization cannot, in general, be achieved by locally optimizing each materialized subview, because common subexpressions between different materialized subviews can allow nonoptimal local plans to be combined into an optimal global plan. We identify conditions on materialized subviews in the expression DAG when local optimization is possible. Finally, we suggest heuristics that can be used to efficiently determine a useful set of additional views to materialize. Our results are particularly important for the efficient checking of assertions (complex integrity constraints) in the SQL-92 standard, since the incremental checking of such integrity constraints is known to be essentially equivalent to the view maintenance problem. SIGMOD Conference The Garlic Project. Mary Tork Roth,Manish Arya,Laura M. Haas,Michael J. Carey,William F. Cody,Ronald Fagin,Peter M. Schwarz,Joachim Thomas II,Edward L. Wimmers 1996 "The goal of the Garlic [1] project is to build a multimedia information system capable of integrating data that resides in different database systems as well as in a variety of non-database data servers. This integration must be enabled while maintaining the independence of the data servers, and without creating copies of their data. ""Multimedia"" should be interpreted broadly to mean not only images, video, and audio, but also text and application specific data types (e.g., CAD drawings, medical objects, …). Since much of this data is naturally modeled by objects, Garlic provides an object-oriented schema to applications, interprets object queries, creates execution plans for sending pieces of queries to the appropriate data servers, and assembles query results for delivery back to the applications. A significant focus of the project is support for ""intelligent"" data servers, i.e., servers that provide media-specific indexing and query capabilities [2]. Database optimization technology is being extended to deal with heterogeneous collections of data servers so that efficient data access plans can be employed for multi-repository queries. A prototype of the Garlic system has been operational since January 1995. Queries are expressed in an SQL-like query language that has been extended to include object-oriented features such as reference-valued attributes and nested sets. In addition to a C++ API, Garlic supports a novel query/browser interface called PESTO [3]. This component of Garlic provides end users of the system with a friendly, graphical interface that supports interactive browsing, navigation, and querying of the contents of Garlic databases. Unlike existing interfaces to databases, PESTO allows users to move back and forth seamlessly between querying and browsing activities, using queries to identify interesting subsets of the database, browsing the subset, querying the content of a set-valued attribute of a particularly interesting object in the subset, and so on." SIGMOD Conference Cost-Based Optimization for Magic: Algebra and Implementation. Praveen Seshadri,Joseph M. Hellerstein,Hamid Pirahesh,T. Y. Cliff Leung,Raghu Ramakrishnan,Divesh Srivastava,Peter J. Stuckey,S. 
Sudarshan 1996 "Magic sets rewriting is a well-known optimization heuristic for complex decision-support queries. There can be many variants of this rewriting even for a single query, which differ greatly in execution performance. We propose cost-based techniques for selecting an efficient variant from the many choices. Our first contribution is a practical scheme that models magic sets rewriting as a special join method that can be added to any cost-based query optimizer. We derive cost formulas that allow an optimizer to choose the best variant of the rewriting and to decide whether it is beneficial. The order of complexity of the optimization process is preserved by limiting the search space in a reasonable manner. We have implemented this technique in IBM's DB2 C/S V2 database system. Our performance measurements demonstrate that the cost-based magic optimization technique performs well, and that without it, several poor decisions could be made. Our second contribution is a formal algebraic model of magic sets rewriting, based on an extension of the multiset relational algebra, which cleanly defines the search space and can be used in a rule-based optimizer. We introduce the multiset θ-semijoin operator, and derive equivalence rules involving this operator. We demonstrate that magic sets rewriting for non-recursive SQL queries can be modeled as a sequential composition of these equivalence rules." SIGMOD Conference The MultiView Project: Object-Oriented View Technology and Applications. Elke A. Rundensteiner,Harumi A. Kuno,Young-Gook Ra,Viviane Crestana-Taube,Matthew C. Jones,Pedro José Marrón 1996 The MultiView Project: Object-Oriented View Technology and Applications. SIGMOD Conference Fundamental Techniques for Order Optimization. David E. Simmen,Eugene J. Shekita,Timothy Malkemus 1996 Fundamental Techniques for Order Optimization. SIGMOD Conference Static Detection of Security Flaws in Object-Oriented Databases. Keishi Tajima 1996 "Access control at function granularity is one of the features of many object-oriented databases. In those systems, the users are granted rights to invoke composed functions instead of rights to invoke primitive operations. Although primitive operations are invoked inside composed functions, the users can invoke them only through the granted functions. This achieves access control at the abstract operation level. Access control utilizing encapsulated functions, however, easily causes many ""security flaws"" through which malicious users can bypass the encapsulation and can abuse the primitive operations inside the functions. In this paper, we develop a technique to statically detect such security flaws. First, we design a framework to describe security requirements that should be satisfied. Then, we develop an algorithm that syntactically analyzes program code of the functions and determines whether given security requirements are satisfied or not. This algorithm is sound, that is, whenever there is a security flaw, it detects it." SIGMOD Conference Mining Quantitative Association Rules in Large Relational Tables. Ramakrishnan Srikant,Rakesh Agrawal 1996 "We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be ""10% of married people between age 50 and 60 have at least 2 cars"". We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. 
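The quantitative-association-rules entry above fine-partitions numeric attributes and then merges adjacent intervals before mining. The sketch below shows only the elementary pieces: equi-depth partitioning of one attribute and the support and confidence of an interval rule such as "Age in [lo, hi] implies at least two cars". Column names, thresholds, the partition count, and the merging strategy are assumptions for illustration; the paper's partial-completeness measures and interest measure are not reproduced.

```python
# Equi-depth partitioning and support/confidence of an interval-based rule.
def equi_depth(values, parts):
    ordered = sorted(values)
    step = max(1, len(ordered) // parts)
    return ordered[step::step]            # cut points between roughly equal groups

def rule_stats(rows, lo, hi):
    """Support and confidence of: Age in [lo, hi]  =>  cars >= 2."""
    body = [r for r in rows if lo <= r["age"] <= hi]
    head = [r for r in body if r["cars"] >= 2]
    support = len(head) / len(rows)
    confidence = len(head) / len(body) if body else 0.0
    return support, confidence

if __name__ == "__main__":
    rows = [{"age": 23, "cars": 1}, {"age": 38, "cars": 2},
            {"age": 51, "cars": 2}, {"age": 57, "cars": 3},
            {"age": 62, "cars": 1}, {"age": 44, "cars": 0}]
    print(equi_depth([r["age"] for r in rows], parts=3))
    print(rule_stats(rows, 50, 60))       # support 2/6, confidence 2/2
```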
We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a ""greater-than-expected-value"" interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset." SIGMOD Conference Rapid Bushy Join-order Optimization with Cartesian Products. Bennet Vance,David Maier 1996 Query optimizers often limit the search space for join orderings, for example by excluding Cartesian products in subplans or by restricting plan trees to left-deep vines. Such exclusions are widely assumed to reduce optimization effort while minimally affecting plan quality. However, we show that searching the complete space of plans is more affordable than has been previously recognized, and that the common exclusions may be of little benefit. We start by presenting a Cartesian product optimizer that requires at most a few seconds of workstation time to search the space of bushy plans for products of up to 15 relations. Building on this result, we present a join-order optimizer that achieves a similar level of performance, and retains the ability to include Cartesian products in subplans wherever appropriate. The main contribution of the paper is in fully separating join-order enumeration from predicate analysis, and in showing that the former problem in particular can be solved swiftly by novel implementation techniques. A secondary contribution is to initiate a systematic approach to the benchmarking of join-order optimization, which we apply to the evaluation of our method. SIGMOD Conference BIRCH: An Efficient Data Clustering Method for Very Large Databases. Tian Zhang,Raghu Ramakrishnan,Miron Livny 1996 "Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle ""noise"" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior." SIGMOD Conference On-line Reorganization of Sparsely-populated B+trees. Chendong Zou,Betty Salzberg 1996 In this paper, we present an efficient method to do online reorganization of sparsely-populated B+-trees.
It reorganizes the leaves first, compacting, in short operations, groups of leaves with the same parent. After compacting, optionally, the new leaves may swap locations or be moved into empty pages so that they are in key order on the disk. After the leaves are reorganized, the method shrinks the tree by making a copy of the upper part of the tree while leaving the leaves in place. A new concurrency method is introduced so that only a minimum number of pages are locked during reorganization. During leaf reorganization, Forward Recovery is used to save all work already done while maintaining consistency after system crashes. A heuristic algorithm is developed to reduce the number of swaps needed during leaf reorganization, so that better concurrency and easier recovery can be achieved. The process of switching from the old B+-tree to the new B+-tree is described in detail for the first time. VLDB On the Computation of Multidimensional Aggregates. Sameet Agarwal,Rakesh Agrawal,Prasad Deshpande,Ashish Gupta,Jeffrey F. Naughton,Raghu Ramakrishnan,Sunita Sarawagi 1996 On the Computation of Multidimensional Aggregates. VLDB Disseminating Updates on Broadcast Disks. Swarup Acharya,Michael J. Franklin,Stanley B. Zdonik 1996 Disseminating Updates on Broadcast Disks. VLDB The XPS Approach to Loading and Unloading Terabyte Databases. Sanket Atal 1996 The XPS Approach to Loading and Unloading Terabyte Databases. VLDB Performance of Future Database Systems: Bottlenecks and Bonanzas Chaitanya K. Baru 1996 Performance of Future Database Systems: Bottlenecks and Bonanzas VLDB Supporting Periodic Authorizations and Temporal Reasoning in Database Access Control. Elisa Bertino,Claudio Bettini,Elena Ferrari,Pierangela Samarati 1996 Supporting Periodic Authorizations and Temporal Reasoning in Database Access Control. VLDB TPC-D: The Challenges, Issues and Results. Ramesh Bhashyam 1996 TPC-D: The Challenges, Issues and Results. VLDB The X-tree : An Index Structure for High-Dimensional Data Stefan Berchtold,Daniel A. Keim,Hans-Peter Kriegel 1996 The X-tree : An Index Structure for High-Dimensional Data VLDB Query Processing Techniques for Multiversion Access Methods. Jochen Van den Bercken,Bernhard Seeger 1996 Query Processing Techniques for Multiversion Access Methods. VLDB Coalescing in Temporal Databases. Michael H. Böhlen,Richard T. Snodgrass,Michael D. Soo 1996 Coalescing in Temporal Databases. VLDB Of Objects and Databases: A Decade of Turmoil Michael J. Carey,David J. DeWitt 1996 Of Objects and Databases: A Decade of Turmoil VLDB PESTO : An Integrated Query/Browser for Object Databases. Michael J. Carey,Laura M. Haas,Vivekananda Maganty,John H. Williams 1996 PESTO : An Integrated Query/Browser for Object Databases. VLDB Dynamic Load Balancing in Hierarchical Parallel Database Systems. Luc Bouganim,Daniela Florescu,Patrick Valduriez 1996 Dynamic Load Balancing in Hierarchical Parallel Database Systems. VLDB "The Query Optimizer in Tandem's new ServerWare SQL Product." Pedro Celis 1996 "The Query Optimizer in Tandem's new ServerWare SQL Product." VLDB Querying Multiple Features of Groups in Relational Databases. Damianos Chatziantoniou,Kenneth A. Ross 1996 Querying Multiple Features of Groups in Relational Databases. VLDB Optimization of Queries with User-defined Predicates Surajit Chaudhuri,Kyuseok Shim 1996 Relational databases provide the ability to store user-defined functions and predicates which can be invoked in SQL queries.
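Returning briefly to the on-line B+-tree reorganization entry above (Zou and Salzberg): below is a toy sketch, under invented assumptions, of the leaf-compaction step it describes, in which the entries of sibling leaves under one parent are packed into fewer pages. Real code would also update the parent's entries and the sibling links and log the operation for recovery; the page capacity here is hypothetical.

# Hypothetical sketch of compacting a group of sibling B+-tree leaves.
LEAF_CAPACITY = 100  # invented page capacity, entries per leaf

def compact_sibling_leaves(leaves):
    """leaves: lists of (key, rid) pairs from leaves sharing one parent,
    already in key order; returns fewer, fuller leaves over the same keys."""
    entries = [entry for leaf in leaves for entry in leaf]
    return [entries[i:i + LEAF_CAPACITY]
            for i in range(0, len(entries), LEAF_CAPACITY)]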
When evaluation of a user-defined predicate is relatively expensive, the traditional method of evaluating predicates as early as possible is no longer a sound heuristic. There are two previous approaches for optimizing such queries. However, neither is able to guarantee the optimal plan over the desired execution space. We present efficient techniques that are able to guarantee the choice of an optimal plan over the desired execution space. The optimization algorithm with complete rank-ordering improves upon the naive optimization algorithm by exploiting the nature of the cost formulas for join methods and is polynomial in the number of user-defined predicates (for a given number of relations). We also propose pruning rules that significantly reduce the cost of searching the execution space for both the naive algorithm as well as for the optimization algorithm with complete rank-ordering, without compromising optimality. We also propose a conservative local heuristic that is simpler and has low optimization overhead. Although it is not always guaranteed to find the optimal plans, it produces close to optimal plans in most cases. We discuss how to determine the algorithm of choice depending on application requirements. It should be emphasized that our optimization algorithms handle user-defined selections as well as user-defined join predicates uniformly. We present complexity analysis and experimental comparison of the algorithms. VLDB Integrating Triggers and Declarative Constraints in SQL Database Systems. Roberta Cochrane,Hamid Pirahesh,Nelson Mendonça Mattos 1996 Integrating Triggers and Declarative Constraints in SQL Database Systems. VLDB Querying a Multilevel Database: A Logical Analysis. Frédéric Cuppens 1996 Querying a Multilevel Database: A Logical Analysis. VLDB Semantic Data Caching and Replacement. Shaul Dar,Michael J. Franklin,Björn Þór Jónsson,Divesh Srivastava,Michael Tan 1996 Semantic Data Caching and Replacement. VLDB Clustering Techniques for Minimizing External Path Length. Ajit A. Diwan,Sanjeeva Rane,S. Seshadri,S. Sudarshan 1996 Clustering Techniques for Minimizing External Path Length. VLDB Large Databases for Remote Sensing and GIS. A. R. Dasgupta 1996 Large Databases for Remote Sensing and GIS. VLDB Information Retrieval from an Incomplete Data Cube. Curtis E. Dyreson 1996 Information Retrieval from an Incomplete Data Cube. VLDB Analysis of n-Dimensional Quadtrees using the Hausdorff Fractal Dimension Christos Faloutsos,Volker Gaede 1996 Analysis of n-Dimensional Quadtrees using the Hausdorff Fractal Dimension VLDB "Modeling Skewed Distribution Using Multifractals and the `80-20' Law." Christos Faloutsos,Yossi Matias,Abraham Silberschatz 1996 "Modeling Skewed Distribution Using Multifractals and the `80-20' Law." VLDB Scalability and Availability in Oracle7 7.3. Dieter Gawlick 1996 Scalability and Availability in Oracle7 7.3. VLDB Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules. Takeshi Fukuda,Yasuhiko Morimoto,Shinichi Morishita,Takeshi Tokuyama 1996 Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules. VLDB Cost-based Selection of Path Expression Processing Algorithms in Object-Oriented Databases. Georges Gardarin,Jean-Robert Gruser,Zhao-Hui Tang 1996 Cost-based Selection of Path Expression Processing Algorithms in Object-Oriented Databases. VLDB Calibrating the Query Optimizer Cost Model of IRO-DB, an Object-Oriented Federated Database System.
Georges Gardarin,Fei Sha,Zhao-Hui Tang 1996 Calibrating the Query Optimizer Cost Model of IRO-DB, an Object-Oriented Federated Database System. VLDB The Changing Landscape of the Software Industry and its Implications for India Umang Gupta 1996 The Changing Landscape of the Software Industry and its Implications for India VLDB What is the Data Warehousing Problem? (Are Materialized Views the Answer?) Ashish Gupta,Inderpal Singh Mumick 1996 What is the Data Warehousing Problem? (Are Materialized Views the Answer?) VLDB Using Referential Integrity To Easily Define Consistent Subset Replicas. Brad Hammond 1996 Using Referential Integrity To Easily Define Consistent Subset Replicas. VLDB Very Large Databases in a Commercial Application Environment Karl-Heinz Hess 1996 Very Large Databases in a Commercial Application Environment VLDB ZOO : A Desktop Experiment Management Environment. Yannis E. Ioannidis,Miron Livny,Shivani Gupta,Nagavamsi Ponnekanti 1996 ZOO : A Desktop Experiment Management Environment. VLDB Practical Issues with Commercial Use of Federated Databases. Jim Kleewein 1996 Practical Issues with Commercial Use of Federated Databases. VLDB Cache Coherency in Oracle Parallel Server. Boris Klots 1996 Cache Coherency in Oracle Parallel Server. VLDB Fast Nearest Neighbor Search in Medical Image Databases. Flip Korn,Nikolaos Sidiropoulos,Christos Faloutsos,Eliot Siegel,Zenon Protopapas 1996 Fast Nearest Neighbor Search in Medical Image Databases. VLDB Efficient Snapshot Differential Algorithms for Data Warehousing. Wilburt Labio,Hector Garcia-Molina 1996 Efficient Snapshot Differential Algorithms for Data Warehousing. VLDB SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems. Laks V. S. Lakshmanan,Fereidoon Sadri,Iyer N. Subramanian 1996 SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems. VLDB Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases. Sin Yeung Lee,Tok Wang Ling 1996 Further Improvements on Integrity Constraint Checking for Stratifiable Deductive Databases. VLDB Obtaining Complete Answers from Incomplete Databases. Alon Y. Levy 1996 Obtaining Complete Answers from Incomplete Databases. VLDB Querying Heterogeneous Information Sources Using Source Descriptions Alon Y. Levy,Anand Rajaraman,Joann J. Ordille 1996 Querying Heterogeneous Information Sources Using Source Descriptions VLDB Database Management Systems and the Internet Susan Malaika 1996 Database Management Systems and the Internet VLDB Supporting Procedural Constructs in SQL Compilers. Nelson Mendonça Mattos 1996 Supporting Procedural Constructs in SQL Compilers. VLDB EROC: A Toolkit for Building NEATO Query Optimizers. William J. McKenna,Louis Burger,Chi Hoang,Melissa Truong 1996 EROC: A Toolkit for Building NEATO Query Optimizers. VLDB A New SQL-like Operator for Mining Association Rules. Rosa Meo,Giuseppe Psaila,Stefano Ceri 1996 A New SQL-like Operator for Mining Association Rules. VLDB MineSet(tm): A System for High-End Data Mining and Visualization. 1996 MineSet(tm): A System for High-End Data Mining and Visualization. VLDB DWMS: Data Warehouse Management System. Narendra Mohan 1996 DWMS: Data Warehouse Management System. VLDB DISNIC-PLAN: A NICNET Based Distributed Database for Micro-level Planning in India. M. Moni 1996 DISNIC-PLAN: A NICNET Based Distributed Database for Micro-level Planning in India. VLDB Effective & Efficient Document Ranking without using a Large Lexicon. 
Yasushi Ogawa 1996 Effective & Efficient Document Ranking without using a Large Lexicon. VLDB Extracting Large Data Sets using DB2 Parallel Edition. Sriram Padmanabhan 1996 Extracting Large Data Sets using DB2 Parallel Edition. VLDB Object Fusion in Mediator Systems. Yannis Papakonstantinou,Serge Abiteboul,Hector Garcia-Molina 1996 Object Fusion in Mediator Systems. VLDB Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. Viswanath Poosala,Yannis E. Ioannidis 1996 Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. VLDB Loading the Data Warehouse Across Various Parallel Architectures Vijay V. Raghavan 1996 Loading the Data Warehouse Across Various Parallel Architectures VLDB Modeling Design Versions. R. Ramakrishnan,D. Janaki Ram 1996 Modeling Design Versions. VLDB How System 11 SQL Server Became Fast. T. K. Rengarajan 1996 How System 11 SQL Server Became Fast. VLDB Intra-Transaction Parallelism in the Mapping of an Object Model to a Relational Multi-Processor System. Michael Rys,Moira C. Norrie,Hans-Jörg Schek 1996 Intra-Transaction Parallelism in the Mapping of an Object Model to a Relational Multi-Processor System. VLDB The Structured Information Manager: A Database System for SGML Documents. Ron Sacks-Davis 1996 The Structured Information Manager: A Database System for SGML Documents. VLDB Reordering Query Execution in Tertiary Memory Databases. Sunita Sarawagi,Michael Stonebraker 1996 Reordering Query Execution in Tertiary Memory Databases. VLDB WATCHMAN : A Data Warehouse Intelligent Cache Manager. Peter Scheuermann,Junho Shim,Radek Vingralek 1996 WATCHMAN : A Data Warehouse Intelligent Cache Manager. VLDB The Design and Implementation of a Sequence Database System. Praveen Seshadri,Miron Livny,Raghu Ramakrishnan 1996 The Design and Implementation of a Sequence Database System. VLDB Filter Trees for Managing Spatial Data over a Range of Size Granularities Kenneth C. Sevcik,Nick Koudas 1996 Filter Trees for Managing Spatial Data over a Range of Size Granularities VLDB SPRINT: A Scalable Parallel Classifier for Data Mining John C. Shafer,Rakesh Agrawal,Manish Mehta 1996 SPRINT: A Scalable Parallel Classifier for Data Mining VLDB "Bellcore's ADAPT/X Harness System for Managing Information on Internet and Intranets." Amit P. Sheth 1996 "Bellcore's ADAPT/X Harness System for Managing Information on Internet and Intranets." VLDB Supporting State-Wide Immunisation Tracking Using Multi-Paradigm Workflow Technology. Amit P. Sheth,Krys Kochut,John A. Miller,Devashish Worah,Souvik Das,Chenye Lin,Devanand Palaniswami,John Lynch,Ivan Shevchenko 1996 Supporting State-Wide Immunisation Tracking Using Multi-Paradigm Workflow Technology. VLDB Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. Amit Shukla,Prasad Deshpande,Jeffrey F. Naughton,Karthikeyan Ramasamy 1996 Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. VLDB Answering Queries with Aggregation Using Views. Divesh Srivastava,Shaul Dar,H. V. Jagadish,Alon Y. Levy 1996 Answering Queries with Aggregation Using Views. VLDB Incremental Maintenance of Externally Materialized Views. Martin Staudt,Matthias Jarke 1996 Incremental Maintenance of Externally Materialized Views. VLDB Query Decomposition and View Maintenance for Query Languages for Unstructured Data. Dan Suciu 1996 Query Decomposition and View Maintenance for Query Languages for Unstructured Data. 
VLDB Implementation and Analysis of a Parallel Collection Query Language. Dan Suciu 1996 Implementation and Analysis of a Parallel Collection Query Language. VLDB Tribeca: A Stream Database Manager for Network Traffic Analysis. Mark Sullivan 1996 Tribeca: A Stream Database Manager for Network Traffic Analysis. VLDB Sampling Large Databases for Association Rules. Hannu Toivonen 1996 Sampling Large Databases for Association Rules. VLDB The Role of Integrity Constraints in Database Interoperation. Mark W. W. Vermeer,Peter M. G. Apers 1996 The Role of Integrity Constraints in Database Interoperation. VLDB Applying Data Mining Techniques to a Health Insurance Information System. Marisa S. Viveros,John P. Nearhos,Michael J. Rothman 1996 Applying Data Mining Techniques to a Health Insurance Information System. SIGMOD Record ACT-NET - The Active Database Management System Manifesto: A Rulebase of ADBMS Features. 1996 ACT-NET - The Active Database Management System Manifesto: A Rulebase of ADBMS Features. SIGMOD Record Workshop Report: The First International Workshop on Active and Real-Time Database Systems (ARTDB-95). Mikael Berndtsson,Jörgen Hansson 1996 Workshop Report: The First International Workshop on Active and Real-Time Database Systems (ARTDB-95). SIGMOD Record An Orthogonally Persistent Java. Malcolm P. Atkinson,Laurent Daynès,Mick J. Jordan,Tony Printezis,Susan Spence 1996 The language Java is enjoying a rapid rise in popularity as an application programming language. For many applications an effective provision of database facilities is required. Here we report on a particular approach to providing such facilities, called “orthogonal persistence”. Persistence allows data to have lifetimes that vary from transient to (the best approximation we can achieve to) indefinite. It is orthogonal persistence if the available lifetimes are the same for all kinds of data. We aim to show that the programmer productivity gains and possible performance gains make orthogonal persistence a valuable augmentation of Java. SIGMOD Record Overview of the STanford Real-time Information Processor (STRIP). Brad Adelberg,Ben Kao,Hector Garcia-Molina 1996 We believe that the greatest growth potential for soft real-time databases is not as isolated monolithic databases but as components in open systems consisting of many heterogeneous databases. In such environments, the flexibility to deal with unpredictable situations and the ability to cooperate with other databases (often non-real-time databases) is just as important as the guarantee of stringent timing constraints. In this paper, we describe a database designed explicitly for heterogeneous environments, the STanford Real-time Information Processor (STRIP). STRIP, which runs on standard Posix Unix, is a soft real-time main memory database with special facilities for importing and exporting data as well as handling derived data. We will describe the architecture of STRIP, its unique features, and its potential uses in overall system architectures. SIGMOD Record Spotfire: An Information Exploration Environment. Christopher Ahlberg 1996 Spotfire: An Information Exploration Environment.
SIGMOD Record Advances in Real-Time Database Systems Research. Azer Bestavros 1996 Advances in Real-Time Database Systems Research. SIGMOD Record Report on First International Workshop on Real-Time Database Systems. Azer Bestavros,Kwei-Jay Lin,Sang Hyuk Son 1996 Report on First International Workshop on Real-Time Database Systems. SIGMOD Record TPC-D - The Challenges, Issues and Results. Ramesh Bhashyam 1996 TPC-D - The Challenges, Issues and Results. SIGMOD Record DeeDS Towards a Distributed and Active Real-Time Database System. Sten Andler,Jörgen Hansson,Joakim Eriksson,Jonas Mellin,Mikael Berndtsson,Bengt Eftring 1996 DeeDS Towards a Distributed and Active Real-Time Database System. SIGMOD Record The BeSS Object Storage Manager: Architecture Overview. Alexandros Biliris,Euthimios Panagos 1996 The BeSS Object Storage Manager: Architecture Overview. SIGMOD Record MQSeries and CICS Link for Lotus Notes. 1996 MQSeries and CICS Link for Lotus Notes. SIGMOD Record Integrating Contents and Structure in Text Retrieval. Ricardo A. Baeza-Yates,Gonzalo Navarro 1996 The purpose of a textual database is to store textual documents. These documents have not only textual contents, but also structure. Many traditional text database systems have focused only on querying by contents or by structure. Recently, a number of models integrating both types of queries have appeared. We argue in favor of that integration, and focus our attention on these recent models, covering a representative sampling of the proposals in the field. We pay special attention to the tradeoffs between expressiveness and efficiency, showing the compromises taken by the models. We argue in favor of achieving a good compromise, since being weak in any of these two aspects makes the model useless for many applications. SIGMOD Record Control Strategies for Complex Relational Query Processing in Shared Nothing Systems. Lionel Brunie,Harald Kosch 1996 In this paper, we present an original and complete methodology for supervising relational query processing in shared nothing systems. A new control mechanism is introduced which allows the detection and the correction of optimizer estimation errors and load imbalance. We especially focus on the management of intraprocessor communication and on the overlapping of communication and computation. Performance evaluations on a hypercube and a grid interconnection machine show the efficiency and the robustness of the proposed methods. SIGMOD Record OLAP, Relational, and Multidimensional Database Systems. George Colliat 1996 OLAP, Relational, and Multidimensional Database Systems. SIGMOD Record Domains, Relations and Religious Wars. Rafael Camps 1996 Domains, Relations and Religious Wars. SIGMOD Record "Information Visualization, Guest Editors' Foreword." Tiziana Catarci,Isabel F. Cruz 1996 "Information Visualization, Guest Editors' Foreword." SIGMOD Record 3D Geographic Network Displays. Kenneth C. Cox,Stephen G. Eick,Taosong He 1996 Many types of information may be represented as graphs or networks with the nodes corresponding to entities and the links to relationships between entities. Often there is geographical information associated with the network. The traditional way to visualize geographical networks employs node and link displays on a two-dimensional map.
These displays are easily overwhelmed, and for large networks become visually cluttered and confusing. To overcome these problems we have invented five novel network views that generalize the traditional displays. Two of the views show the complete network, while the other three concentrate on a portion of a larger network defined by connectivity to a given node. Our new visual metaphors retain many of the well-known advantages of the traditional network maps, while exploiting three-dimensional graphics to address some of the fundamental problems limiting the scalability of two-dimensional displays. SIGMOD Record "UniSQL's Next-Generation Object-Relational Database Management System." "Albert D'Andrea,Phil Janus" 1996 "Object-Relational DBMSs have been receiving a great deal of attention from industry analysts and press as the next generation of database management systems. The motivation for a next generation DBMS is driven by the reality of shortened business cycles. This dynamic environment demands fast, cost-effective, time-to-market of new or modified business processes, services, and products. To support this important business need, the next generation DBMS must: 1. leverage the large investments made in existing relational technology, both in data and skill set; 2. Take advantage of the flexibility, productivity, and performance benefits of OO modeling; and 3. Integrate robust DBMS services for production quality systems. The objective of this article is to provide a brief overview of UniSQL's commercial object-relational database management system." SIGMOD Record In Reply to Domains, relations and Religious Wars. Hugh Darwen 1996 In Reply to Domains, relations and Religious Wars. SIGMOD Record "A Response to R. Camps' Article ``Domains, Relations and Religious Wars''." C. J. Date 1996 "A Response to R. Camps' Article ``Domains, Relations and Religious Wars''." SIGMOD Record Middle East Technical University Software Research and Development Center. Asuman Dogac 1996 "Middle East Technical University (METU) is the leading technical university in Turkey. The Software Research and Development Center was established by the Scientific and Technical Research Council of Turkey (TUBITAK) at the Department of Computer Engineering of METU in October 1991. The aim of this center is twofold: to lead large scale software research and development projects, and to foster international cooperation. SRDC is involved in a number of research and development projects supported by the government, industrial companies and international organizations. Although SRDC projects also cover other fields of computer science, the main emphasis is on database systems. SRDC is organized around the ongoing projects and several engineers and graduate students work in these projects: M. Altinel, B. Arpinar, I. Cingil, Y. Ceken, C. Dengi, E. Gokkoca, C. Evrendilek, P. Karagoz, E. Kilic, P. Koksal, S. Mancuhan, S. Nural, F. Ozcan, G. Ozhan, V. Sadjadi, N. Tatbul. Its infrastructure includes a LAN with Sun Workstations and PCs and commercial software like Oracle7, Sybase, Informix, Adabas D, Ingres and DEC's ObjectBroker. SRDC is a beta test site for several products including SunSoft's Joe and Concerto and Orbix's Object Transaction Service. The remainder of this report describes the main projects." SIGMOD Record New Standard for Stored Procedures in SQL. Andrew Eisenberg 1996 New Standard for Stored Procedures in SQL. SIGMOD Record Lifestreams: A Storage Model for Personal Data. 
Eric Freeman,David Gelernter 1996 "Conventional software systems, such as those based on the “desktop metaphor,” are ill-equipped to manage the electronic information and events of the typical computer user. We introduce a new metaphor, Lifestreams, for dynamically organizing a user's personal workspace. Lifestreams uses a simple organizational metaphor, a time-ordered stream of documents, as an underlying storage system. Stream filters are used to organize, monitor and summarize information for the user. Combined, they provide a system that subsumes many separate desktop applications. This paper describes the Lifestreams model and our prototype system." SIGMOD Record "Illustra's Web DataBlade Module." John Gaffney 1996 "Illustra's Web DataBlade Module." SIGMOD Record Real-Time Index Concurrency Control. Jayant R. Haritsa,S. Seshadri 1996 Real-Time Index Concurrency Control. SIGMOD Record Open Issues in Parallel Query Optimization. Waqar Hasan,Daniela Florescu,Patrick Valduriez 1996 We provide an overview of query processing in parallel database systems and discuss several open issues in the optimization of queries for parallel machines. SIGMOD Record Applying database Visualization to the World Wide Web. Masum Z. Hasan,Alberto O. Mendelzon,Dimitra Vista 1996 In this paper, we present visualizations of parts of the network of documents comprising the World Wide Web. We describe how we are using the Hy+ visualization system to visualize the portion of the World Wide Web explored during a browsing session. As the user browses, the web browser communicates the URL and title of each document fetched as well as all the anchors contained in the document. Hy+ displays graphically the history of the navigation and multiple views of the structure of that portion of the web. SIGMOD Record On the Cost of Monitoring and Reorganization of Object Bases for Clustering. Carsten Andreas Gerlhof,Alfons Kemper,Guido Moerkotte 1996 Clustering is one of the most effective means to enhance the performance of object base applications. Consequently, many proposals exist for algorithms computing good object placements depending on the application profile. However, in an effective object base reorganization tool the clustering algorithm is only one constituent. In this paper, we report on our object base reorganization tool that covers all stages of reorganizing the objects: the application profile is determined by a monitoring tool, the object placement is computed from the monitored access statistics utilizing a variety of clustering algorithms and, finally, the reorganization tool restructures the object base accordingly. The costs as well as the effectiveness of these tools is quantitatively evaluated on the basis of the OO1-benchmark. SIGMOD Record Dynamic Information Visualization. Yannis E. Ioannidis 1996 Dynamic queries constitute a very powerful mechanism for information visualization; some universe of data is visualized, and this visualization is modified on-the-fly as users modify the range of interest within the domains of the various attributes of the visualized information. In this paper, we analyze dynamic queries and offer some natural generalizations of the original concept by establishing a connection to SQL. We also discuss some implementation ideas that should make these generalizations efficient as well. SIGMOD Record Pixel-oriented Database Visualizations. Daniel A. 
Keim 1996 In this paper, we provide an overview of several pixel-oriented visualization techniques which have been developed over the last years to support an effective querying and exploration of large databases. Pixel-oriented techniques use each pixel of the display to visualize one data value and therefore allow the visualization of the largest amount of data possible. The techniques may be divided into query-independent techniques which directly visualize the data (or a certain portion of it) and query-dependent techniques which visualize the relevance of the data with respect to a specific query. An example for the class of query-independent techniques is the recursive pattern technique which is based on a generic recursive scheme generalizing a wide range of pixel-oriented arrangements for visualizing large databases. Examples for the class of query-dependent techniques are the generalized spiral and circle-segments techniques, which visualize the distance with respect to a database query and arrange the most relevant data items in the center of the display. SIGMOD Record A Framework for Information Visualisation. Jessie B. Kennedy,Kenneth J. Mitchell,Peter J. Barclay 1996 In this paper we examine the issues involved in developing information visualisation systems and present a framework for their construction. The framework addresses the components which must be considered in providing effective visualisations. The framework is specified using a declarative object oriented language; the resulting object model may be mapped to a variety of graphical user interface development platforms. This provides general support to developers of visualisation systems. A prototype system exists which allows the investigation of alternative visualisations for a range of data sources. SIGMOD Record Enhancing External Consistency in Real-Time Transactions. Kwei-Jay Lin,Shing-Shan Peng 1996 Enhancing External Consistency in Real-Time Transactions. SIGMOD Record Real-Time Database - Similarity Semantics and Resource Scheduling. Tei-Wei Kuo,Aloysius K. Mok 1996 Real-Time Database - Similarity Semantics and Resource Scheduling. SIGMOD Record Much Ado About Shared-Nothing. Michael G. Norman,Thomas Zurek,Peter Thanisch 1996 "In a 'shared-nothing' parallel computer, each processor has its own memory and disks and processors communicate by passing messages through an interconnect. Many academic researchers, and some vendors, assert that shared-nothingness is the 'consensus' architecture for parallel DBMSs. This alleged consensus is used as a justification for simulation models, algorithms, research prototypes and even marketing campaigns. We argue that shared-nothingness is no longer the consensus hardware architecture and that hardware resource sharing is a poor basis for categorising parallel DBMS software architectures if one wishes to compare the performance characteristics of parallel DBMS products." SIGMOD Record Database Research at the Indian Institute of Technology, Bombay. D. B. Phatak,Nandlal L. Sarda,S. Seshadri,S. Sudarshan 1996 Database Research at the Indian Institute of Technology, Bombay. SIGMOD Record Shutdown, Budget, and Funding. Xiaolei Qian 1996 There are few new funding announcements and requests for proposals, mostly due to the partial government shutdown and the budget impasse. We will report on the potential impact on NSF of the government shutdown and a 7-year balanced budget. We then briefly discuss some BAAs from ARPA, Rome Laboratory, and the Air Force. 
SIGMOD Record "Scientist's Called Upon to Take Actions." Xiaolei Qian 1996 "Scientist's Called Upon to Take Actions." SIGMOD Record New Programs at DARPA and NSF. Xiaolei Qian 1996 We will share with readers some good news on NSF and Defense budget, and report on several interesting new programs at DARPA and NSF. SIGMOD Record The Aggregate Data Problem: A System for their Definition and Management. Maurizio Rafanelli,Antonia Bezenchek,Leonardo Tininini 1996 In this paper we describe the fundamental components of a database management system for the definition, storage, manipulation and query of aggregate data, i.e. data which are obtained by applying statistical aggregations and statistical analysis functions over raw data. In particular, the attention has been focused on: (1) a data structure for the efficient storage and manipulation of aggregate data, called ADaS; (2) the graphical structures of the aggregate data model ADAMO for a more user-friendly definition and query of aggregate data; (3) a graphical user interface which enables a straightforward specification of the ADAMO structures; (4) a textual declarative query language to retrieve data from the aggregate database, called ADQUEL. SIGMOD Record Integrating Temporal, Real-Time, and Active Databases. Krithi Ramamritham,Rajendran M. Sivasankaran,John A. Stankovic,Donald F. Towsley,Ming Xiong 1996 "To meet the needs of many real-world control applications, concepts from Temporal, Real-Time, and Active Databases must be integrated: Since the system's data is supposed to reflect the environment being controlled, they must be updated frequently to maintain temporal validity; Many activities, including those that perform the updates, work under time constraints; The occurrence of events, for example, emergency events, trigger actions. In these systems, meeting timeliness, predictability, and QoS guarantee requirements — through appropriate resource and overload management — become very important. So, algorithms and protocols for concurrency control, recovery, and scheduling are needed. These algorithms must exploit semantics of the data and the transactions to be responsive and efficient. Whereas time cognizant scheduling, concurrency control and conflict resolution have been studied in the literature, recovery issues have not. We have developed strategies for data placement at the appropriate level of memory hierarchy, for avoiding undoing/redoing by exploiting data/transaction characteristics, and for placing logs at the appropriate level in the memory hierarchy. Another issue that we have studied deals with the assignment of priority to transactions in active real-time database systems. We are also studying concurrency control for temporal and multi-media data. We have built RADEx, a simulation environment to evaluate our solutions." SIGMOD Record To Table or Not to Table: a Hypertabular Answer. Giuseppe Santucci,Laura Tarantino 1996 Suitable data set organizers are necessary to help users assimilating information retrieved from a database. In this paper we present (1) a general hypertextual framework for the interaction with tables, and (2) a specialization of the framework in order to present in hypertextual format the results of queries expressed in terms of a visual semantic query language. SIGMOD Record Report from the NSF Workshop on Workflow and Process Automation in Information Systems. Amit P. Sheth,Dimitrios Georgakopoulos,Stef Joosten,Marek Rusinkiewicz,Walt Scacchi,Jack C. Wileden,Alexander L. 
Wolf 1996 An interdisciplinary research community needs to address challenging issues raised by applying workflow management technology in information systems. This conclusion results from the NSF workshop on Workflow and Process Automation in Information Systems which was held at the State Botanical Garden of Georgia during May 8-10, 1996. The workshop brought together active researchers and practitioners from several communities, with significant representation from database and distributed systems, software process and software engineering, and computer supported cooperative work. The presentations given at the workshop are available in the form of an electronic proceedings of this workshop at http://lsdis.cs.uga.edu/activities/). This report is the joint work of selected representatives from the workshop and it documents the results of significant group discussions and exchange of ideas. SIGMOD Record The Mariposa Distributed Database Management System. Jeff Sidell 1996 The Mariposa Distributed Database Management System. SIGMOD Record Database Research: Achievements and Opportunities Into the 21st Century. Abraham Silberschatz,Michael Stonebraker,Jeffrey D. Ullman 1996 "In May, 1995 an NSF workshop on the future of database management systems research was convened. This paper reports the conclusions of that meeting. Among the most important directions for future DBMS research recommended by the panel are: support for multimedia objects; managing distributed and loosely coupled information, as on the world-wide web; supporting new database applications such as data mining and warehousing; workflow and other complex transaction-management problems, and enhancing the ease-of-use of DBMS''s for both users and system managers." SIGMOD Record Incremental data Structures and Algorithms for Dynamic Query Interfaces. Egemen Tanin,Richard Beigel,Ben Shneiderman 1996 Dynamic query interfaces (DQIs) form a recently developed method of database access that provides continuous realtime feedback to the user during the query formulation process. Previous work shows that DQIs are elegant and powerful interfaces to small databases. Unfortunately, when applied to large databases, previous DQI algorithms slow to a crawl. We present a new approach to DQI algorithms that works well with large databases. SIGMOD Record Improving Timeliness in Real-Time Secure Database Systems. Sang Hyuk Son,Rasikan David,Bhavani M. Thuraisingham 1996 Database systems for real-time applications must satisfy timing constraints associated with transactions, while maintaining data consistency. In addition to real-time requirements, security is usually required in many applications. Multilevel security requirements introduce a new dimension to transaction processing in real-time database systems. In this paper, we argue that because of the complexities involved, trade-offs need to be made between security and timeliness. We briefly present the secure two-phase locking protocol and discuss an adaptive method to support trading off security for timeliness, depending on the current state of the system. The performance of the adaptive secure two-phase locking protocol shows improved timeliness. We also discuss future research direction to improve timeliness of secure database systems. SIGMOD Record Temporal Database Bibliography Update. Vassilis J. Tsotras,Anil Kumar 1996 Temporal Database Bibliography Update. SIGMOD Record Exploiting Main Memory DBMS Features to Improve Real-Time Concurrency Control Protocols. 
Özgür Ulusoy,Alejandro P. Buchmann 1996 Exploiting Main Memory DBMS Features to Improve Real-Time Concurrency Control Protocols. SIGMOD Record Database Research at Arizona State University. Susan Darling Urban,Suzanne W. Dietrich,Forouzan Golshani 1996 Database Research at Arizona State University. SIGMOD Record Object Query Standards. Andrew E. Wade 1996 "As object technology is adopted by software systems for analysis and design, language, GUI, and frameworks, the database community also is working to support objects, and to develop standards for that support. A key benefit of object technology is the ability for different objects and object tools to interoperate, so it's critical that such DBMS object standards interoperate with those of the rest of the object world. Starting with a discussion of the new issues objects bring to query standards, we present the efforts of various groups relevant to this, including ODMG, OMG, ANSI X3H2 (SQL3), and recent merger efforts feeding into SQL3. What's different with Objects? ODMG's OQL OMG's Query Service SQL3's Object extensions Efforts to merge" SIGMOD Record "Editor's Notes." Jennifer Widom 1996 "Editor's Notes." SIGMOD Record "Editor's Notes." Jennifer Widom 1996 "Editor's Notes." SIGMOD Record "Editor's Notes." Jennifer Widom 1996 "Editor's Notes." SIGMOD Record Guidelines for Presentation and Comparison of Indexing Techniques. Justin Zobel,Alistair Moffat,Kotagiri Ramamohanarao 1996 Descriptions of new indexing techniques are a common outcome of database research, but these descriptions are sometimes marred by poor methodology and a lack of comparison to other schemes. In this paper we describe a framework for presentation and comparison of indexing schemes that we believe sets a minimum standard for development and dissemination of research results in this area. ICDE ULIXES: Building Relational Views over the Web. Paolo Atzeni,Alessandro Masci,Giansalvatore Mecca,Paolo Merialdo,Elena Tabet 1997 The authors consider structured Web sites, those sites in which structures are so tight and regular that one can assimilate the site, from the logical viewpoint, to a conventional database. They have argued that, with respect to structured Web servers, it is possible to apply ideas from traditional database techniques, specifically with respect to design, query, and update. They focus on the querying process, which consists of associating a scheme with a server and then using this scheme to pose queries in a high level query language. To describe the scheme, they use a specific data model, called the ARANEUS Data Model (ADM). They call ADM a page oriented model, in the sense that the main construct of the model is that of a page scheme, used to describe the structure of sets of homogeneous pages in the server. ADM schemes are then offered to the user, who can query them using the ULIXES language, whose expressions produce relations as results. These are essentially relational views over Web data and can therefore be queried using any relational query language. It should be noted that the approach inherited some ideas from other proposals for query languages for the Web. However, these approaches are mainly based on a loose notion of structure, and tend to view the Web as a huge collection of unstructured objects, organized as a graph. In contrast, the approach explicitly considers structure, both in the information source (the Web) and in the derived information (the relational views). ICDE Universal Access versus Universal Storage. 
William Baker 1997 Universal Access versus Universal Storage. ICDE SEOF: An Adaptable Object Prefetch Policy for Object-Oriented Database Systems. Jung-Ho Ahn,Hyoung-Joo Kim 1997 The performance of object access can be drastically improved by efficient object prefetch. In this paper we present a new object prefetch policy, Selective Eager Object Fetch(SEOF) which prefetches objects only from selected candidate pages without using any high level object semantics. Our policy considers both the correlations and the frequencies of fetching objects. Unlike existing prefetch policies, this policy utilizes the memory and the swap space of clients efficiently without resource exhaustion. Furthermore, the proposed policy has good adaptability to both the effectiveness of clustering and database size. We show the performance of the proposed policy through experiments over various multi-client system configurations. ICDE The Constraint-Based Knowledge Broker System. Jean-Marc Andreoli,Uwe M. Borghoff,Pierre-Yves Chevalier,Boris Chidlovskii,Remo Pareschi,Jutta Willamowski 1997 The amount of information available from electronic sources on the World Wide Web and other on-line information repositories is highly heterogeneous and increases dramatically. Tools are needed to extract relevant information from these repositories. The Constraint-Based Knowledge Brokers project (CBKB) at RXRC Grenoble realizes sophisticated facilities for efficient information retrieval, schema integration, and knowledge fusion. The current implementation of the CBKB research prototype involves three kinds of agents: a) users, who input queries and process answers (i.e., ranking, fusion) through a GUI; b) wrappers, capable of interrogating heterogeneous information sources, which can provide answers to elementary queries (essentially various public bibliographic catalogues available on the Web, as well as preprint archives and opera information repositories); c) brokers, which can manage complex queries (i.e., decompose a complex query, recompose the partial answers, synthesize a full answer) and which mediate between the GUI and the different wrappers. ICDE Modeling Multidimensional Databases. Rakesh Agrawal,Ashish Gupta,Sunita Sarawagi 1997 We propose a data model and a few algebraic operations that provide semantic foundation to multidimensional databases. The distinguishing feature of the proposed model is the symmetric treatment not only of all dimensions but also measures. The model provides support for multiple hierarchies along each dimension and support for adhoc aggregates. The proposed operators are composable, reorderable, and closed in application. These operators are also minimal in the sense that none can be expressed in terms of others nor can any one be dropped without sacrificing functionality. They make possible the declarative specification and optimization of multidimensional database queries that are currently specified operationally. The operators have been designed to be translated to SQL and can be implemented either on top of a relational database system or within a special purpose multidimensional database engine. In effect, they provide an algebraic application programming interface (API) that allows the separation of the frontend from the backend. Finally, the proposed model provides a framework in which to study multidimensional databases and opens several new research problems. ICDE Data Warehousing: Dealing with the Growing Pains. 
Robert Armstrong 1997 A data warehouse provides a customer with information to run and plan their business. It is true that if the data warehouse cannot quickly adapt to changes in the environment then the company will lose the advantage that information provides. A warehouse must be built with a solid foundation that is flexible and responsive to business changes. The purpose of this paper is to share experiences in the area of managing the growth within the data warehouse. There are many technical issues that need to be addressed as the data warehouse grows in multiple dimensions. The ideas in this paper should enable you to provide the correct foundation for a long term warehouse. Very few companies are discussing these issues and the lack of discussion leads to a lack of knowledge that will further lead to poor architectural choices. This paper will articulate not only the benefits that are derived from data warehousing today but also how to prepare to reap benefits for many tomorrows. It will also explore the questions to ask, the points to make, and the issues to be addressed to have a long term successful data warehouse project. ICDE An Argument in Favour of Presumed Commit Protocol. Yousef J. Al-Houmaily,Panos K. Chrysanthis,Steven P. Levitan 1997 We argue in favor of the presumed commit protocol by proposing two new presumed commit variants that significantly reduce the cost of logging activities associated with the original presumed commit protocol. Furthermore, for read-only transactions, we apply our unsolicited update-vote optimization and show that the cost associated with this type of transactions is the same in both presumed commit and presumed abort protocols, thus, nullifying the basis for the argument that favors the presumed abort protocol. This is especially important for modern distributed environments which are characterized by high reliability and high probability of transactions being committed rather than aborted. ICDE Performance Evaluation of Rule Semantics in Active Databases. Elena Baralis,Andrea Bianco 1997 Different rule execution semantics may be available in the same active database system. We perform several simulation experiments to evaluate the performance trade-offs yielded by different execution semantics in various operating conditions. In particular, we evaluate the effect of executing transaction and rule statements that affect a varying number of data instances, and applications with different rule triggering breadth and depth. Since references to data changed by the database operation triggering the rules are commonly used in active rule programming, we also analyze the impact of managing such references on overall performance. ICDE Pinwheel Scheduling for Fault-Tolerant Broadcast Disks in Real-time Database Systems. Sanjoy K. Baruah,Azer Bestavros 1997 The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs. ICDE Tools to Enable Interoperation of Heterogeneous Databases. Wernher Behrendt,N. J. Fiddian,Ajith P.
Madurapperuma 1997 "We demonstrate a prototype toolkit called ITSE (Integrated Translation Support Environment), for interoperation and migration of heterogeneous database systems. The main objective of ITSE is to enable transparency in heterogeneous database environments. Therefore, tools have been developed to support flexible configuration of databases and the wrapping or migrating of legacy systems in intranets. The tools themselves are aimed at MDBMS administrators and MIS analysts. These users can tailor the toolkit's operation so that end users are shielded from the underlying heterogeneity of their information system." ICDE ODB-QOPTIMIZER: A Tool for Semantic Query Optimization in OODB. Sonia Bergamaschi,Domenico Beneventano,Claudio Sartori,Maurizio Vincini 1997 ODB-QOptimizer is a ODMG 93 compliant tool for the schema validation and semantic query optimization. The approach is based on two fundamental ingredients. The first one is the OCDL description logics (DLs) proposed as a common formalism to express class descriptions, a relevant set of integrity constraints rules (IC rules) and queries. The second one are DLs inference techniques, exploited to evaluate the logical implications expressed by IC rules and thus to produce the semantic expansion of a given query. ICDE "Title, General Chairs' Message, Program Chairs' Message, Committees, Reviewers, Author Index." 1997 "Title, General Chairs' Message, Program Chairs' Message, Committees, Reviewers, Author Index." ICDE Titan: A High-Performance Remote Sensing Database. Chialin Chang,Bongki Moon,Anurag Acharya,Carter Shock,Alan Sussman,Joel H. Saltz 1997 There are two major challenges for a high performance remote sensing database. First, it must provide low latency retrieval of very large volumes of spatio temporal data. This requires effective declustering and placement of a multidimensional dataset onto a large disk farm. Second, the order of magnitude reduction in data size due to post processing makes it imperative, from a performance perspective, that the post processing be done on the machine that holds the data. This requires careful coordination of computation and data retrieval. The paper describes the design, implementation and evaluation of Titan, a parallel shared nothing database designed for handling remote sensing data. The computational platform for Titan is a 16 processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of AVHRR data from the NOAA-7 satellite. The experimental results show that Titan provides good performance for global queries and interactive response times for local queries. ICDE On Incremental Cache Coherency Schemes in Mobile Computing Environments. Jun Cai,Kian-Lee Tan,Beng Chin Ooi 1997 Re-examines the cache coherency problem in a mobile computing environment in the context of relational operations (i.e. selection, projection and join). We propose a taxonomy of cache coherency schemes, and as case studies, we pick several schemes for further study. These schemes are novel in several ways. First, they are incremental. Second, they are an integration of (and built on) techniques in view maintenance in centralized systems and cache invalidation in client-server computing environments. We conducted extensive studies based on a simulation model. Our study shows the effectiveness of these algorithms in reducing uplink transmission and average access times. 
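As a generic illustration of the invalidation-style cache coherency that the Cai, Tan and Ooi entry above evaluates (this sketch is not one of the paper's schemes; the report format and API are invented): the server periodically broadcasts the identifiers of changed items, and a mobile client drops only the affected cache entries, using the uplink only on an actual miss.

# Hypothetical sketch: a mobile client cache maintained from periodic
# invalidation reports broadcast by the server.
class ClientCache:
    def __init__(self):
        self.items = {}       # item id -> cached value
        self.report_ts = 0    # timestamp of the last report applied

    def apply_report(self, ts, updated_ids):
        """Drop only the entries named in the broadcast report."""
        for item_id in updated_ids:
            self.items.pop(item_id, None)
        self.report_ts = ts

    def read(self, item_id, fetch_from_server):
        """Answer locally when possible; use the uplink only on a miss."""
        if item_id not in self.items:
            self.items[item_id] = fetch_from_server(item_id)
        return self.items[item_id]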
Moreover, the class of algorithms that exploit collaboration between the client and server performs best in most cases. We also study extended versions of this class of algorithms to further cut down on the work performed by the server. ICDE A Generic Query-Translation Framework for a Mediator Architecture. Jacques Calmet,Sebastian Jekutsch,Joachim Schü 1997 A mediator is a domain-specific tool to support uniform access to multiple heterogeneous information sources and to abstract and combine data from different but related databases to gain new information. This middleware product is urgently needed for these frequently occurring tasks in a decision support environment. In order to provide a front end, a mediator usually defines a new language. If an application or a user submits a question to the mediator, it has to be decomposed into several queries to the underlying information sources. Since these sources can only be accessed using their own query language, a query translator is needed. This paper presents a new approach for implementing query translators. It supports conjunctive queries as well as negation. Care is taken to support information sources whose processing capabilities do not allow conjunctive queries in general. Rapid implementation is guided by reusing previously prepared code. The specification of the translator is done declaratively and in a domain-independent way. ICDE Failure Handling for Transaction Hierarchies. Qiming Chen,Umeshwar Dayal 1997 "Previously, failure recovery mechanisms have been developed separately for nested transactions and for transactional workflows specified as ""flat"" flow graphs. The paper develops unified techniques for complex business processes modeled as cooperative transaction hierarchies. Multiple cooperative transaction hierarchies often have operational dependencies, thus a failure occurring in one transaction hierarchy may need to be transferred to another. The existing transaction models do not support failure handling across transaction hierarchies. The authors introduce the notion of a transaction execution history tree, which allows one to develop a unified hierarchical failure recovery mechanism applicable to both nested and flat transaction structures. They also develop a cross-hierarchy undo mechanism for determining failure scopes and supporting backward and forward failure recovery over multiple transaction hierarchies. These mechanisms form a structured and unified approach for handling failures in flat transactional workflows, along a transaction hierarchy, and across transaction hierarchies." ICDE Semantic Dictionary Design for Database Interoperability. Silvana Castano,Valeria De Antonellis 1997 Criteria and techniques to support the establishment of a semantic dictionary for database interoperability are described. The techniques allow the analysis of conceptual schemas of databases in a federation and the definition and maintenance of concept hierarchies. Similarity-based criteria are used to evaluate concept closeness and, consequently, to generate concept hierarchies. Experimentation of the techniques in the public administration domain is discussed. ICDE New and Forgotten Dreams in Database Research (Panel). Surajit Chaudhuri,Rakesh Agrawal,Klaus R. Dittrich,Andreas Reuter,Abraham Silberschatz,Gerhard Weikum 1997 New and Forgotten Dreams in Database Research (Panel). ICDE Subquery Elimination: A Complete Unnesting Algorithm for an Extended Relational Algebra.
Pedro Celis,Hansjörg Zeller 1997 "Summary form only given, as follows. Research in the area of subquery unnesting algorithms has mostly focused on the problem of making queries more efficient at run-time by transforming subqueries into joins. Unnesting rules describe a transformation of a nested query tree or a nested SQL query into an equivalent tree or SQL query that is no longer nested. However, it is not possible to express all nested queries in a non-nested form, unless the used language (relational algebra or ISO/ANSI SQL) is extended. This means that a database system must continue to have the ability to process subqueries. When working on a new optimizer and executor design for NonStop SQL, our development team was faced with a slightly different problem: we wanted to eliminate the need for optimization and execution of nested queries altogether and were looking for a complete subquery unnesting process. Such a process would allow us to develop a query optimizer and executor that do not need to process subqueries. Our goal was to make use of the existing unnesting algorithms and to extend them in a way that does not necessarily improve or change the execution characteristics of nested queries, but that leads to complete unnesting of all forms of nested queries, as defined by the ""full"" level of the ISO/ANSI SQL92 standard. To indicate this different approach we call it ""subquery elimination"" rather than ""subquery unnesting""." ICDE Developing and Accessing Scientific Databases with the OPM Data Management Tools. I-Min A. Chen,Anthony Kosky,Victor M. Markowitz,Ernest Szeto 1997 Summary form only given. The Object-Protocol Model (OPM) data management tools provide facilities for rapid development, documentation, and flexible exploration of scientific databases. The tools are based on OPM, an object-oriented data model which is similar to the ODMG standard, but also supports extensions for modeling scientific data. Databases designed using OPM can be implemented using a variety of commercial relational DBMSs, using schema translation tools that generate complete DBMS database definitions from OPM schemas. Further, OPM schemas can be retrofitted on top of existing databases defined using a variety of notations, such as the relational data model or the ASN.1 data exchange format, using OPM retrofitting tools. Several archival molecular biology databases have been designed and implemented using the OPM tools, including the Genome Database (GDB) and the Protein Data Bank (PDB), while other scientific databases, such as the Genome Sequence Database (GSDB), have been retrofitted with semantically enhanced views using the OPM tools. ICDE The IDEA Tool Set. Stefano Ceri,Piero Fraternali,Stefano Paraboschi 1997 The IDEA Tool Set. ICDE Data Mining: Where is it Heading? (Panel). Jiawei Han 1997 Data mining is a promising field in which research and development activities are flourishing. It is also a young field with vast, unexplored territories. How can we contribute significantly to this fast expanding, multi-disciplinary field? This panel will bring database researchers together to share different views and insights on the issues in the field. ICDE Indexing OODB Instances based on Access Proximity.
Chee Yong Chan,Cheng Hian Goh,Beng Chin Ooi 1997 Queries in object-oriented databases (OODBs) may be asked with respect to different class scopes: a query may request either object instances which belong exclusively to a given class c, or those which belong to any class in the hierarchy rooted at c. To facilitate retrieval of objects both from a single class as well as from multiple classes in a class hierarchy, we propose a multi-dimensional class-hierarchy index called the χ-tree. The χ-tree dynamically partitions the data space using both the class and indexed attribute dimensions by taking into account the semantics of the class dimension as well as access patterns of queries. Experimental results show that it is an efficient index. ICDE Partial Video Sequence Caching Scheme for VOD Systems with Heterogeneous Clients. Y. M. Chiu,K. H. Yeung 1997 Video on Demand is one of the key applications of the information era. A critical factor in its widespread adoption is the huge bandwidth required to transmit digitized video to a large group of clients with widely varying requirements. This paper addresses issues due to heterogeneous clients by proposing a program caching scheme called the Partial Video Sequence (PVS) Caching Scheme. The PVS Caching Scheme decomposes video sequences into a number of parts by using a scalable video compression algorithm. Video parts are selected to be cached in local video servers based on the amount of bandwidth that each part would demand from the distribution network and the central video server if it were kept only in the central video server. In this paper, we also show that the PVS Caching Scheme is suitable for handling vastly varying client requirements. ICDE NAOS Prototype - Version 2.2. Christine Collet,Thierry Coupaye,Luc Fayolle,Claudia Roncancio 1997 Summary form only given. The Native Active Object System (NAOS) incorporates an active behavior within the object-oriented database management system O2. NAOS rules are event-condition-action (ECA) rules belonging to an O2 database schema. The authors focus on user and temporal event detection as well as composite event detection. ICDE Adaptive Broadcast Protocols to Support Power Conservant Retrieval by Mobile Users. Anindya Datta,Aslihan Celik,Jeong G. Kim,Debra E. VanderMeer,Vijay Kumar 1997 Mobile computing has the potential for managing information globally. Data management issues in mobile computing have received some attention in recent times, and the design of adaptive broadcast protocols has been posed as an important problem. Such protocols are employed by database servers to decide on the content of broadcasts dynamically, in response to client mobility and demand patterns. In this paper we design such protocols and also propose efficient retrieval strategies that may be employed by clients to download information from broadcasts. The goal is to design cooperative strategies between server and client to provide access to information in such a way as to minimize energy expenditure by clients. We evaluate the performance of our protocols analytically. ICDE WOL: A Language for Database Transformations and Constraints. Susan B. Davidson,Anthony Kosky 1997 "The need to transform data between heterogeneous databases arises from a number of critical tasks in data management. These tasks are complicated by schema evolution in the underlying databases and by the presence of non-standard database constraints.
We describe a declarative language called WOL (Well-founded Object Logic) for specifying such transformations, and its implementation in a system called Morphase (an ""enzyme"" for morphing data). WOL is designed to allow transformations between the complex data structures which arise in object-oriented databases as well as in complex relational databases, and to allow for reasoning about the interactions between database transformations and constraints." ICDE Media Asset Management: Managing Complex Data as a Re-Engineering Exercise. Peter DeVries 1997 Building a media asset management application involves the storing, searching, and retrieving of complex data. How this data is managed can be viewed from two perspectives: in terms of the internal representation required to allow for high-speed searching and transferring of these items between systems, and also from the end-user perspective. This paper focuses on the perspective that media asset management is a re-engineering exercise whose fundamental goal is to eliminate the file system and its underlying classification model. The paper discusses the following topics: folder/directory classification schemes; file system security; cross platform file transfers; traditional searching techniques and constraints; the data characteristics of a media asset management solution; an architectural mapping of elements required to manage complex data, including source, proxy, and metadata; a media asset management approach to classification including business semantics; content based search algorithms; and the feasibility of a database replacing the file system. ICDE ROCK & ROLL: A Deductive Object-Oriented Database with Active and Spatial Extensions. Andrew Dinn,M. Howard Williams,Norman W. Paton 1997 ROCK & ROLL is a deductive object-oriented database system that supports two languages, one imperative and the other deductive, both derived from the same object-oriented data model. As the languages share a common type system, they can be integrated without manifesting impedance mismatches, and thus programmers can conveniently exploit both deductive and imperative features in a single application. The basic ROCK & ROLL system provides comprehensive modelling and programming facilities, but recent work has extended it with both active rules and spatial data types, thereby demonstrating how the core design is amenable to extensions in its behavioural and structural facilities. ICDE Object Relater Plus: A Practical Tool for Developing Enhanced Object Databases. Bryon K. Ehlmann,Gregory A. Riccardi 1997 Object Relater Plus is a practical tool currently being used for research and development of enhanced object databases (ODBs). The tool, which is a prototype Object Database Management System (ODBMS), provides two languages that are compatible with the ODMG-93 ODBMS standard yet enhance it in some significant ways. The Object Database Definition Language (ODDL) allows object relationships to be better defined and supported; provides for the specification and separation of external, conceptual, and internal views; and facilitates the implementation of domain specific ODB extensions. The Object Database Manipulation Language (ODML) augments ODDL by providing a C++ interface for database creation, access, and manipulation based on an ODDL specification. In this paper we give an overview of Object Relater Plus, emphasizing its salient features. We also briefly discuss its architecture and implementation and its use in developing scientific databases.
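The ODDL/ODML abstract above mentions automatic support for object relationships and a C++ manipulation interface but gives no syntax. As a purely illustrative sketch (all class names and methods are hypothetical, not taken from Object Relater Plus or the ODMG binding), the kind of bidirectional relationship bookkeeping such a layer takes off the programmer's hands can be written out in plain C++ roughly as follows:

```cpp
// Hypothetical sketch: keeping both sides of a one-to-many relationship
// consistent, the kind of maintenance a relationship-aware ODBMS would
// generate from the schema. All names are invented for illustration.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

class Employee;  // forward declaration

class Department {
public:
    explicit Department(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }
    const std::vector<Employee*>& members() const { return members_; }
private:
    friend class Employee;            // Employee maintains the inverse side
    std::string name_;
    std::vector<Employee*> members_;  // inverse traversal path of the relationship
};

class Employee {
public:
    Employee(std::string name, Department* dept) : name_(std::move(name)) {
        setDepartment(dept);
    }
    // Re-assigning the department updates both traversal paths.
    void setDepartment(Department* dept) {
        if (dept_ != nullptr) {
            auto& m = dept_->members_;
            m.erase(std::remove(m.begin(), m.end(), this), m.end());
        }
        dept_ = dept;
        if (dept_ != nullptr) dept_->members_.push_back(this);
    }
    const std::string& name() const { return name_; }
private:
    std::string name_;
    Department* dept_ = nullptr;
};

int main() {
    Department sales("Sales"), research("Research");
    Employee alice("Alice", &sales);
    alice.setDepartment(&research);  // both sides stay consistent
    std::cout << "Sales: " << sales.members().size()
              << ", Research: " << research.members().size() << "\n";  // 0 and 1
}
```

In a system with declared relationships, code of this kind would typically be generated from the schema definition rather than written and maintained by hand, which is the point of "allowing object relationships to be better defined and supported."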
ICDE System Design for Digital Media Asset Management. Pamela D. Fisher 1997 Client-server computing is only now able to deliver genuine gains to broadcasters, publishers, creative agencies and production facilities. These new users are entering the distributed computing domain just as media object technologies and commercial broadband services emerge from infancy. This paper discusses the complex process of decision-making and system design for digital media asset management. The Cinebase Digital Media Management System is described and used to illustrate critical points. The digital media server architecture must accommodate extremely large datasets, scaleable content retrieval and rapid network query response. Cinebase installations are currently in place containing 100s of Terabytes of media content, and 100s of local and remote users. Cinebase has recently been ported to ObjectStore by Object Design Inc. (ODI). The presentation discusses how, in combination, the Cinebase application and ODI extensions can be used to deliver a complete object management environment for production-quality content. Examples of the new workflows, and the technical issues complex networks raise for database architecture, are also discussed. ICDE Quantifying Complexity and Performance Gains of Distributed Caching in a Wireless Mobile Computing Environment. Cedric C. F. Fong,John C. S. Lui,Man Hon Wong 1997 "In a mobile computing system, the wireless communication bandwidth is a scarce resource that needs to be managed carefully. In this paper, we investigate the use of distributed caching as an approach to reduce the wireless bandwidth consumption for data access. We find that conventional caching techniques cannot fully utilize the dissemination feature of the wireless channel. We thus propose a novel distributed caching protocol that can minimize the overall system bandwidth consumption at the cost of CPU processing time at the server side. This protocol allows the server to select data items into a broadcast set, based on a performance gain parameter called the bandwidth gain, and then send the broadcast set to all the mobile computers within the server's cell. We show that in general, this selection process is NP-hard, and therefore we propose a heuristic algorithm that can attain a near-optimal performance. We also propose an analytical model for the protocol and derive closed-form performance measures, such as the bandwidth utilization and the expected response time of data access by mobile computers. Experiments show that our distributed caching protocol can greatly reduce the bandwidth consumption so that the wireless network environment can accommodate more users and, at the same time, vastly improve the expected response time for data access by mobile computers." ICDE Teaching an OLTP Database Kernel Advanced Data Warehousing Techniques. Clark D. French 1997 Most, if not all, of the major commercial database products available today were written more than 10 years ago. Their internal designs have always been heavily optimized for OLTP applications. Over the last couple of years as DSS and data warehousing have become more important, database companies have attempted to increase their performance with DSS-type applications. Most of their attempts have been in the form of added features like parallel table scans and simple bitmap indexing techniques. These were chosen because they could be quickly implemented (1-2 years), giving some level of increased query performance. 
The paper contends that the real performance gains for the DSS application have not yet been realized. The performance gains for DSS will not come from parallel table scans, but from major changes to the low level database storage management used by OLTP systems. One Sybase product, Sybase-IQ has pioneered some of these new techniques. The paper discusses a few of these techniques and how they could be integrated into an existing OLTP database kernel. ICDE FLORID: A Prototype for F-Logic. Jürgen Frohn,Rainer Himmeröder,Paul-Thomas Kandzia,Georg Lausen,Christian Schlepphorst 1997 FLORID - F-LOgic Reasoning In Databases - is a deductive object-oriented database system incorporating F-logic as data definition and query language and combining the advantages of deductive databases with the rich modelling capabilities of object oriented concepts. F-logic provides complex objects, uniform handling of data and metadata, rule-defined class hierarchy and signatures, non-monotonic multiple inheritance, equating of objects by rules and variables ranging over methods and classes. Moreover, FLORID extends F-logic by path expressions to facilitate object navigation. ICDE Interfacing Parallel Applications and Parallel Databases. Vibby Gottemukkala,Anant Jhingran,Sriram Padmanabhan 1997 The use of parallel database systems to deliver high performance has become quite common. Although queries submitted to these database systems are executed in parallel, the interaction between applications and current parallel database systems is serial. As the complexity of the applications and the amount of data they access increases, the need to parallelize applications also increases. In this parallel application environment, a serial interface to the database could become the bottleneck in the performance of the application. Hence, parallel database systems should support interfaces that allow the applications to interact with the database system in parallel. We present a taxonomy of such parallel interfaces, namely the Single Coordinator, Multiple Coordinator, Hybrid Parallel, and Pure Parallel interfaces. Furthermore, we discuss how each of these interfaces can be realized and in the process introduce new constructs that enable the implementation of the interfaces. We also qualitatively evaluate each of the interfaces with respect to their restrictiveness and performance impact. ICDE Semantic Query Optimization for Object Databases. John Grant,Jarek Gryz,Jack Minker,Louiqa Raschid 1997 Semantic Query Optimization for Object Databases. ICDE Distributing Semantic Constraints Between Heterogeneous Databases. Stefan Grufman,Fredrik Samson,Suzanne M. Embury,Peter M. D. Gray,Tore Risch 1997 In recent years, research on distributing databases over networks has become increasingly important. In this paper, we concentrate on the issues of the interoperability of heterogeneous DBMSs and enforcing integrity across a multi-database made in this fashion. This has been done through a cooperative project between Aberdeen and Linköping universities, with database modules distributed between the sites. In the process, we have shown the advantage of using DBMSs based on variants of the functional data model (FDM), which has made it remarkably straightforward to interoperate queries and schema definitions.
Further, we have used the constraint transformation facilities of P/FDM (Prolog implementation of FDM) to compile global constraints into active rules installed locally on one or more AMOS (Active Mediators Object System) servers. We present the theory behind this, and the conditions for it to improve performance. ICDE Index Selection for OLAP. Himanshu Gupta,Venky Harinarayan,Anand Rajaraman,Jeffrey D. Ullman 1997 "On-line analytical processing (OLAP) is a recent and important application of database systems. Typically, OLAP data is presented as a multidimensional ""data cube."" OLAP queries are complex and can take many hours or even days to run, if executed directly on the raw data. The most common method of reducing execution time is to precompute some of the queries into summary tables (subcubes of the data cube) and then to build indexes on these summary tables. In most commercial OLAP systems today, the summary tables that are to be precomputed are picked first, followed by the selection of the appropriate indexes on them. A trial-and-error approach is used to divide the space available between the summary tables and the indexes. This two-step process can perform very poorly. Since both summary tables and indexes consume the same resource (space), their selection should be done together for the most efficient use of space. The authors give algorithms that automate the selection of summary tables and indexes. In particular, they present a family of algorithms of increasing time complexities, and prove strong performance bounds for them. The algorithms with higher complexities have better performance bounds. However, the increase in the performance bound is diminishing, and they show that an algorithm of moderate complexity can perform fairly close to the optimal." ICDE Oracle Parallel Warehouse Server. Gary Hallmark 1997 "Oracle is the leading supplier of data warehouse servers, yet little has been published about Oracle's parallel warehouse architecture. After a brief review of Oracle's market, performance, and platform strengths, we present two novel features of the Oracle parallel database architecture. First, the data flow model achieves scalability while using a fixed number of threads that is independent of the complexity of the query plan. Second, a new ""load shipping"" architecture combines the best aspects of data shipping and function shipping, and runs on shared everything, shared disk, and shared nothing hardware." ICDE Improving the Quality of Technical Data for Developing Case Based Reasoning Diagnostic Software for Aircraft Maintenance. Richard Heider 1997 Summary form only given. Time spent by airline maintenance operators to solve engine failures and the related costs (flight delays or cancellations) are a major concern to SNECMA, which manufactures engines for civilian aircraft such as BOEING 737s and Airbus A340s. The use of intelligent diagnostic software contributes to improving customer support and reduces the cost of ownership by improving troubleshooting accuracy and reducing airplane downtime. However, classical rule based or model based expert systems are costly to develop and maintain. Our goal has been to improve the development of troubleshooting systems through case based reasoning (CBR) and data mining. These technologies reason from past cases, whose solution is known, rather than rules.
New problems are solved by searching for similar problem solving experiences and by adapting the solutions that worked in the past. Our second objective was to acquire the capacity to produce systems which match the quality standard in the aeronautic industry in the given time frame. We aim at assuring both the quality of the core data mining and CBR software and the quality of the technical information that is fed into the system (case knowledge). ICDE The CORD Approach to Extensible Concurrency Control. George T. Heineman,Gail E. Kaiser 1997 Database management systems (DBMSs) have been increasingly used for advanced application domains, such as software development environments, workflow management systems, computer-aided design and manufacturing, and managed healthcare. In these domains, the standard correctness model of serializability is often too restrictive. We introduce the notion of a Concurrency Control Language (CCL) that allows a database application designer to specify concurrency control policies to tailor the behavior of a transaction manager. A well-crafted set of policies defines an extended transaction model. The necessary semantic information required by the CCL run-time engine is extracted from a task manager, a (logical) module by definition included in all advanced applications. This module stores task models that encode the semantic information about the transactions submitted to the DBMS. We have designed a rule-based CCL, called CORD, and have implemented a run-time engine that can be hooked to a conventional transaction manager to implement the sophisticated concurrency control required by advanced database applications. We present an architecture for systems based on CORD and describe how we integrated the CORD engine with the Exodus Storage Manager to implement Altruistic Locking. ICDE Integrated Query Processing Strategies for Spatial Path Queries. Yun-Wu Huang,Ning Jing,Elke A. Rundensteiner 1997 We investigate optimization strategies for processing path queries with embedded spatial constraints, such as avoiding areas with certain characteristics. To resolve complex spatial constraints during path finding, we consider two decisions: (1) the spatial relation operations (e.g., intersect) between areas and links can be preprocessed or intermixed with path finding and (2) areas satisfying the query constraint can be prefiltered or dynamically selected during path finding. Based on these two decisions, we propose and implement the resulting four integrated query processing strategies, utilizing state-of-the-art technologies such as spatial joins for intersect computation, R-tree access structure for spatial overlap search, and spatial clustering for efficient path search. In this paper, we also report an experimental evaluation to show which strategies perform best in different scenarios. ICDE Scalable Versioning in Distributed Databases with Commuting Updates. H. V. Jagadish,Inderpal Singh Mumick,Michael Rabinovich 1997 "We present a multiversioning scheme for a distributed system with the workload consisting of read-only transactions and update transactions, (most of) which commute on individual nodes. The scheme introduces a version advancement protocol that is completely asynchronous with user transactions, thus allowing the system to scale to very high transaction rates and frequent version advancements. Moreover, the scheme never creates more than three copies of a data item.
Combined with existing techniques to avoid global concurrency control for commuting transactions that execute in a particular version, our multiversioning scheme results in a protocol where no user transaction on a node can be delayed by any activity (either version advancement or another transaction) occurring on another node. Non-commuting transactions are gracefully handled. Our technique is of particular value to distributed recording systems where guaranteeing global serializability is often desirable, but rarely used because of the high performance cost of running distributed transactions. Examples include calls on a telephone network, inventory management in a ""point-of-sale'' system, operations monitoring systems in automated factories, and medical information management systems." ICDE A Persistent Hyper-Programming System. Graham N. C. Kirby,Ronald Morrison,David S. Munro,Richard C. H. Connor,Quintin I. Cutts 1997 We demonstrate the use of a hyper-programming system in building persistent applications. This allows program representations to contain type-safe links to persistent objects embedded directly within the source code. The benefits include improved efficiency and potential for static program checking, reduced programming effort and the ability to display meaningful source-level representations for first-class procedure values. Hyper-programming represents a completely new style of programming which is only possible in a persistent programming system. ICDE A Priority Ceiling Protocol with Dynamic Adjustment of Serialization Order. Kwok-Wa Lam,Sang Hyuk Son,Sheung-lun Hung 1997 The difficulties of providing a guarantee of meeting transaction deadlines in hard real-time database systems lie in the problems of priority inversion and of deadlocks. Priority inversion and deadlock problems ensue when concurrency control protocols are adapted in priority-driven scheduling. The blocking delay due to priority inversion can be unbounded, which is unacceptable in the mission-critical real-time applications. Some priority ceiling protocols have been proposed to tackle these two problems. However, they are too conservative in scheduling transactions for the single-blocking and deadlock-free properties, leading to many unnecessary transaction blockings. In this paper, we analyze the unnecessary transaction blocking problem inherent in these priority ceiling protocols and investigate the conditions for allowing a higher priority transaction to preempt a lower priority transaction using the notion of dynamic adjustment of serialization order. A new priority ceiling protocol is proposed to solve the unnecessary blocking problem, thus enhancing schedulability. We also devise the worst-case schedulability analysis for the new protocol which provides a better schedulability condition than other protocols. ICDE Physical Database Design for Data Warehouses. Wilburt Labio,Dallan Quass,Brad Adelberg 1997 Data warehouses collect copies of information from remote sources into a single database. Since the remote data is cached at the warehouse, it appears as local relations to the users of the warehouse. To improve query response time, the warehouse administrator will often materialize views defined on the local relations to support common or complicated queries. Unfortunately, the requirement to keep the views consistent with the local relations creates additional overhead when the remote sources change. 
The warehouse is often kept only loosely consistent with the sources: it is periodically refreshed with changes sent from the source. When this happens, the warehouse is taken off-line until the local relations and materialized views can be updated. Clearly, the users would prefer as little down time as possible. Often the down time can be reduced by adding carefully selected materialized views or indexes to the physical schema. This paper studies how to select the sets of supporting views and of indexes to materialize to minimize the down time. We call this the view index selection (VIS) problem. We present an A* search based solution to the problem as well as rules of thumb. We also perform additional experiments to understand the space-time tradeoff as it applies to data warehouses. ICDE Modeling Business Rules with Situation/Activation Diagrams. Peter Lang,Werner Obermair,Michael Schrefl 1997 "Business rules are statements about business policies and can be formulated according to the event-condition-action structure of rules in active database systems. However, modeling business rules at the conceptual level from an external user's perspective requires a different set of concepts than currently provided by active database systems. This paper identifies requirements on the event language and on the semantics of rule execution for modeling business rules and presents a graphical object-oriented language, called Situation/Activation diagrams, meeting these requirements." ICDE W3QS - A System for WWW Querying. David Konopnicki,Oded Shmueli 1997 "W3QL is an SQL-like, high-level language for accessing World-Wide Web (WWW) resident data and services. W3QL is declarative. A W3QL query specifies a graph to be matched with portions of the WWW (graph nodes corresponding to WWW pages, edges to hypertext links). A query can specify complex conditions on nodes' contents and their relationships. A W3QL query may use existing search services (e.g. AltaVista). W3QL is extensible as users may use their own data analysis tools (e.g. image analysis). W3QS is a system that manages W3QL queries. W3QS is accessible via the WWW or by using a programming based interface (API). On the WWW, W3QS provides several interfaces: intuitive graphic interfaces, templates of frequently posed queries, and direct programming." ICDE A Propagation Mechanism for Populated Schema Versions. Sven-Eric Lautemann 1997 Object-oriented database systems (OODBMS) offer powerful modeling concepts as required by advanced application domains like CAD/CAM/CAE or office automation. Typical applications have to handle large and complex structured objects which frequently change their value and their structure. As the structure is described in the schema of the database, support for schema evolution is a highly required feature. Therefore, a set of schema update primitives must be provided which can be used to perform the required changes, even in the presence of populated databases and running applications. In this paper, we use the versioning approach to schema evolution to support schema updates as a complex design task. The presented propagation mechanism is based on conversion functions that map objects between different types and can be used to support schema evolution and schema integration. ICDE Clustering Association Rules. Brian Lent,Arun N. Swami,Jennifer Widom 1997 The authors consider the problem of clustering two-dimensional association rules in large databases.
They present a geometric-based algorithm, BitOp, for performing the clustering, embedded within an association rule clustering system, ARCS. Association rule clustering is useful when the user desires to segment the data. They measure the quality of the segmentation generated by ARCS using the minimum description length (MDL) principle of encoding the clusters on several databases including noise and errors. Scale-up experiments show that ARCS, using the BitOp algorithm, scales linearly with the amount of data. ICDE Buffer and I/O Resource Pre-allocation for Implementing Batching and Buffering Techniques for Video-on-Demand Systems. M. Y. Y. Leung,John C. S. Lui,Leana Golubchik 1997 To design a cost effective VOD server, it is important to carefully manage the system resources so that the number of concurrent viewers can be maximized. Previous research results use data sharing techniques, such as batching, buffering and piggybacking, to reduce the demand for I/O resources in a VOD system. However, these techniques still suffer from the problem that additional I/O resources are needed in the system for providing VCR functionality; without careful resource management, the benefits of these data sharing techniques can be lost. In this paper, we first introduce a model for determining the amount of resources required for supporting both normal playback and VCR functionality to satisfy predefined performance characteristics. Consequently, this model allows us to maximize the benefits of data sharing techniques. Furthermore, one important application of this model is its use in making system sizing decisions. Proper system sizing will result in a more cost-effective VOD system. ICDE STR: A Simple and Efficient Algorithm for R-Tree Packing. Scott T. Leutenegger,J. M. Edgington,Mario A. Lopez 1997 In this paper we present the results from an extensive comparison study of three R-tree packing algorithms, including a new easy to implement algorithm. The algorithms are evaluated using both synthetic and actual data from various application domains including VLSI design, GIS (tiger), and computational fluid dynamics. Our studies also consider the impact that various degrees of buffering have on query performance. Experimental results indicate that none of the algorithms is best for all types of data. In general, our new algorithm requires up to 50% fewer disk accesses than the best previously proposed algorithm for point and region queries on uniformly distributed or mildly skewed point and region data, and approximately the same for highly skewed point and region data. ICDE Delegation: Efficiently Rewriting History. Cris Pedregal Martin,Krithi Ramamritham 1997 Transaction delegation, as introduced in ACTA, allows a transaction to transfer responsibility for the operations that it has performed on an object to another transaction. Delegation can be used to broaden the visibility of the delegatee, and to tailor the recovery properties of a transaction model. Delegation has been shown to be useful in synthesizing advanced transaction models. With an efficient implementation of delegation it becomes practicable to realize various advanced transaction models whose requirements are specified in a high-level language, instead of the current expensive practice of building them from scratch. The authors identify the issues in efficiently supporting delegation and hence advanced transaction models, and illustrate this with their solution in ARIES, an industrial-quality system that uses UNDO/REDO recovery.
Since delegation is tantamount to rewriting history, a naive implementation can entail frequent, costly log accesses, and can result in complicated recovery protocols. The algorithm achieves the effect of rewriting history without rewriting the log, resulting in an implementation that realizes the semantics of delegation at minimal additional overhead and incurs no overhead when delegation is not used. The work indicates that it is feasible to build efficient and robust, general-purpose machinery for advanced transaction models. It is also a step towards making recovery a first-class concept within advanced transaction models. ICDE The Multikey Type Index for Persistent Object Sets. Thomas A. Mück,Martin L. Polaschek 1997 Multikey index structures for type hierarchies are a recently discussed alternative to traditional B+-tree indexing schemes. We describe an efficient implementation of this alternative called the multikey type index (MT-index). A prerequisite for our approach is an optimal linearization of the type hierarchy that allows us to map queries in object type hierarchies to minimal-volume range queries in multi-attribute search structures. This provides access to an already-existing large and versatile tool-box. The outline of an index implementation by means of a multi-attribute search structure (e.g. the hB-tree or any other structure with comparable performance) is followed by an analytical performance evaluation. Selected performance figures are compared to previous approaches, in particular to the H-tree and the class hierarchy tree. The comparison results allow for practically relevant conclusions with respect to index selection based on query profiles. ICDE Relational Joins for Data on Tertiary Storage. Jussi Myllymaki,Miron Livny 1997 Despite the steady decrease in secondary storage prices, the data storage requirements of many organizations cannot be met economically using secondary storage alone. Tertiary storage offers a lower-cost alternative but is viewed as a second-class citizen in many systems. For instance, the typical solution in bringing tertiary-resident data under the control of a DBMS is to use operating system facilities to copy the data to secondary storage, and then to perform query optimization and execution as if the data had been in secondary storage all along. This approach fails to recognize the opportunities for saving execution time and storage space if the data were accessed directly on tertiary devices and in parallel with other I/Os. In this paper we explore how to join two DBMS relations stored on magnetic tapes. Both relations are assumed to be larger than available disk space. We show how Grace Hash Join can be modified to handle a range of tape relation sizes. The modified algorithms access data directly on tapes and exploit parallelism between disk and tape I/Os. We also provide performance results of an experimental implementation of the algorithms. ICDE Active Customization of GIS User Interfaces. Juliano Lopes de Oliveira,Claudia Bauzer Medeiros,Mariano Cilia 1997 This paper presents a new approach to user interface customization in Geographic Information Systems (GIS). This approach is based on the integration of three main components: a GIS user interface architecture; an active database mechanism; and a generic interface builder. The GIS interface architecture provides the default interface behavior, while the active system allows customization of interfaces according to the specific context.
The generic interface builder relies on a library of interface objects to dynamically construct generic and customized interfaces. The main advantage of this approach is that it decreases the costs associated with developing customized GIS interfaces. ICDE Representative Objects: Concise Representations of Semistructured, Hierarchical Data. Svetlozar Nestorov,Jeffrey D. Ullman,Janet L. Wiener,Sudarshan S. Chawathe 1997 Introduces the concept of representative objects, which uncover the inherent schema(s) in semi-structured, hierarchical data sources and provide a concise description of the structure of the data. Semi-structured data, unlike data stored in typical relational or object-oriented databases, does not have a fixed schema that is known in advance and stored separately from the data. With the rapid growth of the World Wide Web, semi-structured hierarchical data sources are becoming widely available to the casual user. The lack of external schema information currently makes browsing and querying these data sources inefficient at best, and impossible at worst. We show how representative objects make schema discovery efficient and facilitate the generation of meaningful queries over the data. ICDE "Databases and the Web: What's in it for Databases? (Panel)." Erich J. Neuhold,Karl Aberer 1997 "Databases and the Web: What's in it for Databases? (Panel)." ICDE Periodic Retrieval of Videos from Disk Arrays. Banu Özden,Rajeev Rastogi,Abraham Silberschatz 1997 A growing number of applications need access to video data stored in digital form on secondary storage devices (e.g., video-on-demand, multimedia messaging). As a result, video servers that are responsible for the storage and retrieval, at fixed rates, of hundreds of videos from disks are becoming increasingly important. Since video data tends to be voluminous, several disks are usually used in order to store the videos. A challenge is to devise schemes for the storage and retrieval of videos that distribute the workload evenly across disks, reduce the cost of the server and at the same time, provide good response times to client requests for video data. In this paper, we present schemes that retrieve videos periodically from disks in order to provide better response times to client requests. We present two schemes that stripe videos across multiple disks in order to distribute the workload uniformly among them. For the two striping schemes, we show that the problem of retrieving videos periodically is equivalent to that of scheduling periodic tasks on a multiprocessor. For the multiprocessor scheduling problems, we present and compare schemes for computing start times for the tasks, if it is determined that they are schedulable. ICDE Adding Full Text Indexing to the Operating System. Kyle Peltonen 1997 "Many challenges must be faced when incorporating full text retrieval into the operating system. The search engine must be a nearly invisible, natural extension to the operating system, just like the file system and the network. The search engine must meet user expectations of an operating system, specifically in areas such as performance, fault tolerance, and security. It must handle a very heterogeneous collection of documents, in many formats, many languages and many styles. The search engine must scale with the operating system, from small laptop computers to large multiprocessor servers. The paper is an overview of the challenges faced when incorporating full text indexing into the Microsoft Windows NT operating system.
Specific solutions used by the Microsoft 'Tripoli' search engine are offered." ICDE A Rule Engine for Query Transformation in Starburst and IBM DB2 C/S DBMS. Hamid Pirahesh,T. Y. Cliff Leung,Waqar Hasan 1997 The complexity of queries in relational DBMSs is increasing, particularly in the decision support area and interactive client-server environments. This calls for a more powerful and flexible optimization of complex queries. H. Pirahesh et al. (1992) introduced query rewrite as a distinct query optimization phase mainly targeted at responding to this requirement. This approach has enabled us to extensively enrich the optimization rules in our system. Further, it has made it easier to incrementally enrich and adapt the system as need arises. Examples of such query optimizations are predicate pushdown, subquery and magic sets transformations, and subquery decorrelation. We describe the design and implementation of a rule engine for query rewrite optimization. Each transformation is implemented as a rule which consists of a pair of rule condition and action. Rules can be grouped into rule classes for higher efficiency, better understandability and more extensibility. The rule engine has a number of novelties in that it supports a full spectrum of control, from totally data driven to totally procedural. Furthermore, it incorporates a budget control scheme for controlling the resources taken for query optimization as well as guaranteeing the termination of rule execution. The rule engine and a suite of query rewrite rules have been implemented in the Starburst relational DBMS prototype and a significant portion of this technology has been integrated into the IBM DB2 Common Server relational DBMS. ICDE High-Dimensional Similarity Joins. Kyuseok Shim,Ramakrishnan Srikant,Rakesh Agrawal 1997 Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε-tree and the R-tree family, and show that the ε-tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε-tree is twice to an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε-tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε-tree. ICDE A Cost-Model-Based Online Method for Distributed Caching. Markus Sinnwell,Gerhard Weikum 1997 "The paper presents a method for distributed caching to exploit the aggregate memory of networks of workstations in data-intensive applications. In contrast to prior work, the approach is based on a detailed cost model as the basis for optimizing the placement of variable-size data objects in a distributed, possibly heterogeneous two-level storage hierarchy.
To address the online problem with a priori unknown and evolving workload parameters, the method employs dynamic load tracking procedures and an approximative, low-overhead version of the cost model for continuous reoptimization steps that are embedded in the decisions of the underlying local cache managers. The method is able to automatically find a good tradeoff between an ""egoistic"" and an ""altruistic"" behavior of the network nodes, and proves its practical viability in a detailed simulation study under a variety of workload and system configurations." ICDE Modeling and Querying Moving Objects. A. Prasad Sistla,Ouri Wolfson,Sam Chamberlain,Son Dao 1997 In this paper we propose a data model for representing moving objects in database systems. It is called the Moving Objects Spatio-Temporal (MOST) data model. We also propose Future Temporal Logic (FTL) as the query language for the MOST model, and devise an algorithm for processing FTL queries in MOST. ICDE Similarity Based Retrieval of Videos. A. Prasad Sistla,Clement T. Yu,R. Venkatasubrahmanian 1997 Similarity Based Retrieval of Videos. ICDE Designing the Reengineering Services for the DOK Federated Database System. Zahir Tari,John Stokes 1997 Addresses the design of the reengineering service for the DOK (Distributed Object Kernel) federated database. This service allows the hiding of the heterogeneity of databases involved in a federation by generating object-oriented representations from their corresponding schemata. We propose a complete methodology that supports the identification and the translation of both the explicit and implicit information. The identification of object-oriented constructs is performed by classifying a relational schema into different categories of relations, namely base, dependent and composite relations. The main difficulty in designing the reengineering service lies in the distinction between the different types of relationships amongst classes. Our approach deals with this problem by analysing relations according to two types of correlation: (i) the degree of correlation between the external and primary keys, and (ii) the degree of correlation between sets of tuples in the relations. Examining these correlations uncovers implicit relationships contained as well-hidden classes in a relational schema. ICDE Graphical Tools for Rule Development in the Active DBMS SAMOS. Anca Vaduva,Stella Gatziu,Klaus R. Dittrich 1997 "Summary form only given. Active database management systems (active DBMS) support the definition, management and execution of event/condition/action rules specifying reactive application behavior. Although the advantages of active mechanisms are nowadays well known, there is still no wide use in practice. One main problem is that especially for large rule sets, defined by different persons at different points in time, potential conflicts and dependencies between rules are hard to predict and rule behavior is difficult to control. Therefore, tools are needed to assist the development and maintenance of rule bases. These tools should provide for graphical interfaces supporting both ""static"" activities (performed during rule specification) such as rule editing, browsing, design, rule analysis, and ""dynamic"" activities (performed at runtime, during the execution of an application) such as testing, debugging and understanding of rule behavior.
The aim of the article is to show the use of three of these tools, namely the rule editor, the browser and the termination analyzer in the process of developing applications for the active object oriented DBMS SAMOS." ICDE Memory Management for Scalable Web Data Servers. Shivakumar Venkataraman,Miron Livny,Jeffrey F. Naughton 1997 Popular web sites are already experiencing very heavy loads, and these loads will only increase as the number of users accessing them grows. These loads create both CPU and I/O bottlenecks. One promising solution already being employed to eliminate the CPU bottleneck is to replace a single processor server with a cluster of servers. Our goal in this paper is to develop buffer management algorithms that exploit the aggregate memory capacity of the machines in such a server cluster to attack the I/O bottleneck. The key challenge in designing such buffer management algorithms turns out to be controlling data replication so as to achieve a good balance between intra-cluster network traffic and disk I/O. At one extreme, the straightforward application of client-server memory management techniques to this cluster architecture causes duplication in memory among the servers and this tends to reduce network traffic but increases disk I/O, whereas at the other extreme, eliminating all duplicates tends to increase network traffic while reducing disk I/O. Accordingly, we present a new algorithm, Hybrid, that dynamically controls the amount of duplication. Through a detailed simulation, we show that on workloads characteristic of those experienced by Web servers, the Hybrid algorithm correctly trades off intra-cluster network traffic and disk I/O to minimize average response time. ICDE Data Integration and Interrogation. J. Verso 1997 "Summary form only given. One major concern of the Verso group at Inria is the development of technology for data integration and interrogation, especially for non traditional data formats such as structured text. The article describes aspects of Verso's technology as partly sponsored by the European Community (AQUARELLE project, Esprit IV projects OPAL and WIRE)." ICDE The WHIPS Prototype for Data Warehouse Creation and Maintenance. Janet L. Wiener,Himanshu Gupta,Wilburt Labio,Yue Zhuge,Hector Garcia-Molina 1997 Summary form only given. The goal of the Whips project (WareHousing Information Project at Stanford) is to develop algorithms and tools for the creation and maintenance of a data warehouse (J. Wiener et al., 1996). In particular, we have developed an architecture and implemented a prototype for identifying data changes at distributed heterogeneous sources, transforming them and summarizing them in accordance with warehouse specifications, and incrementally integrating them into the warehouse. In effect, the warehouse stores materialized views of the source data. The Whips architecture is designed specifically to fulfil several important and interrelated goals: sources and warehouse views can be added and removed dynamically; it is scalable by adding more internal modules; changes at the sources are detected automatically; the warehouse may be updated continuously as the sources change, without requiring down time; and the warehouse is always kept consistent with the source data by the integration algorithms. The Whips system is composed of many distinct modules that potentially reside on different machines. Each module is implemented as a CORBA object. 
They communicate with each other using ILU, a CORBA-compliant object library developed by Xerox PARC. ICDE A Data Model and Semantics of Objects with Dynamic Roles. Raymond K. Wong,H. Lewis Chau,Frederick H. Lochovsky 1997 Although the concept of roles is becoming a popular research issue in object-oriented databases and has been proven to be useful for dynamic and evolving applications, it has only been described conceptually in most of the previous work. Moreover, important issues such as the semantics of roles (e.g., message passing) are seldom discussed. Furthermore, none of the previous work has investigated the idea of role player qualification, which models the fact that not every object is qualified to play a particular role. In this paper, we present a data model and the semantics of roles. We discuss each of the above issues and illustrate the ideas with examples. From these examples, we can easily see that the problems we discussed are fundamental and indeed exist in many complex applications. ICDE Supporting Fine-grained Data Lineage in a Database Visualization Environment. Allison Woodruff,Michael Stonebraker 1997 The lineage of a datum records its processing history. Because such information can be used to trace the source of anomalies and errors in processed data sets, it is valuable to users for a variety of applications, including the investigation of anomalies and debugging. Traditional data lineage approaches rely on metadata. However, metadata does not scale well to fine-grained lineage, especially in large data sets. For example, it is not feasible to store all of the information that is necessary to trace from a specific floating-point value in a processed data set to a particular satellite image pixel in a source data set. In this paper, we propose a novel method to support fine-grained data lineage. Rather than relying on metadata, our approach lazily computes the lineage using a limited amount of information about the processing operators and the base data. We introduce the notions of weak inversion and verification. While our system does not perfectly invert the data, it uses weak inversion and verification to provide a number of guarantees about the lineage it generates. We propose a design for the implementation of weak inversion and verification in an object-relational database management system. ICDE Selectivity Estimation in the Presence of Alphanumeric Correlations. Min Wang,Jeffrey Scott Vitter,Balakrishna R. Iyer 1997 Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P, we need to estimate the fraction of records in the database that satisfy P. Almost all previous work dealt with the estimation of numeric selectivity, i.e., the query contains only numeric variables. The general problem of estimating alphanumeric selectivity is much more difficult and has attracted attention only very recently, and the focus has been on the special case when only one column is involved. In this paper, we consider the more general case when there are two correlated alphanumeric columns. We develop efficient algorithms to build storage structures that can fit in a database catalog. Results from our extensive experiments to test our algorithms, on the basis of error analysis and space requirements, are given to guide DBMS implementors. ICDE Content is King, (If You Can Find It): A New Model for Knowledge Storage and Retrieval. Fred L.
Wurden 1997 The technology for acquiring and storing vast amounts of complex data is accelerating at a much faster rate than the technology for retrieving and analyzing that data. While progress has been made with OODB, OLAP, and knowledge discovery (KD) systems, users of these systems are still required to know and supply missing semantic information. When dealing with complex real-world representations, this is often nearly impossible to do. We discuss a new model that provides significant improvements in storing, correlating, and navigating information. We first provide a brief background looking at other relevant knowledge representation approaches, then describe our patented Contiguous Connection Model. Finally we discuss the impact this technology has had on a large, high-value, digital media knowledge base. ICDE Multiple View Consistency for Data Warehousing. Yue Zhuge,Hector Garcia-Molina,Janet L. Wiener 1997 A data warehouse stores integrated information from multiple distributed data sources. In effect, the warehouse stores materialized views over the source data. The problem of ensuring data consistency at the warehouse can be divided into two components: ensuring that each view reflects a consistent state of the base data, and ensuring that multiple views are mutually consistent. In this paper we study the latter problem, that of guaranteeing multiple view consistency (MVC). We identify and define formally three layers of consistency for materialized views in a distributed environment. We present a scalable architecture for consistently handling multiple views in a data warehouse, which we have implemented in the WHIPS(WareHousing Information Project at Stanford) prototype. Finally, we develop simple, scalable, algorithms for achieving MVC at a warehouse. SIGMOD Conference MDM: a Multiple-Data-Model Tool for the Management of Heterogeneous Database Schemes. Paolo Atzeni,Riccardo Torlone 1997 MDM is a tool that enables the users to define schemes of different data models and to perform translations of schemes from one model to another. These functionalities can be at the basis of a customizable and integrated CASE environment supporting the analysis and design of information systems. MDM has two main components: the Model Manager and the Schema Manager. The Model Manager supports a specialized user, the model engineer, in the definition of a variety of models, on the basis of a limited set of metaconstructs covering almost all known conceptual models. The Schema Manager allows designers to create and modify schemes over the defined models, and to generate at each time a translation of a scheme into any of the data models currently available. Translations between models are automatically derived, at definition time, by combining a predefined set of elementary transformations, which implement the standard translations between simple combinations of constructs. SIGMOD Conference The STRIP Rule System For Efficiently Maintaining Derived Data. Brad Adelberg,Hector Garcia-Molina,Jennifer Widom 1997 "Derived data is maintained in a database system to correlate and summarize base data which records real world facts. As base data changes, derived data needs to be recomputed. This is often implemented by writing active rules that are triggered by changes to base data. In a system with rapidly changing base data, a database with a standard rule system may consume most of its resources running rules to recompute data. 
This paper presents the rule system implemented as part of the STandard Real-time Information Processor (STRIP). The STRIP rule system is an extension of SQL3-type rules that allows groups of rule actions to be batched together to reduce the total recomputation load on the system. In this paper we describe the syntax and semantics of the STRIP rule system, present an example set of rules to maintain stock index and theoretical option prices in a program trading application, and report the results of experiments performed on the running system. The experiments verify that STRIP's rules allow much more efficient derived data maintenance than conventional rules without batching." SIGMOD Conference Balancing Push and Pull for Data Broadcast. Swarup Acharya,Michael J. Franklin,Stanley B. Zdonik 1997 The increasing ability to interconnect computers through internet-working, wireless networks, high-bandwidth satellite, and cable networks has spawned a new class of information-centered applications based on data dissemination. These applications employ broadcast to deliver data to very large client populations. We have proposed the Broadcast Disks paradigm [Zdon94, Acha95b] for organizing the contents of a data broadcast program and for managing client resources in response to such a program. Our previous work on Broadcast Disks focused exclusively on the “push-based” approach, where data is sent out on the broadcast channel according to a periodic schedule, in anticipation of client requests. In this paper, we study how to augment the push-only model with a “pull-based” approach of using a backchannel to allow clients to send explicit requests for data to the server. We analyze the scalability and performance of a broadcast-based system that integrates push and pull and study the impact of this integration on both the steady state and warm-up performance of clients. Our results show that a client backchannel can provide significant performance improvement in the broadcast environment, but that unconstrained use of the backchannel can result in scalability problems due to server saturation. We propose and investigate a set of three techniques that can delay the onset of saturation and thus, enhance the performance and scalability of the system. SIGMOD Conference Efficient View Maintenance at Data Warehouses. Divyakant Agrawal,Amr El Abbadi,Ambuj K. Singh,Tolga Yurek 1997 We present incremental view maintenance algorithms for a data warehouse derived from multiple distributed autonomous data sources. We begin with a detailed framework for analyzing view maintenance algorithms for multiple data sources with concurrent updates. Earlier approaches for view maintenance in the presence of concurrent updates typically require two types of messages: one to compute the view change due to the initial update and the other to compensate the view change due to interfering concurrent updates. The algorithms developed in this paper instead perform the compensation locally by using the information that is already available at the data warehouse. The first algorithm, termed SWEEP, ensures complete consistency of the view at the data warehouse in the presence of concurrent updates. Previous algorithms for incremental view maintenance either required a quiescent state at the data warehouse or required an exponential number of messages in terms of the data sources. 
In contrast, this algorithm does not require that the data warehouse be in a quiescent state for incorporating the new views and also the message complexity is linear in the number of data sources. The second algorithm, termed Nested SWEEP, attempts to compute a composite view change for multiple updates that occur concurrently while maintaining strong consistency. SIGMOD Conference Fast Parallel Similarity Search in Multimedia Databases. Stefan Berchtold,Christian Böhm,Bernhard Braunmüller,Daniel A. Keim,Hans-Peter Kriegel 1997 Most similarity search techniques map the data objects into some high-dimensional feature space. The similarity search then corresponds to a nearest-neighbor search in the feature space which is computationally very intensive. In this paper, we present a new parallel method for fast nearest-neighbor search in high-dimensional feature spaces. The core problem of designing a parallel nearest-neighbor algorithm is to find an adequate distribution of the data onto the disks. Unfortunately, the known declustering methods do not perform well for high-dimensional nearest-neighbor search. In contrast, our method has been optimized based on the special properties of high-dimensional spaces and therefore provides a near-optimal distribution of the data items among the disks. The basic idea of our data declustering technique is to assign the buckets corresponding to different quadrants of the data space to different disks. We show that our technique - in contrast to other declustering methods - guarantees that all buckets corresponding to neighboring quadrants are assigned to different disks. We evaluate our method using large amounts of real data (up to 40 MBytes) and compare it with the best known data declustering method, the Hilbert curve. Our experiments show that our method provides an almost linear speed-up and a constant scale-up. Additionally, it outperforms the Hilbert approach by a factor of up to 5. SIGMOD Conference S3: Similarity Search in CAD Database Systems. Stefan Berchtold,Hans-Peter Kriegel 1997 S3 is the prototype of a database system supporting the management and similarity retrieval of industrial CAD parts. The major goal of the system is to reduce the cost for developing and producing new parts by maximizing the reuse of existing parts. S3 supports the following three types of similarity queries: query by example (of an existing part in the database), query by sketch and thematic similarity query. S3 is an object-oriented system offering an adequate graphical user interface. On top of providing various state-of-the-art algorithms and index structures for geometry-based similarity retrieval, it is an excellent testbed for developing and testing new similarity algorithms and index structures. SIGMOD Conference High-Performance Sorting on Networks of Workstations. Andrea C. Arpaci-Dusseau,Remzi H. Arpaci-Dusseau,David E. Culler,Joseph M. Hellerstein,David A. Patterson 1997 We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive with sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. 
We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus. SIGMOD Conference InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments (Experience Paper). Roberto J. Bayardo Jr.,William Bohrer,Richard S. Brice,Andrzej Cichocki,Jerry Fowler,Abdelsalam Helal,Vipul Kashyap,Tomasz Ksiezyk,Gale Martin,Marian H. Nodine,Mosfeq Rashid,Marek Rusinkiewicz,Ray Shea,C. Unnikrishnan,Amy Unruh,Darrell Woelk 1997 InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments (Experience Paper). SIGMOD Conference The InfoSleuth Project. Roberto J. Bayardo Jr.,William Bohrer,Richard S. Brice,Andrzej Cichocki,Jerry Fowler,Abdelsalam Helal,Vipul Kashyap,Tomasz Ksiezyk,Gale Martin,Marian H. Nodine,Mosfeq Rashid,Marek Rusinkiewicz,Ray Shea,C. Unnikrishnan,Amy Unruh,Darrell Woelk 1997 The InfoSleuth Project. SIGMOD Conference Distance-Based Indexing for High-Dimensional Metric Spaces. Tolga Bozkaya,Z. Meral Özsoyoglu 1997 Distance-Based Indexing for High-Dimensional Metric Spaces. SIGMOD Conference The COntext INterchange Mediator Prototype. Stéphane Bressan,Cheng Hian Goh,Kofi Fynn,Marta Jessica Jakobisiak,Karim Hussein,Henry B. Kon,Thomas Lee,Stuart E. Madnick,Tito Pena,Jessica Qu,Annie W. Shum,Michael Siegel 1997 The Context Interchange strategy presents a novel approach for mediated data access in which semantic conflicts among heterogeneous systems are not identified a priori, but are detected and reconciled by a context mediator through comparison of contexts. This paper reports on the implementation of a Context Interchange Prototype which provides a concrete demonstration of the features and benefits of this integration strategy. SIGMOD Conference Beyond Market Baskets: Generalizing Association Rules to Correlations. Sergey Brin,Rajeev Motwani,Craig Silverstein 1997 One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing item A often also purchases item B.” Motivated by the goal of generalizing beyond market baskets and the association rules used with them, we develop the notion of mining rules that identify correlations (generalizing associations), and we consider both the absence and presence of items as a basis for generating rules. We propose measuring significance of associations via the chi-squared test for correlation from classical statistics. This leads to a measure that is upward closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between correlated and uncorrelated itemsets in the lattice. We develop pruning strategies and devise an efficient algorithm for the resulting problem. We demonstrate its effectiveness by testing it on census data and finding term dependence in a corpus of text documents, as well as on synthetic data. SIGMOD Conference Dynamic Itemset Counting and Implication Rules for Market Basket Data. Sergey Brin,Rajeev Motwani,Jeffrey D. Ullman,Shalom Tsur 1997 We consider the problem of analyzing market-basket data and present several important contributions. 
First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating “implication rules,” which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of co-occurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results. SIGMOD Conference The BUCKY Object-Relational Benchmark (Experience Paper). Michael J. Carey,David J. DeWitt,Jeffrey F. Naughton,Mohammad Asgarian,Paul Brown,Johannes Gehrke,Dhaval Shah 1997 The BUCKY Object-Relational Benchmark (Experience Paper). SIGMOD Conference "On Saying ""Enough Already!"" in SQL." Michael J. Carey,Donald Kossmann 1997 "On Saying ""Enough Already!"" in SQL." SIGMOD Conference Object-Relational Database Systems: Principles, Products, and Challenges (Tutorial). Michael J. Carey,Nelson Mendonça Mattos,Anil Nori 1997 Object-Relational Database Systems: Principles, Products, and Challenges (Tutorial). SIGMOD Conference SENTINEL: An Object-Oriented DBMS With Event-Based Rules. Sharma Chakravarthy 1997 SENTINEL: An Object-Oriented DBMS With Event-Based Rules. SIGMOD Conference Query Optimization at the Crossroads (Panel). Surajit Chaudhuri 1997 Query Optimization at the Crossroads (Panel). SIGMOD Conference Data Warehousing and OLAP for Decision Support (Tutorial). Surajit Chaudhuri,Umeshwar Dayal 1997 Data Warehousing and OLAP for Decision Support (Tutorial). SIGMOD Conference Meaningful Change Detection in Structured Data. Sudarshan S. Chawathe,Hector Garcia-Molina 1997 Detecting changes by comparing data snapshots is an important requirement for difference queries, active databases, and version and configuration management. In this paper we focus on detecting meaningful changes in hierarchically structured data, such as nested-object data. This problem is much more challenging than the corresponding one for relational or flat-file data. In order to describe changes better, we base our work not just on the traditional “atomic” insert, delete, update operations, but also on operations that move an entire sub-tree of nodes, and that copy an entire sub-tree. These operations allow us to describe changes in a semantically more meaningful way. Since this change detection problem is NP-hard, in this paper we present a heuristic change detection algorithm that yields close to “minimal” descriptions of the changes, and that has fewer restrictions than previous algorithms. Our algorithm is based on transforming the change detection problem to a problem of computing a minimum-cost edge cover of a bipartite graph. We study the quality of the solution produced by our algorithm, as well as the running time, both analytically and experimentally. SIGMOD Conference Supporting Multiple View Maintenance Policies. Latha S. Colby,Akira Kawaguchi,Daniel F. Lieuwen,Inderpal Singh Mumick,Kenneth A. Ross 1997 Materialized views and view maintenance are becoming increasingly important in practice. In order to satisfy different data currency and performance requirements, a number of view maintenance policies have been proposed. 
Immediate maintenance involves a potential refresh of the view after every update to the deriving tables. When staleness of views can be tolerated, a view may be refreshed periodically or on demand (when it is queried). The maintenance policies that are chosen for views have implications on the validity of the results of queries and affect the performance of queries and updates. In this paper, we investigate a number of issues related to supporting multiple views with different maintenance policies. We develop formal notions of consistency for views with different maintenance policies. We then introduce a model based on view groupings for view maintenance policy assignment, and provide algorithms, based on the viewgroup model, that allow consistency of views to be guaranteed. Next, we conduct a detailed study of the performance aspects of view maintenance policies based on an actual implementation of our model. The performance study investigates the trade-offs between different maintenance policy assignments. Our analysis of both the consistency and performance aspects of various view maintenance policies is important in making correct maintenance policy assignments. SIGMOD Conference Delaunay: A Database Visualization System. Isabel F. Cruz,Michael Averbuch,Wendy T. Lucas,Melissa Radzyminski,Kirby Zhang 1997 Visual query systems have traditionally supported a set of pre-defined visual displays. We describe the Delaunay system, which supports visualizations of object-oriented databases specified by the user with a visual constraint-based query language. The highlights of our approach are the expressiveness of the visual query language, the efficiency of the query engine, and the overall flexibility and extensibility of the framework. The user interface is implemented using Java and is available on the WWW. SIGMOD Conference Database Performance in the Real World - TPC-D and SAP R/3 (Experience Paper). Jochen Doppelhammer,Thomas Höppler,Alfons Kemper,Donald Kossmann 1997 Database Performance in the Real World - TPC-D and SAP R/3 (Experience Paper). SIGMOD Conference STRUDEL: A Web-site Management System. Mary F. Fernández,Daniela Florescu,Jaewoo Kang,Alon Y. Levy,Dan Suciu 1997 STRUDEL: A Web-site Management System. SIGMOD Conference Picture Programming Project. Nita Goyal,Charles Hoch,Ravi Krishnamurthy,Brian Meckler,Michael Suckow,Moshé M. Zloof 1997 Picture Programming Project. SIGMOD Conference STARTS: Stanford Proposal for Internet Meta-Searching (Experience Paper). Luis Gravano,Kevin Chen-Chuan Chang,Hector Garcia-Molina,Andreas Paepcke 1997 STARTS: Stanford Proposal for Internet Meta-Searching (Experience Paper). SIGMOD Conference A Toolkit for Negotiation Support Interfaces to Multi-Dimensional Data. Michael Gebhardt,Matthias Jarke,Stephan Jacobs 1997 CoDecide is an experimental user interface toolkit that offers an extension to spreadsheet concepts specifically geared towards support for cooperative analysis of the kinds of multi-dimensional data encountered in data warehousing. It is distinguished from previous proposals by direct support for drill-down/roll-up analysis without redesign of an interface; more importantly, CoDecide can link multiple views on a data cube for synchronous or asynchronous cooperation by multiple analysts, through a conceptual model visualizing the problem dimensions on so-called tapes. Tapes generalize the ideas of ranging and pivoting in current data warehouses for the multi-perspective and multi-user case. 
CoDecide allows the rapid composition of multi-matrix interfaces and their linkage to underlying data sources. A LAN version of CoDecide has been used in a number of design decision support applications. A WWW version representing externally materialized views on databases is currently under development. SIGMOD Conference A Framework for Implementing Hypothetical Queries. Timothy Griffin,Richard Hull 1997 Previous approaches to supporting hypothetical queries have been “eager”: some representation of the hypothetical state (or the corresponding delta) is materialized, and query evaluation is filtered through that representation. This paper develops a framework for evaluating hypothetical queries using a “lazy” approach, or using a hybrid of eager and lazy approaches. We focus on queries having the form “Q when {{U}}” where Q is a relational algebra query and U is an update expression. The value assigned to this query in state DB is the value that Q would return in the state resulting from executing U on DB. Nesting of the keyword when is permitted, and U may involve a sequence of several atomic updates. We present an equational theory for queries involving when that can be used as a basis for optimization. This theory is very different from traditional rules for the relational algebra, because the semantics of when is unlike the semantics of the algebra operators. Our theory is based on the observation that hypothetical states can be represented as substitutions, similar to those arising in functional and logic programming. Furthermore, hypothetical queries of the form Q when {{U}} can be thought of as representing the suspended application of a substitution. Using the equational theory we develop an approach to optimizing the evaluation of hypothetical queries that uses deltas in the sense of Heraclitus, and permits a range of evaluation strategies from lazy to eager. SIGMOD Conference Infomaster: An Information Integration System. Michael R. Genesereth,Arthur M. Keller,Oliver M. Duschka 1997 "Infomaster is an information integration system that provides integrated access to multiple distributed heterogeneous information sources on the Internet, thus giving the illusion of a centralized, homogeneous information system. We say that Infomaster creates a virtual data warehouse. The core of Infomaster is a facilitator that dynamically determines an efficient way to answer the user's query using as few sources as necessary and harmonizes the heterogeneities among these sources. Infomaster handles both structural and content translation to resolve differences between multiple data sources and the multiple applications for the collected data. Infomaster connects to a variety of databases using wrappers, such as for Z39.50, SQL databases through ODBC, EDI transactions, and other World Wide Web (WWW) sources. There are several WWW user interfaces to Infomaster, including forms based and textual. Infomaster also includes a programmatic interface and it can download results in structured form onto a client computer. Infomaster has been in production use for integrating rental housing advertisements from several newspapers (since fall 1995), and for meeting room scheduling (since winter 1996). Infomaster is also being used to integrate heterogeneous electronic product catalogs." SIGMOD Conference Secure Transaction Processing in Firm Real-Time Database Systems. Binto George,Jayant R. 
Haritsa 1997 Many real-time database applications arise in safety-critical installations and military systems where enforcing security is crucial to the success of the enterprise. A secure real-time database system has to simultaneously satisfy two requirements: guarantee data security and minimize the number of missed transaction deadlines. We investigate here the performance implications, in terms of missed deadlines, of guaranteeing security in a real-time database system. In particular, we focus on the concurrency control aspects of this issue. Our main contributions are the following: First, we identify which among the previously proposed real-time concurrency control protocols are capable of providing protection against both direct and indirect (covert channels) means of unauthorized access to data. Second, using a detailed simulation model of a firm-deadline real-time database system, we profile the real-time performance of a representative set of these secure concurrency control protocols. Our experiments show that a prioritized optimistic concurrency control protocol, OPT-WAIT, provides the best overall performance. Third, we propose and evaluate a novel dual approach to secure transaction concurrency control that allows the real-time database system to simultaneously use different concurrency control mechanisms for guaranteeing security and for improving real-time performance. By appropriately choosing these different mechanisms, we have been able to design hybrid concurrency control algorithms that provide even better performance than OPT-WAIT. SIGMOD Conference Languages for Multi-database Interoperability. Frédéric Gingras,Laks V. S. Lakshmanan,Iyer N. Subramanian,Despina Papoulis,Nematollaah Shiri 1997 Languages for Multi-database Interoperability. SIGMOD Conference Revisiting Commit Processing in Distributed Database Systems. Ramesh Gupta,Jayant R. Haritsa,Krithi Ramamritham 1997 "A significant body of literature is available on distributed transaction commit protocols. Surprisingly, however, the relative merits of these protocols have not been studied with respect to their quantitative impact on transaction processing performance. In this paper, using a detailed simulation model of a distributed database system, we profile the transaction throughput performance of a representative set of commit protocols. A new commit protocol, OPT, that allows transactions to “optimistically” borrow uncommitted data in a controlled manner is also proposed and evaluated. The new protocol is easy to implement and incorporate in current systems, and can coexist with most other optimizations proposed earlier. For example, OPT can be combined with current industry standard protocols such as Presumed Commit and Presumed Abort. The experimental results show that distributed commit processing can have considerably more influence than distributed data processing on the throughput performance and that the choice of commit protocol clearly affects the magnitude of this influence. Among the protocols evaluated, the new optimistic commit protocol provides the best transaction throughput performance for a variety of workloads and system configurations. In fact, OPT's peak throughput is often close to the upper bound on achievable performance. Even more interestingly, a three-phase (i.e., non-blocking) version of OPT provides better peak throughput performance than all of the standard two-phase (i.e., blocking) protocols evaluated in our study." 
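The commit-processing study above compares standard two-phase commit variants with an optimistic protocol, OPT, that lends prepared-but-uncommitted data. For orientation, the following is a minimal Python sketch of a plain two-phase commit coordinator; the Participant class and its prepare/commit/abort methods are illustrative assumptions, not interfaces from the paper, and the OPT-style lending is only noted in a comment.

# Minimal two-phase commit coordinator (sketch; names are invented for illustration).
class Participant:
    def __init__(self, name):
        self.name = name
        self.prepared = set()

    def prepare(self, tid):
        # Phase 1: force a prepare record, then vote.
        self.prepared.add(tid)
        return True                      # vote YES; a real participant may vote NO

    def commit(self, tid):
        self.prepared.discard(tid)       # Phase 2: make updates durable and visible

    def abort(self, tid):
        self.prepared.discard(tid)       # Phase 2: undo updates


def two_phase_commit(tid, participants):
    """Return True if tid commits at every participant, False otherwise."""
    votes = [p.prepare(tid) for p in participants]   # phase 1: voting
    decision = all(votes)
    for p in participants:                            # phase 2: decision
        p.commit(tid) if decision else p.abort(tid)
    return decision

# OPT-style optimism, as described in the abstract, would additionally let other
# transactions read a participant's prepared-but-uncommitted data between the two
# phases, aborting those borrowers if the lending transaction ultimately aborts.

if __name__ == "__main__":
    parts = [Participant("node-A"), Participant("node-B")]
    print("committed:", two_phase_commit("T1", parts))

Presumed Commit and Presumed Abort, mentioned in the abstract, differ from this baseline mainly in which log records and acknowledgements they omit; the voting/decision structure stays the same.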
SIGMOD Conference Template-Based Wrappers in the TSIMMIS System. Joachim Hammer,Hector Garcia-Molina,Svetlozar Nestorov,Ramana Yerneni,Markus M. Breunig,Vasilis Vassalos 1997 In order to access information from a variety of heterogeneous information sources, one has to be able to translate queries and data from one data model into another. This functionality is provided by so-called (source) wrappers [4,8] which convert queries into one or more commands/queries understandable by the underlying source and transform the native results into a format understood by the application. As part of the TSIMMIS project [1, 6] we have developed hard-coded wrappers for a variety of sources (e.g., Sybase DBMS, WWW pages, etc.) including legacy systems (Folio). However, anyone who has built a wrapper before can attest that a lot of effort goes into developing and writing such a wrapper. In situations where it is important or desirable to gain access to new sources quickly, this is a major drawback. Furthermore, we have also observed that only a relatively small part of the code deals with the specific access details of the source. The rest of the code is either common among wrappers or implements query and data transformation that could be expressed in a high level, declarative fashion. Based on these observations, we have developed a wrapper implementation toolkit [7] for quickly building wrappers. The toolkit contains a library for commonly used functions, such as for receiving queries from the application and packaging results. It also contains a facility for translating queries into source-specific commands, and for translating results into a model useful to the application. The philosophy behind our “template-based” translation methodology is as follows. The wrapper implementor specifies a set of templates (rules) written in a high level declarative language that describe the queries accepted by the wrapper as well as the objects that it returns. If an application query matches a template, an implementor-provided action associated with the template is executed to provide the native query for the underlying source1. When the source returns the result of the query, the wrapper transforms the answer which is represented in the data model of the source into a representation that is used by the application. Using this toolkit one can quickly design a simple wrapper with a few templates that cover some of the desired functionality, probably the one that is most urgently needed. However, templates can be added gradually as more functionality is required later on. Another important use of wrappers is in extending the query capabilities of a source. For instance, some sources may not be capable of answering queries that have multiple predicates. In such cases, it is necessary to pose a native query to such a source using only predicates that the source is capable of handling. The rest of the predicates are automatically separated from the user query and form a filter query. When the wrapper receives the results, a post-processing engine applies the filter query. This engine supports a set of built-in predicates based on the comparison operators =,≠,<,>, etc. In addition, the engine supports more complex predicates that can be specified as part of the filter query. The postprocessing engine is common to wrappers of all sources and is part of the wrapper toolkit. 
Note that because of postprocessing, the wrapper can handle a much larger class of queries than those that exactly match the templates it has been given. Figure 1 shows an overview of the wrapper architecture as it is currently implemented in our TSIMMIS testbed. Shaded components are provided by the toolkit, the white component is source-specific and must be generated by the implementor. The driver component controls the translation process and invokes the following services: the parser which parses the templates, the native schema, as well as the incoming queries into internal data structures, the matcher which matches a query against the set of templates and creates a filter query for postprocessing if necessary, the native component which submits the generated action string to the source, and extracts the data from the native result using the information given in the source schema, and the engine, which transforms and packages the result and applies a postprocessing filter if one has been created by the matcher. We now describe the sequence of events that occur at the wrapper during the translation of a query and its result using an example from our prototype system. The queries are formulated using a rule-based language called MSL that has been developed as a template specification and query language for the TSIMMIS project. Data is represented using our Object Exchange Model (OEM). We will briefly describe MSL and OEM in the next section. Details on MSL can be found in [5], a full introduction to OEM is given in [1]. SIGMOD Conference Scalable Parallel Data Mining for Association Rules. Eui-Hong Han,George Karypis,Vipin Kumar 1997 In this paper, we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size. SIGMOD Conference GeoMiner: A System Prototype for Spatial Data Mining. Jiawei Han,Krzysztof Koperski,Nebojsa Stefanovic 1997 Spatial data mining is to mine high-level spatial information and knowledge from large spatial databases. A spatial data mining system prototype, GeoMiner, has been designed and developed based on our years of experience in the research and development of relational data mining system, DBMiner, and our research into spatial data mining. The data mining power of GeoMiner includes mining three kinds of rules: characteristic rules, comparison rules, and association rules, in geo-spatial databases, with a planned extension to include mining classification rules and clustering rules. 
The SAND (Spatial And Nonspatial Data) architecture is applied in the modeling of spatial databases, whereas GeoMiner includes the spatial data cube construction module, spatial on-line analytical processing (OLAP) module, and spatial data mining modules. A spatial data mining language, GMQL (Geo-Mining Query Language), is designed and implemented as an extension to Spatial SQL [3], for spatial data mining. Moreover, an interactive, user-friendly data mining interface is constructed and tools are implemented for visualization of discovered spatial knowledge. SIGMOD Conference Online Aggregation. Joseph M. Hellerstein,Peter J. Haas,Helen J. Wang 1997 Aggregation in traditional database systems is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and, eventually, the final answer is returned. This archaic approach is frustrating to users and has been abandoned in most other areas of computing. In this paper we propose a new online aggregation interface that permits users to both observe the progress of their aggregation queries and control execution on the fly. After outlining usability and performance requirements for a system supporting online aggregation, we present a suite of techniques that extend a database system to meet these requirements. These include methods for returning the output in random order, for providing control over the relative rate at which different aggregates are computed, and for computing running confidence intervals. Finally, we report on an initial implementation of online aggregation in POSTGRES. SIGMOD Conference Range Queries in OLAP Data Cubes. Ching-Tien Ho,Rakesh Agrawal,Nimrod Megiddo,Ramakrishnan Srikant 1997 A range query applies an aggregation operation over all selected cells of an OLAP data cube where the selection is specified by providing ranges of values for numeric dimensions. We present fast algorithms for range queries for two types of aggregation operations: SUM and MAX. These two operations cover techniques required for most popular aggregation operations, such as those supported by SQL. For range-sum queries, the essential idea is to precompute some auxiliary information (prefix sums) that is used to answer ad hoc queries at run-time. By maintaining auxiliary information which is of the same size as the data cube, all range queries for a given cube can be answered in constant time, irrespective of the size of the sub-cube circumscribed by a query. Alternatively, one can keep auxiliary information which is 1/b^d of the size of the d-dimensional data cube. Response to a range query may now require access to some cells of the data cube in addition to the access to the auxiliary information, but the overall time complexity is typically reduced significantly. We also discuss how the precomputed information is incrementally updated by batching updates to the data cube. Finally, we present algorithms for choosing the subset of the data cube dimensions for which the auxiliary information is computed and the blocking factor to use for each such subset. Our approach to answering range-max queries is based on precomputed max over balanced hierarchical tree structures. We use a branch-and-bound-like procedure to speed up the finding of max in a region. We also show that with a branch-and-bound procedure, the average-case complexity is much smaller than the worst-case complexity. SIGMOD Conference ZOO: A Desktop Experiment Management Environment. Yannis E. 
Ioannidis,Miron Livny,Anastassia Ailamaki,Anand Narayanan,Andrew Therber 1997 ZOO: A Desktop Experiment Management Environment. SIGMOD Conference A Unified Framework for Enforcing Multiple Access Control Policies. Sushil Jajodia,Pierangela Samarati,V. S. Subrahmanian,Elisa Bertino 1997 Although several access control policies can be devised for controlling access to information, all existing authorization models, and the corresponding enforcement mechanisms, are based on a specific policy (usually the closed policy). As a consequence, although different policy choices are possible in theory, in practice only a specific policy can be actually applied within a given system. However, protection requirements within a system can vary dramatically, and no single policy may simultaneously satisfy them all. In this paper we present a flexible authorization manager (FAM) that can enforce multiple access control policies within a single, unified system. FAM is based on a language through which users can specify authorizations and access control policies to be applied in controlling execution of specific actions on given objects. We formally define the language and properties required to hold on the security specifications and prove that this language can express all security specifications. Furthermore, we show that all programs expressed in this language (called FAM/CAM-programs) are also guaranteed to be consistent (i.e., no conflicting access decisions occur) and CAM-programs are complete (i.e., every access is either authorized or denied). We then illustrate how several well-known protection policies proposed in the literature can be expressed in the FAM/CAM language and how users can customize the access control by specifying their own policies. The result is an access control mechanism which is flexible, since different access control policies can all coexist in the same data system, and extensible, since it can be augmented with any new policy a specific application or user may require. SIGMOD Conference The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. "Norio Katayama,Shin'ichi Satoh" 1997 Recently, similarity queries on feature vectors have been widely used to perform content-based retrieval of images. To apply this technique to large databases, it is required to develop multidimensional index structures supporting nearest neighbor queries efficiently. The SS-tree had been proposed for this purpose and is known to outperform other index structures such as the R*-tree and the K-D-B-tree. One of its most important features is that it employs bounding spheres rather than bounding rectangles for the shape of regions. However, we demonstrate in this paper that bounding spheres occupy much larger volume than bounding rectangles with high-dimensional data and that this reduces search efficiency. To overcome this drawback, we propose a new index structure called the SR-tree (Sphere/Rectangle-tree) which integrates bounding spheres and bounding rectangles. A region of the SR-tree is specified by the intersection of a bounding sphere and a bounding rectangle. Incorporating bounding rectangles permits neighborhoods to be partitioned into smaller regions than the SS-tree and improves the disjointness among regions. This enhances the performance on nearest neighbor queries especially for high-dimensional and non-uniform data which can be practical in actual image/video similarity indexing. 
We include the performance test results that verify this advantage of the SR-tree and show that the SR-tree outperforms both the SS-tree and the R*-tree. SIGMOD Conference Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. Flip Korn,H. V. Jagadish,Christos Faloutsos 1997 Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real world datasets (AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%. SIGMOD Conference Concurrency and Recovery in Generalized Search Trees. Marcel Kornacker,C. Mohan,Joseph M. Hellerstein 1997 This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read. The algorithms are developed in the context of the Generalized Search Tree (GiST) data structure, an index structure supporting an extensible set of queries and data types. Although developed in a GiST context, the algorithms are generally applicable to many tree-based access methods. The concurrency control protocol is based on an extension of the link technique originally developed for B-trees, and completely avoids holding node locks during I/Os. Repeatable read isolation is achieved with a novel combination of predicate locks and two-phase locking of data records. To our knowledge, this is the first time that isolation issues have been addressed outside the context of B-trees. A discussion of the fundamental structural differences between B-trees and more general tree structures like GiSTs explains why the algorithms developed here deviate from their B-tree counterparts. An implementation of GiSTs emulating B-trees in DB2/Common Server is underway. SIGMOD Conference Size Separation Spatial Join. Nick Koudas,Kenneth C. Sevcik 1997 We introduce a new algorithm to compute the spatial join of two or more spatial data sets, when indexes are not available on them. Size Separation Spatial Join (S3J) imposes a hierarchical decomposition of the data space and, in contrast with previous approaches, requires no replication of entities from the input data sets. Thus its execution time depends only on the sizes of the joined data sets. We describe S3J and present an analytical evaluation of its I/O and processor requirements comparing them with those of previously proposed algorithms for the same problem. We show that S3J has relatively simple cost estimation formulas that can be exploited by a query optimizer. S3J can be efficiently implemented using software already present in many relational systems. In addition, we introduce Dynamic Spatial Bitmaps (DSB), a new technique that enables S3J to dynamically or statically exploit bitmap query processing techniques. 
Finally, we present experimental results for a prototype implementation of S3J involving real and synthetic data sets for a variety of data distributions. Our experimental results are consistent with our analytical observations and demonstrate the performance benefits of S3J over alternative approaches that have been proposed recently. SIGMOD Conference Databases on the Web: Technologies for Federation Architectures and Case Studies (Tutorial). Ralf Kramer 1997 Databases on the Web: Technologies for Federation Architectures and Case Studies (Tutorial). SIGMOD Conference The WHIPS Prototype for Data Warehouse Creation and Maintenance. Wilburt Labio,Yue Zhuge,Janet L. Wiener,Himanshu Gupta,Hector Garcia-Molina,Jennifer Widom 1997 A data warehouse is a repository of integrated information from distributed, autonomous, and possibly heterogeneous, sources. In effect, the warehouse stores one or more materialized views of the source data. The data is then readily available to user applications for querying and analysis. Figure 1 shows the basic architecture of a warehouse: data is collected from each source, integrated with data from other sources, and stored at the warehouse. Users then access the data directly from the warehouse. As suggested by Figure 1, there are two major components in a warehouse system: the integration component, responsible for collecting and maintaining the materialized views, and the query and analysis component, responsible for fulfilling the information needs of specific end users. Note that the two components are not independent. For example, which views the integration component materializes depends on the expected needs of end users. Most current commercial warehousing systems (e.g., Redbrick, Sybase, Arbor) focus on the query and analysis component, providing specialized index structures at the warehouse and extensive querying facilities for the end user. In the WHIPS (WareHousing Information Project at Stanford) project, on the other hand, we focus on the integration component. In particular, we have developed an architecture and implemented a prototype for identifying data changes at heterogeneous sources, transforming them and summarizing them in accordance to warehouse specifications, and incrementally integrating them into the warehouse. We propose to demonstrate our prototype at SIGMOD, illustrating the main features of our architecture. Our architecture is modular and we designed it specifically to fulfill several important and interrelated goals: data sources and warehouse views can be added and removed dynamically; it is scalable by adding more internal modules; changes at the sources are detected automatically; the warehouse may be updated continuously as the sources change, without requiring “down time;” and the warehouse is always kept consistent with the source data by the integration algorithms. More details on these goals and how we achieve them are provided in [WGL+96]. SIGMOD Conference DEVise: Integrated Querying and Visualization of Large Datasets. Miron Livny,Raghu Ramakrishnan,Kevin S. Beyer,Guangshun Chen,Donko Donjerkovic,Shilpa Lawande,Jussi Myllymaki,R. Kent Wenger 1997 DEVise: Integrated Querying and Visualization of Large Datasets. SIGMOD Conference SEMCOG: An Object-based Image Retrieval System and Its Visual Query Interface. Wen-Syan Li,K. Selçuk Candan,Kyoji Hirata,Yoshinori Hara 1997 SEMCOG: An Object-based Image Retrieval System and Its Visual Query Interface. 
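The WHIPS entry above centers on collecting changes at the sources and folding them incrementally into warehouse views rather than recomputing the views from scratch. The Python sketch below illustrates that general idea for a single SUM/COUNT view; the delta format (lists of inserted and deleted rows) and the function name are assumptions made for this illustration, not the WHIPS interfaces.

# Incremental maintenance of a materialized aggregate view (sketch).
# The view keeps SUM(amount) and COUNT(*) per key; deltas are plain lists of
# (key, amount) rows marked as inserts or deletes. Illustrative only.
def refresh_view(view, inserted, deleted):
    """Apply one batch of source changes to the view in place.

    view: dict mapping key -> [running_sum, running_count]
    inserted / deleted: iterables of (key, amount) tuples
    """
    for key, amount in inserted:
        s = view.setdefault(key, [0.0, 0])
        s[0] += amount
        s[1] += 1
    for key, amount in deleted:
        s = view[key]
        s[0] -= amount
        s[1] -= 1
        if s[1] == 0:                 # the group vanished from the base data
            del view[key]
    return view

if __name__ == "__main__":
    view = {}
    refresh_view(view, inserted=[("east", 10.0), ("west", 5.0)], deleted=[])
    refresh_view(view, inserted=[("east", 2.5)], deleted=[("west", 5.0)])
    print(view)                       # {'east': [12.5, 2]}

The point of the sketch is only that each refresh touches work proportional to the delta, not to the base data; the consistency questions that SWEEP, MVC, and the maintenance-policy papers above address arise once several such views and several autonomous sources are refreshed concurrently.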
SIGMOD Conference DEVise: Integrated Querying and Visual Exploration of Large Datasets (Demo Abstract). Miron Livny,Raghu Ramakrishnan,Kevin S. Beyer,Guangshun Chen,Donko Donjerkovic,Shilpa Lawande,Jussi Myllymaki,R. Kent Wenger 1997 DEVise: Integrated Querying and Visual Exploration of Large Datasets (Demo Abstract). SIGMOD Conference Partitioned Garbage Collection of Large Object Store. Umesh Maheshwari,Barbara Liskov 1997 Partitioned Garbage Collection of Large Object Store. SIGMOD Conference Eliminating Costly Redundant Computations from SQL Trigger Executions. François Llirbat,Françoise Fabret,Eric Simon 1997 "Active database systems are now in widespread use. The use of triggers in these systems, however, is difficult because of the complex interaction between triggers, transactions, and application programs. Repeated calculations of rules may incur costly redundant computations in rule conditions and actions. In this paper, we focus on active relational database systems supporting SQL triggers. In this context, we provide a powerful and complete solution to eliminate redundant computations of SQL triggers when they are costly. We define a model to describe programs, rules and their interactions. We provide algorithms to extract invariant subqueries from trigger's condition and action. We define heuristics to memorize the most “profitable” invariants. Finally, we develop a rewriting technique that enables to generate and execute the optimized code of SQL triggers." SIGMOD Conference Temporal Aggregation in Active Database Rules. Iakovos Motakis,Carlo Zaniolo 1997 An important feature of many advanced active database prototypes is support for rules triggered by complex patterns of events. Their composite event languages provide powerful primitives for event-based temporal reasoning. In fact, with one important exception, their expressive power matches and surpasses that of sophisticated languages offered by Time Series Management Systems (TSMS), which have been extensively used for temporal data analysis and knowledge discovery. This exception pertains to temporal aggregation, for which, current active database systems offer only minimal support, if any. In this paper, we introduce the language TREPL, which addresses this problem. The TREPL prototype, under development at UCLA, offers primitives for temporal aggregation that exceed the capabilities of state-of-the-art composite event languages, and are comparable to those of TSMS languages. TREPL also demonstrates a rigorous and general approach to the definition of composite event language semantics. The meaning of a TREPL rule is formally defined by mapping it into a set of Datalog1S rules, whose logic-based semantics characterizes the behavior of the original rule. This approach handles naturally temporal aggregates, including user-defined ones, and is also applicable to other composite event languages, such as ODE, Snoop and SAMOS. SIGMOD Conference Maintenance of Data Cubes and Summary Tables in a Warehouse. Inderpal Singh Mumick,Dallan Quass,Barinderpal Singh Mumick 1997 Data warehouses contain large amounts of information, often collected from a variety of independent sources. Decision-support functions in a warehouse, such as on-line analytical processing (OLAP), involve hundreds of complex aggregate queries over large volumes of data. It is not feasible to compute these queries by scanning the data sets each time. 
Warehouse applications therefore build a large number of summary tables, or materialized aggregate views, to help them increase the system performance. As changes, most notably new transactional data, are collected at the data sources, all summary tables at the warehouse that depend upon this data need to be updated. Usually, source changes are loaded into the warehouse at regular intervals, usually once a day, in a batch window, and the warehouse is made unavailable for querying while it is updated. Since the number of summary tables that need to be maintained is often large, a critical issue for data warehousing is how to maintain the summary tables efficiently. In this paper we propose a method of maintaining aggregate views (the summary-delta table method), and use it to solve two problems in maintaining summary tables in a warehouse: (1) how to efficiently maintain a summary table while minimizing the batch window needed for maintenance, and (2) how to maintain a large set of summary tables defined over the same base tables. While several papers have addressed the issues relating to choosing and materializing a set of summary tables, this is the first paper to address maintaining summary tables efficiently. SIGMOD Conference Improved Query Performance with Variant Indexes. "Patrick E. O'Neil,Dallan Quass" 1997 The read-mostly environment of data warehousing makes it possible to use more complex indexes to speed up queries than in situations where concurrent updates are present. The current paper presents a short review of current indexing technology, including row-set representation by Bitmaps, and then introduces two approaches we call Bit-Sliced indexing and Projection indexing. A Projection index materializes all values of a column in RID order, and a Bit-Sliced index essentially takes an orthogonal bit-by-bit view of the same data. While some of these concepts started with the MODEL 204 product, and both Bit-Sliced and Projection indexing are now fully realized in Sybase IQ, this is the first rigorous examination of such indexing capabilities in the literature. We compare algorithms that become feasible with these variant index types against algorithms using more conventional indexes. The analysis demonstrates important performance advantages for variant indexes in some types of SQL aggregation, predicate evaluation, and grouping. The paper concludes by introducing a new method whereby multi-dimensional group-by queries, reminiscent of OLAP/Datacube queries but with more flexibility, can be very efficiently performed. SIGMOD Conference On-Line Warehouse View Maintenance. Dallan Quass,Jennifer Widom 1997 Data warehouses store materialized views over base data from external sources. Clients typically perform complex read-only queries on the views. The views are refreshed periodically by maintenance transactions, which propagate large batch updates from the base tables. In current warehousing systems, maintenance transactions usually are isolated from client read activity, limiting availability and/or size of the warehouse. We describe an algorithm called 2VNL that allows warehouse maintenance transactions to run concurrently with readers. By logically maintaining two versions of the database, no locking is required and serializability is guaranteed. We present our algorithm, explain its relationship to other multi-version concurrency control algorithms, and describe how it can be implemented on top of a conventional relational DBMS using a query rewrite approach. 
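The 2VNL entry above rests on keeping two logical versions of the warehouse so that readers continue to see the pre-maintenance state while a maintenance transaction installs the next one, with a single version switch making the new state visible. The toy Python sketch below mimics that discipline; the two-slot row layout and method names are invented for the illustration and are not the algorithm as implemented in the paper.

# Toy two-version table in the spirit of two-version no-locking (sketch only).
class TwoVersionTable:
    def __init__(self):
        self.version = 0                    # readers use slot (version % 2)
        self.rows = {}                      # key -> [value_in_slot0, value_in_slot1]

    def read(self, key):
        # Readers always use the committed slot; no locks are taken.
        return self.rows[key][self.version % 2]

    def stage(self, key, value):
        # Maintenance writes into the other slot, invisible to readers.
        slot = (self.version + 1) % 2
        row = self.rows.setdefault(key, [None, None])
        row[slot] = value

    def switch(self):
        # Carry forward rows untouched by maintenance, then flip the version.
        new_slot = (self.version + 1) % 2
        for row in self.rows.values():
            if row[new_slot] is None:
                row[new_slot] = row[self.version % 2]
        self.version += 1                   # one step makes the new state visible

if __name__ == "__main__":
    t = TwoVersionTable()
    t.stage("total_sales", 100); t.switch()
    t.stage("total_sales", 250)             # maintenance in progress
    print(t.read("total_sales"))            # readers still see 100
    t.switch()
    print(t.read("total_sales"))            # now 250

A real system also has to handle queries that span a version switch and the storage overhead of the second version, which is what the paper's comparison with other multi-version concurrency control schemes is about.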
SIGMOD Conference Similarity-Based Queries for Time Series Data. Davood Rafiei,Alberto O. Mendelzon 1997 We study a set of linear transformations on the Fourier series representation of a sequence that can be used as the basis for similarity queries on time-series data. We show that our set of transformations is rich enough to formulate operations such as moving average and time warping. We present a query processing algorithm that uses the underlying R-tree index of a multidimensional data set to answer similarity queries efficiently. Our experiments show that the performance of this algorithm is competitive to that of processing ordinary (exact match) queries using the index, and much faster than sequential scanning. We relate our transformations to the general framework for similarity queries of Jagadish et al. SIGMOD Conference Building a Scaleable Geo-Spatial DBMS: Technology, Implementation, and Evaluation. Jignesh M. Patel,Jie-Bing Yu,Navin Kabra,Kristin Tufte,Biswadeep Nag,Josef Burger,Nancy E. Hall,Karthikeyan Ramasamy,Roger Lueder,Curt J. Ellmann,Jim Kupsch,Shelly Guo,David J. DeWitt,Jeffrey F. Naughton 1997 Building a Scaleable Geo-Spatial DBMS: Technology, Implementation, and Evaluation. SIGMOD Conference Cubetree: Organization of and Bulk Updates on the Data Cube. Nick Roussopoulos,Yannis Kotidis,Mema Roussopoulos 1997 Cubetree: Organization of and Bulk Updates on the Data Cube. SIGMOD Conference PREDATOR: An OR-DBMS with Enhanced Data Types. Praveen Seshadri,Mark Paskin 1997 PREDATOR: An OR-DBMS with Enhanced Data Types. SIGMOD Conference Lessons from Wall Street: Case Studies in Configuration, Tuning, and Distribution (Tutorial). Dennis Shasha 1997 Lessons from Wall Street: Case Studies in Configuration, Tuning, and Distribution (Tutorial). SIGMOD Conference Wave-Indices: Indexing Evolving Databases. Narayanan Shivakumar,Hector Garcia-Molina 1997 In many applications, new data is being generated every day. Often an index of the data of a past window of days is required to answer queries efficiently. For example, in a warehouse one may need an index on the sales records of the last week for efficient data mining, or in a Web service one may provide an index of Netnews articles of the past month. In this paper, we propose a variety of wave indices where the data of a new day can be efficiently added, and old data can be quickly expired, to maintain the required window. We compare these schemes based on several system performance measures, such as storage, query response time, and maintenance work, as well as on their simplicity and ease of coding. SIGMOD Conference The Distributed Information Search Component (Disco) and the World Wide Web. Anthony Tomasic,Rémy Amouroux,Philippe Bonnet,Olga Kapitskaia,Hubert Naacke,Louiqa Raschid 1997 The Distributed Information Search COmponent (DISCO) is a prototype heterogeneous distributed database that accesses underlying data sources. The DISCO prototype currently focuses on three central research problems in the context of these systems. First, since the capabilities of each data source is different, transforming queries into subqueries on data source is difficult. We call this problem the weak data source problem. Second, since each data source performs operations in a generally unique way, the cost for performing an operation may vary radically from one wrapper to another. We call this problem the radical cost problem. Finally, existing systems behave rudely when attempting to access an unavailable data source. 
We call this problem the ungraceful failure problem. DISCO copes with these problems. For the weak data source problem, the database implementor defines precisely the capabilities of each data source. For the radical cost problem, the database implementor (optionally) defines cost information for some of the operations of a data source. The mediator uses this cost information to improve its cost model. To deal with ungraceful failures, queries return partial answers. A partial answer contains the part of the final answer to the query that was produced by the available data sources. The current working prototype of DISCO contains implementations of these solutions and operations over a collection of wrappers that access information both in files and on the World Wide Web. SIGMOD Conference Database Buffer Size Investigation for OLTP Workloads (Experience Paper). Thin-Fong Tsuei,Allan Packer,Keng-Tai Ko 1997 Database Buffer Size Investigation for OLTP Workloads (Experience Paper). SIGMOD Conference Structural Matching and Discovery in Document Databases. Jason Tsong-Li Wang,Dennis Shasha,George Jyh-Shian Chang,Liam Relihan,Kaizhong Zhang,Girish Patel 1997 Structural matching and discovery in documents such as SGML and HTML is important for data warehousing [6], version management [7, 11], hypertext authoring, digital libraries [4] and Internet databases. As an example, a user of the World Wide Web may be interested in knowing changes in an HTML document [2, 5, 10]. Such changes can be detected by comparing the old and new version of the document (referred to as structural matching of documents). As another example, in hypertext authoring, a user may wish to find the common portions in the history list of a document or in a database of documents (referred to as structural discovery of documents). In SIGMOD 95 demo sessions, we exhibited a software package, called TreeDiff [13], for comparing two latex documents and showing their differences. Given two documents, the tool represents the documents as ordered labeled trees and finds an optimal sequence of edit operations to transform one document (tree) to the other. An edit operation could be an insert, delete, or change of a node in the trees. The tool is so named because documents are represented and compared using approximate tree matching techniques [9, 12, 14]. SIGMOD Conference The MENTOR Workbench for Enterprise-wide Workflow Management. Dirk Wodtke,Jeanine Weißenfels,Gerhard Weikum,Angelika Kotz Dittrich,Peter Muth 1997 MENTOR (“Middleware for Enterprise-Wide Workflow Management”) is a joint project of the University of the Saarland, the Union Bank of Switzerland, and ETH Zurich [1, 2, 3]. The focus of the project is on enterprise-wide workflow management. Workflows in this category may span multiple organizational units each unit having its own workflow server, involve a variety of heterogeneous information systems, and require many thousands of clients to interact with the workflow management system (WFMS). The project aims to develop a scalable and highly available environment for the execution and monitoring of workflows, seamlessly integrated with a specification and verification environment. For the specification of workflows, MENTOR utilizes the formalism of state and activity charts. The mathematical rigor of the specification method establishes a basis for both correctness reasoning and for partitioning of a large workflow into a number of subworkflows according to the organizational responsibilities of the enterprise. 
For the distributed execution of the partitioned workflow specification, MENTOR relies mostly on standard middleware components and adds own components only where the standard components fall short of functionality or scalability. In particular, the run-time environment is based on a TP monitor and a CORBA implementation. SIGMOD Conference Association Rules over Interval Data. Renée J. Miller,Yuping Yang 1997 We consider the problem of mining association rules over interval data (that is, ordered data for which the separation between data points has meaning). We show that the measures of what rules are most important (also called rule interest) that are used for mining nominal and ordinal data do not capture the semantics of interval data. In the presence of interval data, support and confidence are no longer intuitive measures of the interest of a rule. We propose a new definition of interest for association rules that takes into account the semantics of interval data. We developed an algorithm for mining association rules under the new definition and overview our experience using the algorithm on large real-life datasets. SIGMOD Conference Highly Concurrent Cache Consistency for Indices in Client-Server Database Systems. Markos Zaharioudakis,Michael J. Carey 1997 In this paper, we present four approaches to providing highly concurrent B+-tree indices in the context of a data-shipping, client-server OODBMS architecture. The first performs all index operations at the server, while the other approaches support varying degrees of client caching and usage of index pages. We have implemented the four approaches, as well as the 2PL approach, in the context of the SHORE OODB system at Wisconsin, and we present experimental results from a performance study based on running SHORE on an IBM SP2 multicomputer. Our results emphasize the need for non-2PL approaches and demonstrate the tradeoffs between 2PL, no-caching, and the three caching alternatives. SIGMOD Conference An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. Yihong Zhao,Prasad Deshpande,Jeffrey F. Naughton 1997 Computing multiple related group-bys and aggregates is one of the core operations of On-Line Analytical Processing (OLAP) applications. Recently, Gray et al. [GBLP95] proposed the “Cube” operator, which computes group-by aggregations over all possible subsets of the specified dimensions. The rapid acceptance of the importance of this operator has led to a variant of the Cube being proposed for the SQL standard. Several efficient algorithms for Relational OLAP (ROLAP) have been developed to compute the Cube. However, to our knowledge there is nothing in the literature on how to compute the Cube for Multidimensional OLAP (MOLAP) systems, which store their data in sparse arrays rather than in tables. In this paper, we present a MOLAP algorithm to compute the Cube, and compare it to a leading ROLAP algorithm. The comparison between the two is interesting, since although they are computing the same function, one is value-based (the ROLAP algorithm) whereas the other is position-based (the MOLAP algorithm). Our tests show that, given appropriate compression techniques, the MOLAP algorithm is significantly faster than the ROLAP algorithm. 
In fact, the difference is so pronounced that this MOLAP algorithm may be useful for ROLAP systems as well as MOLAP systems, since in many cases, instead of cubing a table directly, it is faster to first convert the table to an array, cube the array, then convert the result back to a table. VLDB Distributed Processing over Stand-alone Systems and Applications. Gustavo Alonso,Claus Hagen,Hans-Jörg Schek,Markus Tresch 1997 Distributed Processing over Stand-alone Systems and Applications. VLDB A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. Khaled Alsabti,Sanjay Ranka,Vineet Singh 1997 A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB Garbage Collection in Object Oriented Databases Using Transactional Cyclic Reference Counting. Srinivas Ashwin,Prasan Roy,S. Seshadri,Abraham Silberschatz,S. Sudarshan 1997 Garbage Collection in Object Oriented Databases Using Transactional Cyclic Reference Counting. VLDB To Weave the Web. Paolo Atzeni,Giansalvatore Mecca,Paolo Merialdo 1997 To Weave the Web. VLDB Materialized Views Selection in a Multidimensional Database. Elena Baralis,Stefano Paraboschi,Ernest Teniente 1997 Materialized Views Selection in a Multidimensional Database. VLDB The Microsoft Repository. Philip A. Bernstein,Brian Harry,Paul Sanders,David Shutt,Jason Zander 1997 The Microsoft Repository. VLDB Geo/Environmental and Medical Data Management in the RasDaMan System. Peter Baumann,Paula Furtado,Roland Ritsch,Norbert Widmann 1997 Geo/Environmental and Medical Data Management in the RasDaMan System. VLDB A Generic Approach to Bulk Loading Multidimensional Index Structures. Jochen Van den Bercken,Bernhard Seeger,Peter Widmayer 1997 A Generic Approach to Bulk Loading Multidimensional Index Structures. VLDB Logical and Physical Versioning in Main Memory Databases. Rajeev Rastogi,S. Seshadri,Philip Bohannon,Dennis W. Leinbaugh,Abraham Silberschatz,S. Sudarshan 1997 Logical and Physical Versioning in Main Memory Databases. VLDB Integrating Reliable Memory in Databases. Wee Teck Ng,Peter M. Chen 1997 Recent results in the Rio project at the University of Michigan show that it is possible to create an area of main memory that is as safe as disk from operating system crashes. This paper explores how to integrate the reliable memory provided by the Rio file cache into a database system. Prior studies have analyzed the performance benefits of reliable memory; we focus instead on how different designs affect reliability. We propose three designs for integrating reliable memory into databases: non-persistent database buffer cache, persistent database buffer cache, and persistent database buffer cache with protection. Non-persistent buffer caches use an I/O interface to reliable memory and require the fewest modifications to existing databases. However, they waste memory capacity and bandwidth due to double buffering. Persistent buffer caches use a memory interface to reliable memory by mapping it into the database address space. This places reliable memory under complete database control and eliminates double buffering, but it may expose the buffer cache to database errors. Our third design reduces this exposure by write protecting the buffer pages. Extensive fault tests show that mapping reliable memory into the database address space does not significantly hurt reliability. This is because wild stores rarely touch dirty, committed pages written by previous transactions. 
As a result, we believe that databases should use a memory interface to reliable memory. VLDB The Oracle Universal Server Buffer. William Bridge,Ashok Joshi,M. Keihl,Tirthankar Lahiri,Juan Loaiza,N. MacNaughton 1997 The Oracle Universal Server Buffer. VLDB Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. Soumen Chakrabarti,Byron Dom,Rakesh Agrawal,Prabhakar Raghavan 1997 Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. VLDB Effective Memory Use in a Media Server. Edward Y. Chang,Hector Garcia-Molina 1997 Effective Memory Use in a Media Server. VLDB Groupwise Processing of Relational Queries. Damianos Chatziantoniou,Kenneth A. Ross 1997 Groupwise Processing of Relational Queries. VLDB Principles of Optimally Placing Data in Tertiary Storage Libraries. Stavros Christodoulakis,Peter Triantafillou,Fenia Zioga 1997 Principles of Optimally Placing Data in Tertiary Storage Libraries. VLDB An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. Surajit Chaudhuri,Vivek R. Narasayya 1997 An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. VLDB M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. Paolo Ciaccia,Marco Patella,Pavel Zezula 1997 M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases. Jens Claußen,Alfons Kemper,Guido Moerkotte,Klaus Peithner 1997 Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases. VLDB Towards an ODMG-Compliant Visual Object Query Language. Manoj Chavda,Peter T. Wood 1997 Towards an ODMG-Compliant Visual Object Query Language. VLDB Integrating SQL Databases with Content-Specific Search Engines. Stefan Deßloch,Nelson Mendonça Mattos 1997 Integrating SQL Databases with Content-Specific Search Engines. VLDB Finding Data in the Neighborhood. André Eickler,Alfons Kemper,Donald Kossmann 1997 Finding Data in the Neighborhood. VLDB Recovering Information from Summary Data. Christos Faloutsos,H. V. Jagadish,Nikolaos Sidiropoulos 1997 Recovering Information from Summary Data. VLDB Resource Scheduling in Enhanced Pay-Per-View Continuous Media Databases. Minos N. Garofalakis,Banu Özden,Abraham Silberschatz 1997 Resource Scheduling in Enhanced Pay-Per-View Continuous Media Databases. VLDB Using Probabilistic Information in Data Integration. Daniela Florescu,Daphne Koller,Alon Y. Levy 1997 Using Probabilistic Information in Data Integration. VLDB Fast Incremental Maintenance of Approximate Histograms. Phillip B. Gibbons,Yossi Matias,Viswanath Poosala 1997 Many commercial database systems maintain histograms to summarize the contents of large relations and permit efficient estimation of query result sizes for use in query optimizers. Delaying the propagation of database updates to the histogram often introduces errors into the estimation. This article presents new sampling-based approaches for incremental maintenance of approximate histograms. By scheduling updates to the histogram based on the updates to the database, our techniques are the first to maintain histograms effectively up to date at all times and avoid computing overheads when unnecessary. Our techniques provide highly accurate approximate histograms belonging to the equidepth and Compressed classes. 
Experimental results show that our new approaches provide orders of magnitude more accurate estimation than previous approaches.An important aspect employed by these new approaches is a backing sample, an up-to-date random sample of the tuples currently in a relation. We provide efficient solutions for maintaining a uniformly random sample of a relation in the presence of updates to the relation. The backing sample techniques can be used for any other application that relies on random samples of data. VLDB Data Manager for Evolvable Real-time Command and Control Systems. Eric Hughes,Roman Ginis,Bhavani M. Thuraisingham,Peter C. Krupp,John A. Maurer 1997 Data Manager for Evolvable Real-time Command and Control Systems. VLDB DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Roy Goldman,Jennifer Widom 1997 DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources. Minos N. Garofalakis,Yannis E. Ioannidis 1997 Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources. VLDB Merging Ranks from Heterogeneous Internet Sources. Luis Gravano,Hector Garcia-Molina 1997 Merging Ranks from Heterogeneous Internet Sources. VLDB A Foundation for Multi-dimensional Databases. Marc Gyssens,Laks V. S. Lakshmanan 1997 A Foundation for Multi-dimensional Databases. VLDB Optimizing Queries Across Diverse Data Sources. Laura M. Haas,Donald Kossmann,Edward L. Wimmers,Jun Yang 1997 Optimizing Queries Across Diverse Data Sources. VLDB Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates. Sven Helmer,Guido Moerkotte 1997 Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates. VLDB Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations. Yun-Wu Huang,Ning Jing,Elke A. Rundensteiner 1997 Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations. VLDB 1-Safe Algorithms for Symmetric Site Configurations. Rune Humborstad,Maitrayi Sabaratnam,Svein-Olaf Hvasshovd,Øystein Torbjørnsen 1997 1-Safe Algorithms for Symmetric Site Configurations. VLDB Multiple-View Self-Maintenance in Data Warehousing Environments. Nam Huyn 1997 Multiple-View Self-Maintenance in Data Warehousing Environments. VLDB Innovation in Database Management: Computer Science vs. Engineering. Kenneth R. Jacobs 1997 Innovation in Database Management: Computer Science vs. Engineering. VLDB Incremental Organization for Data Recording and Warehousing. H. V. Jagadish,P. P. S. Narayan,S. Seshadri,S. Sudarshan,Rama Kanneganti 1997 Incremental Organization for Data Recording and Warehousing. VLDB Implementing Abstract Objects with Inheritance in Datalog. Hasan M. Jamil 1997 Implementing Abstract Objects with Inheritance in Datalog. VLDB Mining Insurance Data at Swiss Life. Jörg-Uwe Kietz,Ulrich Reimer,Martin Staudt 1997 Mining Insurance Data at Swiss Life. VLDB Vertical Data Migration in Large Near-Line Document Archives Based on Markov-Chain Predictions. Achim Kraiss,Gerhard Weikum 1997 Vertical Data Migration in Large Near-Line Document Archives Based on Markov-Chain Predictions. VLDB Caprera: An Activity Framework for Transaction Processing on Wide-Area Networks. Suresh Kumar,Eng-Kee Kwang,Divyakant Agrawal 1997 Caprera: An Activity Framework for Transaction Processing on Wide-Area Networks. 
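The backing sample described in the Gibbons, Matias, and Poosala entry above is an up-to-date uniform random sample of the tuples in a relation, maintained as the relation changes. One standard way to keep such a sample current under insertions is reservoir sampling; the Python sketch below is a minimal illustration of that idea only (the function and variable names are ours, and deletions and the paper's specific maintenance policies are not handled), not the authors' algorithm.

    import random

    def maintain_backing_sample(inserted_tuples, sample_size, seed=0):
        """Keep a uniform random sample (a 'backing sample') of a stream of
        inserted tuples, using classic reservoir sampling."""
        rng = random.Random(seed)
        reservoir = []
        for i, t in enumerate(inserted_tuples):
            if i < sample_size:
                reservoir.append(t)        # fill the reservoir first
            else:
                j = rng.randint(0, i)      # uniform position in 0..i
                if j < sample_size:
                    reservoir[j] = t       # replacement keeps the sample uniform
        return reservoir

An approximate equi-depth or Compressed histogram can then be recomputed from the sample whenever the scheduled maintenance deems it necessary, which is far cheaper than rescanning the relation.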
VLDB A Region Splitting Strategy for Physical Database Design of Multidimensional File Organizations. Jong-Hak Lee,Young-Koo Lee,Kyu-Young Whang,Il-Yeol Song 1997 A Region Splitting Strategy for Physical Database Design of Multidimensional File Organizations. VLDB Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations. Wen-Syan Li,K. Selçuk Candan,Kyoji Hirata,Yoshinori Hara 1997 Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations. VLDB Using Versions in Update Transactions: Application to Integrity Checking. François Llirbat,Eric Simon,Dimitri Tombroff 1997 Using Versions in Update Transactions: Application to Integrity Checking. VLDB The Network as a Global Database: Challenges of Interoperability, Proactivity, Interactiveness, Legacy. Peter C. Lockemann,Ulrike Kölsch,Arne Koschel,Ralf Kramer,Ralf Nikolai,Mechtild Wallrath,Hans-Dirk Walter 1997 The Network as a Global Database: Challenges of Interoperability, Proactivity, Interactiveness, Legacy. VLDB Critical Database Technologies for High Energy Physics. David M. Malon,Edward N. May 1997 Critical Database Technologies for High Energy Physics. VLDB A Language for Manipulating Arrays. Arunprasad P. Marathe,Kenneth Salem 1997 A Language for Manipulating Arrays. VLDB Efficient Construction of Regression Trees with Range and Region Splitting. Yasuhiko Morimoto,Hiromu Ishii,Shinichi Morishita 1997 Efficient Construction of Regression Trees with Range and Region Splitting. VLDB The Complexity of Transformation-Based Join Enumeration. Arjan Pellenkoft,César A. Galindo-Legaria,Martin L. Kersten 1997 The Complexity of Transformation-Based Join Enumeration. VLDB Selectivity Estimation Without the Attribute Value Independence Assumption. Viswanath Poosala,Yannis E. Ioannidis 1997 Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB Fast Computation of Sparse Datacubes. Kenneth A. Ross,Divesh Srivastava 1997 Fast Computation of Sparse Datacubes. VLDB "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources." Mary Tork Roth,Peter M. Schwarz 1997 "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources." VLDB Efficient User-Adaptable Similarity Search in Large Multimedia Databases. Thomas Seidl,Hans-Peter Kriegel 1997 Efficient User-Adaptable Similarity Search in Large Multimedia Databases. VLDB Multidimensional Access Methods: Trees Have Grown Everywhere. Timos K. Sellis,Nick Roussopoulos,Christos Faloutsos 1997 Multidimensional Access Methods: Trees Have Grown Everywhere. VLDB The Case for Enhanced Abstract Data Types. Praveen Seshadri,Miron Livny,Raghu Ramakrishnan 1997 Support for complex data in object-relational database systems is based on abstract data types (ADTs). We argue that the current ADT approach inhibits the performance of queries that involve expensive operations on data types. Instead, we propose the Enhanced Abstract Data Type (E-ADT) paradigm, which treats operations on data types as declarative expressions that can be optimized. In this paper, we describe the E-ADT paradigm and PREDATOR, an object-relational database system based on E-ADTs. An E-ADT is an abstract data type enhanced with query optimization. Not only does an E-ADT provide operations (or methods) that can be used in SQL queries, it also supports internal interfaces that can be invoked to optimize these operations. 
This added functionality is provided without compromising the modularity of data types and the extensibility of the type system. Building such a database system requires fundamental changes in the architecture of the query processing engine; we present the system-level interfaces of PREDATOR that support E-ADTs, and describe the internal design details. Initial performance results from supporting image, time-series, and audio data as E-ADTs demonstrate an order of magnitude in performance improvements over the current ADT approach. Further, we describe how the E-ADT paradigm enables future research that can improve several aspects of object-relational query optimization. Consequently, we make the case that next-generation object-relational database systems should be based on E-ADT technology. VLDB Parallel Algorithms for High-dimensional Similarity Joins for Data Mining Applications. John C. Shafer,Rakesh Agrawal 1997 Parallel Algorithms for High-dimensional Similarity Joins for Data Mining Applications. VLDB Concurrent Garbage Collection in O2. Marcin Skubiszewski,Patrick Valduriez 1997 Concurrent Garbage Collection in O2. VLDB Adaptive Data Broadcast in Hybrid Networks. Konstantinos Stathatos,Nick Roussopoulos,John S. Baras 1997 Adaptive Data Broadcast in Hybrid Networks. VLDB Data Warehouse Configuration. Dimitri Theodoratos,Timos K. Sellis 1997 Data Warehouse Configuration. VLDB On-Demand Data Elevation in Hierarchical Multimedia Storage Servers. Peter Triantafillou,Thomas Papadakis 1997 On-Demand Data Elevation in Hierarchical Multimedia Storage Servers. VLDB Describing and Using Query Capabilities of Heterogeneous Sources. Vasilis Vassalos,Yannis Papakonstantinou 1997 Describing and Using Query Capabilities of Heterogeneous Sources. VLDB STING: A Statistical Information Grid Approach to Spatial Data Mining. Wei Wang,Jiong Yang,Richard R. Muntz 1997 STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB GTE SuperPages: Using IR Techniques for Searching Complex Objects. Steven D. Whitehead,Himanshu Sinha,Michael Murphy 1997 GTE SuperPages: Using IR Techniques for Searching Complex Objects. VLDB Efficient Testing of High Performance Transaction Processing Systems. D. Wildfogel,Ramana Yerneni 1997 Efficient Testing of High Performance Transaction Processing Systems. VLDB Algorithms for Materialized View Design in Data Warehousing Environment. Jian Yang,Kamalakar Karlapalem,Qing Li 1997 Algorithms for Materialized View Design in Data Warehousing Environment. VLDB Dynamic Memory Adjustment for External Mergesort. Weiye Zhang,Per-Åke Larson 1997 Dynamic Memory Adjustment for External Mergesort. SIGMOD Record Integrating Modelling Systems for Environmental Management Information Systems. David J. Abel,Kerry L. Taylor,Dean Kuo 1997 Special purpose modelling packages can become more accessible and more effective for decision support when integrated into a spatial information system. Integration is made difficult by differences in the models due to scope, underlying data models, and command languages. This paper extends a federated information systems design methodology and architecture by identifying parallels of the model integration problem with the database integration problem in federated database design. A schema architecture is proposed together with associated schema translation functions. The role of a problem statement, analogous to a federated database query, is defined. 
Our design approach is demonstrated in HYDRA, a decision support system for water quality management. SIGMOD Record Semistructured and Structured Data in the Web: Going Back and Forth. Paolo Atzeni,Giansalvatore Mecca,Paolo Merialdo 1997 Semistructured and Structured Data in the Web: Going Back and Forth. SIGMOD Record Intelligent Access to Heterogeneous Information Sources: Report on the 4th Workshop on Knowledge Representation Meets Databases. Franz Baader,Manfred A. Jeusfeld,Werner Nutt 1997 Intelligent Access to Heterogeneous Information Sources: Report on the 4th Workshop on Knowledge Representation Meets Databases. SIGMOD Record Wrapper Generation for Semi-structured Internet Sources. Naveen Ashish,Craig A. Knoblock 1997 With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available on-line. Numerous sources contain information that can be classified as semi-structured. At present, however, the only way to access the information is by browsing individual pages. We cannot query web documents in a database-like fashion based on their underlying structure. However, we can provide database-like querying for semi-structured WWW sources by building wrappers around these sources. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. We demonstrate the ease with which we are able to build wrappers for a number of internet sources in different domains using our implemented wrapper generation toolkit. SIGMOD Record Quasi-Cubes: Exploiting Approximations in Multidimensional Databases. Daniel Barbará,Mark Sullivan 1997 A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains at each point an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situations, cubes can require a large amount of storage. The typical approach to reducing storage cost is to materialize parts of the cube on demand. Unfortunately, this lazy evaluation can be a time-consuming operation. In this paper, we describe an approximation technique that reduces the storage cost of the cube without incurring the run time cost of lazy evaluation. The idea is to provide an incomplete description of the cube and a method of estimating the missing entries with a certain level of accuracy. The description, of course, should take a fraction of the space of the full cube and the estimation procedure should be faster than computing the data from the underlying relations. Since cubes are used to support data analysis and analysts are rarely interested in the precise values of the aggregates (but rather in trends), providing approximate answers is, in most cases, a satisfactory compromise. Alternatively, the technique can be used to implement a multiresolution system in which a tradeoff is established between the execution time of queries and the errors the user is willing to tolerate. By only going to the disk when it is necessary (to reduce the errors), the query can be executed faster. This idea can be extended to produce a system that incrementally increases the accuracy of the answer while the user is looking at it, supporting on-line aggregation.
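To make the storage issue discussed in the Quasi-Cubes entry above concrete, the Python sketch below computes an exact data cube, that is, a SUM aggregate for every group-by over every subset of the dimensions of a tiny in-memory relation. It is purely illustrative (the relation, dimension names, and choice of SUM are ours); a cube over d dimensions materializes 2^d group-bys, which is why incomplete descriptions and approximate answers are attractive.

    from itertools import combinations
    from collections import defaultdict

    def full_data_cube(rows, dims, measure):
        """Compute SUM(measure) for every group-by over every subset of dims.
        The empty subset () holds the grand total."""
        cube = {}
        for k in range(len(dims) + 1):
            for subset in combinations(dims, k):
                agg = defaultdict(float)
                for row in rows:
                    agg[tuple(row[d] for d in subset)] += row[measure]
                cube[subset] = dict(agg)
        return cube

    # Example with two dimensions, hence four group-bys.
    rows = [{"store": "A", "product": "p1", "amount": 10.0},
            {"store": "A", "product": "p2", "amount": 5.0},
            {"store": "B", "product": "p1", "amount": 7.0}]
    cube = full_data_cube(rows, ["store", "product"], "amount")
    # cube[()] == {(): 22.0}; cube[("store",)] == {("A",): 15.0, ("B",): 7.0}

A quasi-cube, by contrast, stores only part of this structure plus enough information to estimate the missing cells within a known level of accuracy.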
SIGMOD Record Mediator Languages - a Proposal for a Standard. Peter Buneman,Louiqa Raschid,Jeffrey D. Ullman 1997 Mediator Languages - a Proposal for a Standard. SIGMOD Record An Overview of Data Warehousing and OLAP Technology. Surajit Chaudhuri,Umeshwar Dayal 1997 Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996. SIGMOD Record OGDI: Toward Interoperability among Geospatial Databases. Gilles Clement,Christian Larouche,Denis Gouin,Paul Morin,Henry Kucera 1997 The growth of the geomatics industry is stunted by the difficulty of obtaining and transforming suitable spatial data. This paper describes a remedy: the Open Geospatial Datastore Interface (OGDI), which permits application software to access a variety of spatial data products. The discussion compares the OGDI approach to other standards efforts and describes the characteristics and use of OGDI, which is in the public domain. SIGMOD Record "Research Issues in Federated Database Systems: Report of EFDBS '97 Workshop." Stefan Conrad,Barry Eaglestone,Wilhelm Hasselbring,Mark Roantree,Fèlix Saltor,Martin Schönhoff,Markus Strässler,Mark W. W. Vermeer 1997 "Research Issues in Federated Database Systems: Report of EFDBS '97 Workshop." SIGMOD Record The Database and Information System Research Group at the University of Ulm. Peter Dadam,Wolfgang Klas 1997 "The University of Ulm was founded in 1967 with focus on medicine and natural sciences. In 1989 the University established two new faculties: Engineering Sciences and Computer Science. This enlargement took place within the framework of the so-called Science City Ulm. In a joint effort, the State of Baden-Württemberg, industrial companies, the University, and the City of Ulm successfully established a research and development infrastructure at or nearby the university campus consisting of the university's research labs, university-related research institutes like the Research Institute for Applied Knowledge Processing (FAW), and industrial research and development labs, especially a large research center of Daimler-Benz AG. 
Today, the Faculty of Computer Science consists of seven divisions (called 'departments'), each of which is equipped with two professor positions: Theoretical Computer Science, Artificial Intelligence, Distributed Systems, Databases and Information Systems, Software Technology and Compiler Construction, Computer Structures, and Neural Information Processing. The Dept. of Databases and Information Systems (DBIS) became operational at the beginning of 1990 when Peter Dadam joined the faculty. He came from the IBM Heidelberg Science Center (HDSC) where he managed the research department for Advanced Information Management (AIM). At the HDSC he was working on advanced database technology and applications and contributed to the development of the AIM-P system (see [1]). The second professor position was first occupied by Marc Scholl, who belonged to the DBIS department from 1992 to 1994. In 1996 Wolfgang Klas joined the DBIS department as second professor. He came from the GMD Institute for Integrated Publication and Information Systems (IPSI) where he managed the research division Distributed Multimedia Information Systems and was working on advanced object-oriented database systems technology, interoperable database systems, and multimedia information systems. At present, the DBIS team consists of the teaching and research assistants Thomas Bauer, Susanne Boll, Christian Heinlein, Clemens Hensinger, Erich Müller, Manfred Reichert, Birgit Schultheiß, the system engineer Rudi Seifert, the secretary Christiane Köppl, and the doctoral students Thomas Beuter and Anita Krämer. In the following, we concentrate on the research and development work performed previously and presently in the research groups of Peter Dadam and of Wolfgang Klas. For references to Marc Scholl's work please visit http://www.informatik.uni-konstanz.de/dbis." SIGMOD Record Query Previews for Networked Information Systems: A Case Study with NASA Environmental Data. Khoa Doan,Catherine Plaisant,Ben Shneiderman,Tom Bruns 1997 "Formulating queries on networked information systems is laden with problems: data diversity, data complexity, network growth, varied user base, and slow network access. This paper proposes a new approach to a network query user interface which consists of two phases: query preview and query refinement. This new approach is based on dynamic queries and tight coupling, guiding users to rapidly and dynamically eliminate undesired items, reduce the data volume to a manageable size, and refine queries locally before submission over a network. A two-phase dynamic query system for NASA's Earth Observing Systems--Data Information Systems (EOSDIS) is presented. The prototype was well received by the team of scientists who evaluated the interface." SIGMOD Record Converting Relational to Object-Oriented Databases. Joseph Fong 1997 As the object-oriented model becomes the trend of database technology, there is a need to convert relational to object-oriented database systems to improve productivity and flexibility. The changeover includes schema translation, data conversion and program conversion. This paper describes a methodology for integrating schema translation and data conversion. Schema translation involves semantic reconstruction and the mapping of relational schema into object-oriented schema. Data conversion involves unloading tuples of relations into sequential files and reloading them into object-oriented class files.
The methodology preserves the constraints of the relational database by mapping the equivalent data dependencies. SIGMOD Record "Editor's Notes." Michael J. Franklin 1997 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1997 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1997 "Editor's Notes." SIGMOD Record A Query Language for a Web-Site Management System. Mary F. Fernandez,Daniela Florescu,Alon Y. Levy,Dan Suciu 1997 A Query Language for a Web-Site Management System. SIGMOD Record Data Management for Earth System Science. James Frew,Jeff Dozier 1997 Earth system science is a relatively recent scientific discipline that seeks a global-scale understanding of the components, interactions, and evolution of the entire Earth system. The data being collected in support of Earth system science are rapidly approaching petabytes per year. The intrinsic problems of archiving, searching, and distributing such a huge dataset are compounded by both the heterogeneity of the data and the heterogeneous nature of Earth system science inquiry, which synthesizes models, observations, and knowledge bases from several traditional scientific disciplines. A successful data management environment for Earth system science must provide seamless access to arbitrary subsets and combinations of both local and remote data, and must be compatible with the rich data analysis environments already deployed. We describe a prototype of such an environment, built at UCSB using database technology pioneered by the Sequoia 2000 Project. We specifically address its application to a problem that requires combining point observations with gridded satellite imagery. SIGMOD Record Open GIS and On-Line Environmental Libraries. Kenn Gardels 1997 An essential component of an Environmental Information System is geographic or geospatial data coupled with geoprocessing functions. Traditional Geographic Information Systems (GIS) do not address the requirements of complex digital environmental libraries, but are now incorporating strategies for geodatabase federation, catalogs, and data mining. These strategies, however, depend on increased interoperability among diverse data stores, formats, and models. Open GIS™ is an abstraction of geodata and a specification for methods on geographic features and coverages that enables compliant applications to exchange information and processing services. For EIS, Open GIS provides an architecture for selecting geodata at its most atomic level, fusing those data into structured information frameworks, analysing information using spatial operators, and viewing the results in informative, decision-supporting ways. SIGMOD Record The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. Jim Gray,Goetz Graefe 1997 "Simple economic and performance arguments suggest appropriate lifetimes for main memory pages and suggest optimal page sizes. The fundamental tradeoffs are the prices and bandwidths of RAMs and disks. The analysis indicates that with today's technology, five minutes is a good lifetime for randomly accessed pages, one minute is a good lifetime for two-pass sequentially accessed pages, and 16 KB is a good size for index pages. These rules-of-thumb change in predictable ways as technology ratios change. They also motivate the importance of the new Kaps, Maps, Scans, and $/Kaps, $/Maps, $/TBscan metrics." SIGMOD Record "Environment Information Systems - Guest Editor's Foreword."
Oliver Günther 1997 "Environment Information Systems - Guest Editor's Foreword." SIGMOD Record Virtual Database technology. Ashish Gupta,Venky Harinarayan,Anand Rajaraman 1997 Virtual Database technology. SIGMOD Record Information Systems Research at George Mason University. Sushil Jajodia,Daniel Barbará,Alexander Brodsky,Larry Kerschberg,Amihai Motro,Edgar H. Sibley,Xiaoyang Sean Wang 1997 "George Mason University began as an independent state university in 1972. Its development has been marked by rapid growth and innovative planning, resulting in an enrollment of more than 24,000 students in 1997. It is located in Fairfax, Virginia—about fifteen miles southwest of Washington, DC—near many governmental agencies and industrial firms specializing in information-intensive products and services. Information and Software Systems Engineering (ISSE) is one of six departments in GMU's School of Information Technology and Engineering (SITE). Established in 1985, SITE has approximately 90 faculty and ISSE has 13 full time faculty. ISSE is a rapidly growing department with wide-ranging teaching and research interests. The department offers no undergraduate degree programs and Master of Science degrees in Information Systems (MSIS) and Software Engineering (SWSE). MSIS has about 800 students and the SWSE has approximately 400 students enrolled. The MSIS program graduates about 120 students and the SWSE program awards 40 degrees per year. ISSE faculty participate in the SITE doctoral program in Information Technology. ISSE Faculty chair the committees of more than one third of the doctoral students in the SITE program, which currently graduates about 30 PhDs per year. Two research centers are associated with the department: The Center for Secure Information Systems (Sushil Jajodia, Director) and the Center for Information Systems Integration and Evolution (Larry Kerschberg, Director). Departmental research in information systems is supported by grants and contracts from several sources. The following awards have been received so far for the academic year 1997-1998 and beyond: Knowledge Rovers: A Family of Intelligent Software Agents for Logistics for the Warrior. Defense Advanced Research Projects Agency (co-PIs: Kerschberg, Gomaa, Jajodia, Motro) Electronic Commerce for Logistics, Teaming Agreement with American Management Systems for DARPA BAA 95-25 Logistics Research and Development (co-PIs: Kerschberg, Gomaa, Jajodia, Motro) Linear Constraint Databases, NSF Research Initiation Award (PI: Brodsky) Linear Constraint Programming, ONR (co-PI: Brodsky with late Kannelakis (PI), Van Hentenryck, and Lassez) Towards Expressive and Efficient Queries on Sequenced Data, NSF Research Initiation Award (PI: Wang) Supporting Multiple time granularities in Query Evaluation and Data Mining, NSF (co-PIs: Jajodia, Wang) Fine Granularity Access Controls in World Wide Web, NSA (PI: Jajodia) Information Flow Control in Object-Oriented Systems NSA (PI: Jajodia) Exploring Steganography: Seeing the Unseen, NSA (PI: Jajodia) Trusted Recovery from Information Attacks, Rome Laboratory (co-PIs: Jajodia, Ammann) A Unified Framework for Supporting Multiple Access Control Policies, DARPA (PI: Jajodia) The remainder of this article provides a brief overview of our research followed by a selected list of publications. More detailed information is available at www.isse.gmu.edu." SIGMOD Record An Extended Entity-Relationship Model for Geographic Applications. 
Thanasis Hadzilacos,Nectaria Tryfona 1997 "A special-purpose extension of the Entity-Relationship model for the needs of conceptual modeling of geographic applications, called the Geo-ER Model, is presented. Handling properties associated to objects not because of the objects' nature but because of the objects' position, calls for dealing -at the semantic modeling level-with space, location and dimensionality of objects, spatial relationships, space-depending attributes, and scale and generalization of representations. In order to accomplish this in the framework of ER and its derivatives, we introduce special entity sets, relationships, and add new constructs. The rationale as well as examples of usage of the Geo-ER model from actual projects are presented." SIGMOD Record Asserting Beliefs in MLS Relational Models. Nenad Jukic,Susan V. Vrbsky 1997 "Multilevel relations, based on the current multilevel secure (MLS) relational data models, can present a user with information that is difficult to interpret and may display an inconsistent outlook about the views of other users. Such ambiguity is due to the lack of a comprehensive method for asserting and interpreting beliefs about lower level information. In this paper we identify different beliefs that can be held by higher level users about lower level information, and we introduce the new concept of a mirage tuple. We present a mechanism for asserting beliefs about all accessible tuples, including lower level tuples. This mechanism provides every user of an MLS database with an unambiguous interpretation of all viewable information and presents a consistent account of the views at all levels below the user's level." SIGMOD Record Min-Max Compression Methods for Medical Image Databases. Kosmas Karadimitriou,John M. Tyler 1997 "The volume of medical imaging data produced per year is rapidly increasing, overtaxing the capabilities of Picture Archival and Communication (PACS) systems. Image compression methods can lessen the problem by encoding digital images into more space-efficient forms. Image compression is achieved by reducing redundancy in the imaging data. Existing methods reduce redundancy in individual images. However, these methods ignore an additional source of redundancy, which is based on the common information stored in more than one image in a set of similar images. We use the term ""set redundancy"" to describe this type of redundancy. Medical image databases contain large sets of similar images, therefore they also contain significant amounts of set redundancy.This paper presents two methods that extract set redundancy from medical imaging data: the Min-Max Differential (MMD), and the Min-Max Predictive (MMP) methods. These methods can improve compression of standard image compression techniques for sets of medical images. Our tests compressing CT brain scans have shown an average of as much as 129% improvement for Huffman encoding, 93% for Arithmetic Coding, and 37% for Lempel-Ziv compression when they are combined with Min-Max methods. Both MMD and MMP are based on reversible operations, hence they provide lossless compression." SIGMOD Record Research in Databases and Data-Intensive Applications - Computer Science Department and FZI, University of Karlsruhe. Birgitta König-Ries,Peter C. 
Lockemann 1997 The future world of computing will be governed by large networks of communicating and interacting persons and machines, geographic mobility, temporary attachment to networks, the disintegration of formerly monolithic organizations and systems into autonomously acting units, the substitution of cooperation regimes for centralized control, and an ever-increasing spectrum of ever more ambitious applications. In such a world the methods, techniques and tools of database technology will play new and more diversified roles, not so much in combination as parts of all-inclusive database systems but rather individually as indispensable ingredients of or desirable enhancements to novel communication, control and application systems. The two information systems groups whose work is presented in this report aim at meeting these new challenges. Our contributions are in the large field of what we call “distributed data-intensive applications”. SIGMOD Record WWW-UDK: A Web-based Environmental Meta-Information System. Ralf Kramer,Ralf Nikolai,Arne Koschel,Claudia Rolker,Peter C. Lockemann,Andree Keitel,Rudolf Legat,Konrad Zirm 1997 "The environmental data catalogue Umweltdatenkatalog UDK is a standard meta-information system for environmental data for use by state authorities and the public. Technically, the UDK consists of a database together with a front-end tailored to the needs of environmental specialists. FZI's contribution has been to develop a front-end that makes the UDK database available using the tools and techniques of the World-Wide Web. Among the features of WWW-UDK are several query modes for the UDK objects and addresses, an environmental thesaurus, on-line access to some of the underlying data (e.g., databases and environmental reports), multilingual query and result forms, and an on-line help system. Currently, several installations of WWW-UDK are used in Austria and in Germany on the Internet and on Intranets. WWW-UDK can be easily integrated into a federation architecture which is based on CORBA, WWW, and Java." SIGMOD Record Extracting Entity Profiles from Semistructured Information Spaces. Robert A. Nado,Scott B. Huffman 1997 "A semistructured information space consists of multiple collections of textual documents containing fielded or tagged sections. The space can be highly heterogeneous, because each collection has its own schema, and there are no enforced keys or formats for data items across collections. Thus, structured methods like SQL cannot be easily employed, and users often must make do with only full-text search. In this paper, we describe an approach that provides structured querying for particular types of entities, such as companies and people. Entity-based retrieval is enabled by normalizing entity references in a heuristic, type-dependent manner. The approach can be used to retrieve documents and can also be used to construct entity profiles — summaries of commonly sought information about an entity based on the documents' content. The approach requires only a modest amount of meta-information about the source collections, much of which is derived automatically." SIGMOD Record Workshop on Workflow Management in Scientific and Engineering Applications - Report. Richard McClatchey,Gottfried Vossen 1997 Workshop on Workflow Management in Scientific and Engineering Applications - Report. SIGMOD Record Lore: A Database Management System for Semistructured Data. 
Jason McHugh,Serge Abiteboul,Roy Goldman,Dallan Quass,Jennifer Widom 1997 Lore (for Lightweight Object Repository) is a DBMS designed specifically for managing semistructured information. Implementing Lore has required rethinking all aspects of a DBMS, including storage management, indexing, query processing and optimization, and user interfaces. This paper provides an overview of these aspects of the Lore system, as well as other novel features such as dynamic structural summaries and seamless access to data from external sources. SIGMOD Record Integrating Dynamically-Fetched External Information into a DBMS for Semistructured Data. Jason McHugh,Jennifer Widom 1997 "We describe the external data manager component of the Lore database system for semistructured data. Lore's external data manager enables dynamic retrieval and integration of data from arbitrary, heterogeneous external sources during query processing. The distinction between Lore-resident and external data is invisible to the user. We introduce a flexible notion of arguments that limits the amount of data fetched from an external source, and we have incorporated optimizations to reduce the number of calls to an external source." SIGMOD Record Inferring Structure in Semistructured Data. Svetlozar Nestorov,Serge Abiteboul,Rajeev Motwani 1997 When dealing with semistructured data such as that available on the Web, it becomes important to infer the inherent structure, both for the user (e.g., to facilitate querying) and for the system (e.g., to optimize access). In this paper, we consider the problem of identifying some underlying structure in large collections of semistructured data. Since we expect the data to be fairly irregular, this structure consists of an approximate classification of objects into a hierarchical collection of types. We propose a notion of a type hierarchy for such data, and outline a method for deriving the type hierarchy, and rules for assigning types to data elements. SIGMOD Record Opportunities in Information Management and Assurance. Xiaolei Qian 1997 Opportunities in Information Management and Assurance. SIGMOD Record "Report on DART '96: Databases: Active and Real-Time (Concepts meet Practice)." Krithi Ramamritham,Nandit Soparkar 1997 "DART '96 was held in conjunction with the Conference on Information and Knowledge Management (CIKM) on Nov 15th in Baltimore. Its goal was to provide a forum for researchers and practitioners involved in integrating concepts and technologies from active and real-time databases to discuss the state of the art and chart a course of action. To this end, nine speakers from academia, industry, and research laboratories were invited to provide a perspective on the theory and practice underlying active real-time databases. In addition, some selected papers were presented briefly to complement the invited speakers' talks. The second half of the workshop was devoted to discussions aimed at identifying the problems that still need to be addressed in the contexts of the diverse target applications." SIGMOD Record Extraction of Object-Oriented Structures from Existing Relational Databases. Shekar Ramanathan,Julia E. Hodges 1997 Due to the wide use of object-oriented technology in software development and the existence of many relational databases, reverse engineering of relational schemas to object-oriented schemas is gaining in interest.
One of the major problems with existing approaches for this schema mapping is that they fail to take into consideration many modern relational database design alternatives (e.g., use of binary data to store multiple-valued attributes). This paper presents a schema mapping procedure that can be applied on existing relational databases without changing their schema. The procedure maps a relational schema that is at least in 2NF into an object-oriented schema by taking into consideration various types of relational database design optimizations. SIGMOD Record Management of Data and Services in the Environmental Information System (UIS) of Baden-Württemberg. Wolf-Fritz Riekert,Roland Mayer-Föll,Gerlinde Wiest 1997 Management of Data and Services in the Environmental Information System (UIS) of Baden-Württemberg. SIGMOD Record "A Consumer Viewpoint on ""Mediator Languages - a Proposal for a Standard""." Arnon Rosenthal,Eric Hughes,Scott Renner,Leonard J. Seligman 1997 "A Consumer Viewpoint on ""Mediator Languages - a Proposal for a Standard""." SIGMOD Record Industry Perspectives. Leonard J. Seligman 1997 Industry Perspectives. SIGMOD Record Database Systems - Breaking Out of the Box. Abraham Silberschatz,Stanley B. Zdonik 1997 Database Systems - Breaking Out of the Box. SIGMOD Record Foreword: Management of Semistructured Data. Dan Suciu 1997 Foreword: Management of Semistructured Data. SIGMOD Record "Chair's Message." Richard T. Snodgrass 1997 "Chair's Message." SIGMOD Record "Chair's Message." Richard T. Snodgrass 1997 "Chair's Message." SIGMOD Record Improving Access to Environmental Data Using Context Information. Anthony Tomasic,Eric Simon 1997 A very large number of data sources on environment, energy, and natural resources are available worldwide. Unfortunately, users usually face several problems when they want to search and use environmental information. In this paper, we analyze these problems. We describe a conceptual analysis of the four major tasks in the production of environmental data, from the technology point of view, and describe the organization of the data that results from these tasks. We then discuss the notion of metainformation and outline an architecture for environmental data systems that formally models metadata and addresses some of the major problems faced by users. ICDE Network Latency Optimizations in Distributed Database Systems. Sujata Banerjee,Panos K. Chrysanthis 1998 Network Latency Optimizations in Distributed Database Systems. ICDE Online Generation of Association Rules. Charu C. Aggarwal,Philip S. Yu 1998 Online Generation of Association Rules. ICDE The Active Hypermedia Delivery System (AHYDS) using the PHASME Application-Oriented DBMS. Frédéric Andrès,Kinji Ono 1998 The Active Hypermedia Delivery System (AHYDS) using the PHASME Application-Oriented DBMS. ICDE "Generalizing ``Search'' in Generalized Search Trees (Extended Abstract)." Paul M. Aoki 1998 "Generalizing ``Search'' in Generalized Search Trees (Extended Abstract)." ICDE WebOQL: Restructuring Documents, Databases, and Webs. Gustavo O. Arocena,Alberto O. Mendelzon 1998 WebOQL: Restructuring Documents, Databases, and Webs. ICDE Design and Implementation of Display Specification for Multimedia Answers. Chitta Baral,Graciela Gonzalez,Tran Cao Son 1998 Design and Implementation of Display Specification for Multimedia Answers. ICDE Point-Versus Interval-Based Temporal Data Models. Michael H. Böhlen,Renato Busatto,Christian S. Jensen 1998 Point-Versus Interval-Based Temporal Data Models. 
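The five-minute figure in the Gray and Graefe entry above (The Five-Minute Rule Ten Years Later) follows from a break-even argument: keep a page cached in RAM as long as the rent on the memory it occupies costs less than the disk accesses it saves. In the notation of that line of work, the break-even reference interval is

    \[
      \text{BreakEvenInterval (seconds)} \;=\;
        \frac{\text{PagesPerMBofRAM}}{\text{AccessesPerSecondPerDisk}}
        \times
        \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofRAM}}
    \]

With round late-1990s figures assumed here only for illustration (8 KB pages, so roughly 128 pages per MB of RAM, about 64 random accesses per second per disk, a disk drive on the order of $2000, and RAM on the order of $15 per MB), the interval is (128 / 64) × (2000 / 15) ≈ 267 seconds, i.e., close to five minutes; as the abstract notes, the rule shifts in predictable ways as these technology ratios change.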
ICDE Outstanding Challenges in OLAP. Jeffrey A. Bedell 1998 Outstanding Challenges in OLAP. ICDE Fast Nearest Neighbor Search in High-Dimensional Space. Stefan Berchtold,Bernhard Ertl,Daniel A. Keim,Hans-Peter Kriegel,Thomas Seidl 1998 Fast Nearest Neighbor Search in High-Dimensional Space. ICDE Design and Performance of an Assertional Concurrency Control System. Arthur J. Bernstein,David Scott Gerstl,Wai-Hong Leung,Philip M. Lewis 1998 Design and Performance of an Assertional Concurrency Control System. ICDE Flattening an Object Algebra to Provide Performance. Peter A. Boncz,Annita N. Wilschut,Martin L. Kersten 1998 Flattening an Object Algebra to Provide Performance. ICDE "General Chair's Message, Program Co-Chairs' Message, Committees, Reviewers, Author Index." 1998 "General Chair's Message, Program Co-Chairs' Message, Committees, Reviewers, Author Index." ICDE Representing and Querying Changes in Semistructured Data. Sudarshan S. Chawathe,Serge Abiteboul,Jennifer Widom 1998 Representing and Querying Changes in Semistructured Data. ICDE Global Integration of Visual Databases. Wendy Chang,Deepak Murthy,Aidong Zhang,Tanveer Fathima Syeda-Mahmood 1998 Global Integration of Visual Databases. ICDE Dynamic Granular Locking Approach to Phantom Protection in R-Trees. Kaushik Chakrabarti,Sharad Mehrotra 1998 Dynamic Granular Locking Approach to Phantom Protection in R-Trees. ICDE Future Directions in Database Research (Panel). Surajit Chaudhuri,Hector Garcia-Molina,Henry F. Korth,Guy M. Lohman,David B. Lomet,David Maier 1998 Future Directions in Database Research (Panel). ICDE ECA Rule Support for Distributed Heterogeneous Environments. Sharma Chakravarthy,Roger Le 1998 ECA Rule Support for Distributed Heterogeneous Environments. ICDE Cache Management for Mobile Databases: Design and Evaluation. Boris Y. L. Chan,Antonio Si,Hong Va Leong 1998 Cache Management for Mobile Databases: Design and Evaluation. ICDE Redbrick Vista: Aggregate Computation and Management. Latha S. Colby,Richard L. Cole,Edward Haslam,Nasi Jazayeri,Galt Johnson,William J. McKenna,Lee Schumacher,David Wilhite 1998 Redbrick Vista: Aggregate Computation and Management. ICDE Optimizing Regular Path Expressions Using Graph Schemas. Mary F. Fernandez,Dan Suciu 1998 Optimizing Regular Path Expressions Using Graph Schemas. ICDE Data Logging: A Method for Efficient Data Updates in Constantly Active RAIDs. Eran Gabber,Henry F. Korth 1998 Data Logging: A Method for Efficient Data Updates in Constantly Active RAIDs. ICDE Safeguarding and Charging for Information on the Internet. Hector Garcia-Molina,Steven P. Ketchpel,Narayanan Shivakumar 1998 Safeguarding and Charging for Information on the Internet. ICDE Messaging/Queuing in Oracle8. Dieter Gawlick 1998 Messaging/Queuing in Oracle8. ICDE Compressing Relations and Indexes. Jonathan Goldstein,Raghu Ramakrishnan,Uri Shaft 1998 Compressing Relations and Indexes. ICDE The New Database Imperatives. Goetz Graefe 1998 The New Database Imperatives. ICDE Query Folding with Inclusion Dependencies. Jarek Gryz 1998 Query Folding with Inclusion Dependencies. ICDE Junglee: Integrating Data of All Shapes and Sizes. Ashish Gupta 1998 Junglee: Integrating Data of All Shapes and Sizes. ICDE Virtual Database Technology. Ashish Gupta,Venky Harinarayan,Anand Rajaraman 1998 Virtual Database Technology. ICDE DB-MAN: A Distributed Database System based on Database Migration in ATM Networks. 
Takahiro Hara,Kaname Harumoto,Masahiko Tsukamoto,Shojiro Nishio 1998 DB-MAN: A Distributed Database System based on Database Migration in ATM Networks. ICDE The LSD-Tree: An Access Structure for Feature Vectors. Andreas Henrich 1998 The LSD-Tree: An Access Structure for Feature Vectors. ICDE Processing Incremental Multidimensional Range Queries in a Direct Manipulation Visual Query. Stacie Hibino,Elke A. Rundensteiner 1998 Processing Incremental Multidimensional Range Queries in a Direct Manipulation Visual Query. ICDE Efficient Discovery of Functional and Approximate Dependencies Using Partitions. Ykä Huhtala,Juha Kärkkäinen,Pasi Porkka,Hannu Toivonen 1998 Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE Distributed Video Presentations. Eenjun Hwang,V. S. Subrahmanian,B. Prabhakaran 1998 Distributed Video Presentations. ICDE An Extended Object-Oriented Database Approach to Networked Multimedia Applications. Hiroshi Ishikawa,Koki Kato,Miyuki Ono,Naomi Yoshizawa,Kazumi Kubota,Akiko Kanaya 1998 An Extended Object-Oriented Database Approach to Networked Multimedia Applications. ICDE Failure Handling and Coordinated Execution of Concurrent Workflows. Mohan Kamath,Krithi Ramamritham 1998 Failure Handling and Coordinated Execution of Concurrent Workflows. ICDE Asynchronous Version Advancement in a Distributed Three-Version Database. H. V. Jagadish,Inderpal Singh Mumick,Michael Rabinovich 1998 Asynchronous Version Advancement in a Distributed Three-Version Database. ICDE Content-based Multimedia Information Management. Ramesh Jain 1998 Content-based Multimedia Information Management. ICDE SEMCOG: A Hybrid Object-based Image Database System and Its Modeling, Language, and Query Processing. Wen-Syan Li,K. Selçuk Candan 1998 SEMCOG: A Hybrid Object-based Image Database System and Its Modeling, Language, and Query Processing. ICDE Coarse Indices for a Tape-Based Data Warehouse. Theodore Johnson 1998 Coarse Indices for a Tape-Based Data Warehouse. ICDE A Graphical Editor for the Conceptual Design of Business Rules. Peter Lang,Werner Obermair,W. Kraus,Thomas Thalhammer 1998 A Graphical Editor for the Conceptual Design of Business Rules. ICDE High Dimensional Similarity Joins: Algorithms and Performance Evaluation. Nick Koudas,Kenneth C. Sevcik 1998 Current data repositories include a variety of data types, including audio, images, and time series. State-of-the-art techniques for indexing such data and doing query processing rely on a transformation of data elements into points in a multidimensional feature space. Indexing and query processing then take place in the feature space. In this paper, we study algorithms for finding relationships among points in multidimensional feature spaces, specifically algorithms for multidimensional joins. Like joins of conventional relations, correlations between multidimensional feature spaces can offer valuable information about the data sets involved. We present several algorithmic paradigms for solving the multidimensional join problem and we discuss their features and limitations. We propose a generalization of the Size Separation Spatial Join algorithm, named Multidimensional Spatial Join (MSJ), to solve the multidimensional join problem. We evaluate MSJ along with several other specific algorithms, comparing their performance for various dimensionalities on both real and synthetic multidimensional data sets. 
Our experimental results indicate that MSJ, which is based on space filling curves, consistently yields good performance across a wide range of dimensionalities. ICDE On Query Spreadsheets. Laks V. S. Lakshmanan,Subbu N. Subramanian,Nita Goyal,Ravi Krishnamurthy 1998 On Query Spreadsheets. ICDE The Effect of Buffering on the Performance of R-Trees. Scott T. Leutenegger,Mario A. Lopez 1998 Past R-tree studies have focused on the number of nodes visited as a metric of query performance. Since database systems usually include a buffering mechanism, we propose that the number of disk accesses is a more realistic measure of performance. We develop a buffer model to analyze the number of disk accesses required for spatial queries using R-trees. The model can be used to evaluate the quality of R-tree update operations, such as various node splitting and tree restructuring policies, as measured by query performance on the resulting tree. We use our model to study the performance of three well-known R-tree loading algorithms. We show that ignoring buffer behavior and using number of nodes accessed as a performance metric can lead to incorrect conclusions, not only quantitatively, but also qualitatively. In addition, we consider the problem of how many levels of the R-tree should be pinned in the buffer. ICDE Building A Robust Workflow Management System With Persistent Queues and Stored Procedures. Frank Leymann,Dieter Roller 1998 Building A Robust Workflow Management System With Persistent Queues and Stored Procedures. ICDE Parallelizing Loops in Database Programming Languages. Daniel F. Lieuwen 1998 Parallelizing Loops in Database Programming Languages. ICDE Mining Association Rules: Anti-Skew Algorithms. Jun-Lin Lin,Margaret H. Dunham 1998 Mining Association Rules: Anti-Skew Algorithms. ICDE Methodical Restructuring of Complex Workflow Activities. Ling Liu,Calton Pu 1998 Methodical Restructuring of Complex Workflow Activities. ICDE Query Processing in a Video Retrieval System. King-Lup Liu,A. Prasad Sistla,Clement T. Yu,Naphtali Rishe 1998 Query Processing in a Video Retrieval System. ICDE ROL: A Prototype for Deductive and Object-Oriented Databases (Demo). Mengchi Liu,Weidong Yu,Min Guo,Riqiang Shan 1998 ROL: A Prototype for Deductive and Object-Oriented Databases (Demo). ICDE Persistent Applications Using Generalized Redo Recovery. David B. Lomet 1998 Persistent Applications Using Generalized Redo Recovery. ICDE Grouping Techniques for Update Propagation in Intermittently Connected Databases. Sameer Mahajan,Michael J. Donahoo,Shamkant B. Navathe,Mostafa H. Ammar,Sanjoy Malik 1998 Grouping Techniques for Update Propagation in Intermittently Connected Databases. ICDE A Tightly-Coupled Architecture for Data Mining. Rosa Meo,Giuseppe Psaila,Stefano Ceri 1998 A Tightly-Coupled Architecture for Data Mining. ICDE ZEBRA Image Access System. Srilekha Mudumbai,Kshitij Shah,Amit P. Sheth,Krishnan Parasuraman,Clemens Bertram 1998 ZEBRA Image Access System. ICDE Data Warehousing Lessons From Experience (Panel). "Patrick E. O'Neil,Richard Winter,Clark D. French,Dan Crowley,William J. McKenna" 1998 Data Warehousing Lessons From Experience (Panel). ICDE Leveraging Mediator Cost Models with Heterogeneous Data Sources. Hubert Naacke,Georges Gardarin,Anthony Tomasic 1998 Leveraging Mediator Cost Models with Heterogeneous Data Sources. ICDE Performance Analysis of Parallel Hash Join Algorithms on a Distributed Shared Memory Machine: Implementation and Evaluation on HP Exemplar SPP 1600. 
Miyuki Nakano,Hiroomi Imai,Masaru Kitsuregawa 1998 Performance Analysis of Parallel Hash Join Algorithms on a Distributed Shared Memory Machine: Implementation and Evaluation on HP Exemplar SPP 1600. ICDE Cyclic Association Rules. Banu Özden,Sridhar Ramaswamy,Abraham Silberschatz 1998 Cyclic Association Rules. ICDE The Alps at Your Fingertips: Virtual Reality and Geoinformation Systems. Renato Pajarola,Thomas Ohler,Peter Stucki,Kornel Szabo,Peter Widmayer 1998 The Alps at Your Fingertips: Virtual Reality and Geoinformation Systems. ICDE Cyclic Allocation of Two-Dimensional Data. Sunil Prabhakar,Khaled A. S. Abdel-Ghaffar,Divyakant Agrawal,Amr El Abbadi 1998 Various proposals have been made for declustering two-dimensionally tiled data on multiple I/O devices. Recently, it was shown that strictly optimal solutions only exist under very restrictive conditions on the tiling of the two-dimensional space or for very few I/O devices. In this paper we explore allocation methods where no strictly optimal solution exists. We propose a general class of allocation methods, referred to as cyclic declustering methods, and show that many existing methods are instances of this class. As a result, various seemingly ad hoc and unrelated methods are presented in a single framework. Furthermore, the framework is used to develop new allocation methods that give better performance than any previous method and that approach the best feasible performance. ICDE WWW and the Internet - Did We Miss the Boat? (Panel). Michael Rabinovich,C. Mic Bowman,Hector Garcia-Molina,Alon Y. Levy,Susan Malaika,Alberto O. Mendelzon 1998 WWW and the Internet - Did We Miss the Boat? (Panel). ICDE Mining Optimized Association Rules with Categorical and Numeric Attributes. Rajeev Rastogi,Kyuseok Shim 1998 Mining association rules on large data sets has received considerable attention in recent years. Association rules are useful for determining correlations between attributes of a relation and have applications in marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes and the problem is to determine instantiations such that either the support or confidence of the rule is maximized. In this paper, we generalize the optimized association rules problem in three ways: 1) association rules are allowed to contain disjunctions over uninstantiated attributes, 2) association rules are permitted to contain an arbitrary number of uninstantiated attributes, and 3) uninstantiated attributes can be either categorical or numeric. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving multiple attributes. We present effective techniques for pruning the search space when computing optimized association rules for both categorical and numeric attributes. Finally, we report the results of our experiments that indicate that our pruning algorithms are efficient for a large number of uninstantiated attributes, disjunctions, and values in the domain of the attributes. ICDE Ending the MOLAP/ROLAP Debate: Usage Based Aggregation and Flexible HOLAP (Abstract). Corey Salka 1998 Ending the MOLAP/ROLAP Debate: Usage Based Aggregation and Flexible HOLAP (Abstract). ICDE Mining for Strong Negative Associations in a Large Database of Customer Transactions. 
Ashok Savasere,Edward Omiecinski,Shamkant B. Navathe 1998 Mining for Strong Negative Associations in a Large Database of Customer Transactions. ICDE Encoded Bitmap Indexing for Data Warehouses. Ming-Chuan Wu,Alejandro P. Buchmann 1998 Encoded Bitmap Indexing for Data Warehouses. ICDE Data Intensive Intra- & Internet Applications - Experiences Using Java and CORBA in the World Wide Web. Jürgen Sellentin,Bernhard Mitschang 1998 Data Intensive Intra- & Internet Applications - Experiences Using Java and CORBA in the World Wide Web. ICDE Industry Applications of Data Mining: Challenges & Opportunities (Abstract). Evangelos Simoudis 1998 Industry Applications of Data Mining: Challenges & Opportunities (Abstract). ICDE Employing Intelligent Agents for Knowledge Discovery (Abstract). Earl Stahl 1998 Employing Intelligent Agents for Knowledge Discovery (Abstract). ICDE Concurrent Operations in a Distributed and Mobile Collaborative Environment. Maher Suleiman,Michèle Cart,Jean Ferrié 1998 Concurrent Operations in a Distributed and Mobile Collaborative Environment. ICDE Cost Models for Join Queries in Spatial Databases. Yannis Theodoridis,Emmanuel Stefanakis,Timos K. Sellis 1998 Cost Models for Join Queries in Spatial Databases. ICDE Migrating Legacy Databases and Applications (Panel). Bhavani M. Thuraisingham,Sandra Heiler,Arnon Rosenthal,Susan Malaika 1998 Migrating Legacy Databases and Applications (Panel). ICDE Remote Load-Sensitive Caching for Multi-Server Database Systems. Shivakumar Venkataraman,Jeffrey F. Naughton,Miron Livny 1998 Remote Load-Sensitive Caching for Multi-Server Database Systems. ICDE Cost and Imprecision in Modeling the Position of Moving Objects. Ouri Wolfson,Sam Chamberlain,Son Dao,Liqin Jiang,Gisela Mendez 1998 Cost and Imprecision in Modeling the Position of Moving Objects. ICDE Fuzzy Triggers: Incorporating Imprecise Reasoning into Active Databases. Antoni Wolski,Tarik Bouaziz 1998 Fuzzy Triggers: Incorporating Imprecise Reasoning into Active Databases. ICDE A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. Xiaowei Xu,Martin Ester,Hans-Peter Kriegel,Jörg Sander 1998 A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE Array-Based Evaluation of Multi-Dimensional Queries in Object-Relational Databases Systems. Yihong Zhao,Karthikeyan Ramasamy,Kristin Tufte,Jeffrey F. Naughton 1998 Array-Based Evaluation of Multi-Dimensional Queries in Object-Relational Databases Systems. ICDE Efficient Retrieval of Similar Time Sequences Under Time Warping. Byoung-Kee Yi,H. V. Jagadish,Christos Faloutsos 1998 Efficient Retrieval of Similar Time Sequences Under Time Warping. ICDE Graph Structured Views and Their Incremental Maintenance. Yue Zhuge,Hector Garcia-Molina 1998 Graph Structured Views and Their Incremental Maintenance. ICDE Back to the Future: Dynamic Hierarchical Clustering. Chendong Zou,Betty Salzberg,Rivka Ladin 1998 Back to the Future: Dynamic Hierarchical Clustering. SIGMOD Conference "Oracle Rdb's Record Caching Model." Richard Anderson,Gopalan Arun,Richard Frank 1998 In this paper we present a more efficient record based caching model than the conventional page (disk block) based scheme. In a record caching model, individual records are stored together in a section of shared memory to form the cache. Traditional relational database systems have individual pages that are stored together in shared memory to form the cache and records are then extracted from these pages on demand. 
The record cache model has better memory utilization than the page model and also helps reduce overheads like page fetches/writes, page locks, and code path. In May 1996, Oracle Rdb announced a record-breaking 14,227 tpmC on a Digital AlphaServer 8400. At the time, this was the best TPC-C performance achieved on a single SMP machine. A total of 15 record caches, caching 19.5 million records and consuming almost 7 GB of memory, formed the bulk of the shared memory. SIGMOD Conference Replication, Consistency, and Practicality: Are These Mutually Exclusive? Todd A. Anderson,Yuri Breitbart,Henry F. Korth,Avishai Wool 1998 Previous papers have postulated that traditional schemes for the management of replicated data are doomed to failure in practice due to a quartic (or worse) explosion in the probability of deadlocks. In this paper, we present results of a simulation study for three recently introduced protocols that guarantee global serializability and transaction atomicity without resorting to the two-phase commit protocol. The protocols analyzed in this paper include a global locking protocol [10], a “pessimistic” protocol based on a replication graph [5], and an “optimistic” protocol based on a replication graph [7]. The results of the study show a wide range of practical applicability for the lazy replica-update approach employed in these protocols. We show that under reasonable contention conditions and a sufficiently high transaction rate, both replication-graph-based protocols outperform the global locking protocol. The distinctions among the protocols in terms of performance are significant. For example, under an offered load where 70%-80% of transactions were aborted by the global locking protocol, only 10% of transactions were aborted under the protocols based on the replication graph. The results of the study suggest that protocols based on a replication graph offer practical techniques for replica management. However, the study also shows that performance deteriorates rapidly and dramatically when transaction throughput reaches a saturation point. SIGMOD Conference About Quark Digital Media System. Kamar Aulakh 1998 About Quark Digital Media System. SIGMOD Conference Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. 
Rakesh Agrawal,Johannes Gehrke,Dimitrios Gunopulos,Prabhakar Raghavan 1998 Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets. SIGMOD Conference A Multi-Similarity Algebra. Sibel Adali,Piero A. Bonatti,Maria Luisa Sapino,V. S. Subrahmanian 1998 The need to automatically extract and classify the contents of multimedia data archives such as images, video, and text documents has led to significant work on similarity based retrieval of data. To date, most work in this area has focused on the creation of index structures for similarity based retrieval. There is very little work on developing formalisms for querying multimedia databases that support similarity based computations and optimizing such queries, even though it is well known that feature extraction and identification algorithms in media data are very expensive. We introduce a similarity algebra that brings together relational operators and results of multiple similarity implementations in a uniform language. The algebra can be used to specify complex queries that combine different interpretations of similarity values and multiple algorithms for computing these values. We prove equivalence and containment relationships between similarity algebra expressions and develop query rewriting methods based on these results. We then provide a generic cost model for evaluating the cost of query plans in the similarity algebra and query optimization methods based on this model. We supplement the paper with experimental results that illustrate the use of the algebra and the effectiveness of query optimization methods using the Integrated Search Engine (I.SEE) as the testbed. SIGMOD Conference Electronic Commerce: Tutorial. Nabil R. Adam,Yelena Yesha 1998 As we embark on the information age, the use of electronic information is spreading through all sectors of society, both nationally and internationally. As a result, commercial organizations, educational institutions and government agencies are finding it essential to be linked by world wide networks, and commercial Internet usage is growing at an accelerating pace. SIGMOD Conference NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. Brad Adelberg 1998 NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. SIGMOD Conference ARIADNE: A System for Constructing Mediators for Internet Sources. José Luis Ambite,Naveen Ashish,Greg Barish,Craig A. Knoblock,Steven Minton,Pragnesh Jay Modi,Ion Muslea,Andrew Philpot,Sheila Tejada 1998 The Web is based on a browsing paradigm that makes it difficult to retrieve and integrate data from multiple sites. 
Today, the only way to achieve this integration is by building specialized applications, which are time-consuming to develop and difficult to maintain. We are addressing this problem by creating the technology and tools for rapidly constructing information mediators that extract, query, and integrate data from web sources. The resulting system, called Ariadne, makes it feasible to rapidly build information mediators that access existing web sources. SIGMOD Conference The Pyramid-Technique: Towards Breaking the Curse of Dimensionality. Stefan Berchtold,Christian Böhm,Hans-Peter Kriegel 1998 In this paper, we propose the Pyramid-Technique, a new indexing method for high-dimensional data spaces. The Pyramid-Technique is highly adapted to range query processing using the maximum metric Lmax. In contrast to all other index structures, the performance of the Pyramid-Technique does not deteriorate when processing range queries on data of higher dimensionality. The Pyramid-Technique is based on a special partitioning strategy which is optimized for high-dimensional data. The basic idea is to divide the data space first into 2d pyramids sharing the center point of the space as their top. In a second step, the single pyramids are cut into slices parallel to the base of the pyramid. These slices form the data pages. Furthermore, we show that this partition provides a mapping from the given d-dimensional space to a 1-dimensional space. Therefore, we are able to use a B+-tree to manage the transformed data. As an analytical evaluation of our technique for hypercube range queries and uniform data distribution shows, the Pyramid-Technique clearly outperforms index structures using other partitioning strategies. To demonstrate the practical relevance of our technique, we experimentally compared the Pyramid-Technique with the X-tree, the Hilbert R-tree, and the Linear Scan. The results of our experiments using both synthetic and real data demonstrate that the Pyramid-Technique outperforms the X-tree and the Hilbert R-tree by a factor of up to 14 (number of page accesses) and up to 2500 (total elapsed time) for range queries. SIGMOD Conference "High-Dimensional Index Structures, Database Support for Next Decade's Applications (Tutorial)." Stefan Berchtold,Daniel A. Keim 1998 "High-Dimensional Index Structures, Database Support for Next Decade's Applications (Tutorial)." SIGMOD Conference CONTROL: Continuous Output and Navigation Technology with Refinement On-Line. Ron Avnur,Joseph M. Hellerstein,Bruce Lo,Chris Olston,Bhaskaran Raman,Vijayshankar Raman,Tali Roth,Kirk Wylie 1998 The CONTROL project at U.C. Berkeley has developed technologies to provide online behavior for data-intensive applications. Using new query processing algorithms, these technologies continuously improve estimates and confidence statistics. In addition, they react to user feedback, thereby giving the user control over the behavior of long-running operations. This demonstration displays the modifications to a database system and the resulting impact on aggregation queries, data visualization, and GUI widgets. We then compare this interactive behavior to batch-processing alternatives. SIGMOD Conference The Multidimensional Database System RasDaMan. Peter Baumann,Andreas Dehmel,Paula Furtado,Roland Ritsch,Norbert Widmann 1998 RasDaMan is a universal — i.e., domain-independent — array DBMS for multidimensional arrays of arbitrary size and structure. A declarative, SQL-based array query language offers flexible retrieval and manipulation. 
Efficient server-based query evaluation is enabled by an intelligent optimizer and a streamlined storage architecture based on flexible array tiling and compression. RasDaMan is being used in several international projects for the management of geo and healthcare data of various dimensionality. SIGMOD Conference Efficiently Mining Long Patterns from Databases. Roberto J. Bayardo Jr. 1998 We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnitude or more. SIGMOD Conference Microsoft Universal Data Access Platform. José A. Blakeley,Michael Pizzo 1998 Microsoft Universal Data Access defines a platform for developing multi-tier enterprise applications that require efficient access to diverse relational or non-relational data sources across intranets or the Internet. Universal Data Access consists of a collection of software components that interact with each other using system-level interfaces defined by OLE DB, and provides an application-level data access model called ActiveX Data Objects (ADO). This talk provides an overview of the platform. SIGMOD Conference Delivering High Availability for Inktomi Search Engines. Eric A. Brewer 1998 "Inktomi provides the back-end for several well-known search engines, including Wired's HotBot and Microsoft's MS Start page. The services are supported by a highly available cluster with more than 300 CPUs and several hundred disks." SIGMOD Conference The IDEA Web Lab. Stefano Ceri,Piero Fraternali,Stefano Paraboschi 1998 With the spreading of the World Wide Web as a uniform and ubiquitous interface to computer applications and information, novel opportunities are offered for introducing significant changes in all organizations and their processes. This demo presents the IDEA Web Laboratory (Web Lab), a Web-based software design environment available on the Internet, which demonstrates a novel approach to the software production process on the Web. SIGMOD Conference Enhanced Hypertext Categorization Using Hyperlinks. Soumen Chakrabarti,Byron Dom,Piotr Indyk 1998 A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost on a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. 
In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%. SIGMOD Conference Transactional Publish / Subscribe: The Proactive Multicast of Database Changes. Arvola Chan 1998 Transactional Publish / Subscribe: The Proactive Multicast of Database Changes. SIGMOD Conference Bitmap Index Design and Evaluation. Chee Yong Chan,Yannis E. Ioannidis 1998 Bitmap indexing has been touted as a promising approach for processing complex ad hoc queries in read-mostly environments, like those of decision support systems. Nevertheless, only a few possible bitmap schemes have been proposed in the past and very little is known about the space-time tradeoff that they offer. In this paper, we present a general framework to study the design space of bitmap indexes for selection queries and examine the disk-space and time characteristics that the various alternative index choices offer. In particular, we draw a parallel between bitmap indexing and number representation in different number systems, and define a space of two orthogonal dimensions that captures a wide array of bitmap indexes, both old and new. Within that space, we identify (analytically or experimentally) the following interesting points: (1) the time-optimal bitmap index; (2) the space-optimal bitmap index; (3) the bitmap index with the optimal space-time tradeoff (knee); and (4) the time-optimal bitmap index under a given disk-space constraint. Finally, we examine the impact of bitmap compression and bitmap buffering on the space-time tradeoffs among those indexes. As part of this work, we also describe a bitmap-index-based evaluation algorithm for selection queries that represents an improvement over earlier proposals. We believe that this study offers a useful first set of guidelines for physical database design using bitmap indexes. SIGMOD Conference Free Parallel Data Mining. Bin Li,Dennis Shasha 1998 Data mining is computationally expensive. Since the benefits of data mining results are unpredictable, organizations may not be willing to buy new hardware for that purpose. We will present a system that enables data mining applications to run in parallel on networks of workstations in a fault-tolerant manner. We will describe our parallelization of a combinatorial pattern discovery algorithm and a classification tree algorithm. We will demonstrate the effectiveness of our system with two real applications: discovering active motifs in protein sequences and predicting foreign exchange rate movement. SIGMOD Conference Random Sampling for Histogram Construction: How much is enough? Surajit Chaudhuri,Rajeev Motwani,Vivek R. Narasayya 1998 Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. 
We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0, and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions. SIGMOD Conference "AutoAdmin 'What-if' Index Analysis Utility." Surajit Chaudhuri,Vivek R. Narasayya 1998 "AutoAdmin 'What-if' Index Analysis Utility." SIGMOD Conference Microsoft Index Tuning Wizard for SQL Server 7.0. Surajit Chaudhuri,Vivek R. Narasayya 1998 Microsoft Index Tuning Wizard for SQL Server 7.0. SIGMOD Conference A Protein Patent Query System Powered By Kleisli. Jing Chen,Limsoon Wong,Louxin Zhang 1998 A Protein Patent Query System Powered By Kleisli. SIGMOD Conference Changing the Rules: Transformations for Rule-Based Optimizers. Mitch Cherniack,Stanley B. Zdonik 1998 Rule-based optimizers are extensible because they consist of modifiable sets of rules. For modification to be straightforward, rules must be easily reasoned about (i.e., understood and verified). At the same time, rules must be expressive and efficient (to fire) for rule-based optimizers to be practical. Production-style rules (as in [15]) are expressed with code and are hard to reason about. Pure rewrite rules (as in [1]) lack code, but cannot atomically express complex transformations (e.g., normalizations). Some systems allow rules to be grouped, but sacrifice efficiency by providing limited control over their firing. Therefore, none of these approaches succeeds in making rules expressive, efficient and understandable. We propose a language (COKO) for expressing an alternative form of input to a rule-based optimizer. A COKO transformation consists of a set of declarative (KOLA) rewrite rules and a (firing) algorithm that specifies their firing. It is straightforward to reason about COKO transformations because all query modification is expressed with declarative rewrite rules. Firing is specified algorithmically with an expressive language that provides direct control over how query representations are traversed, and under what conditions rules are fired. Therefore, COKO achieves a delicate balance of understandability, efficiency and expressivity. SIGMOD Conference Real Business Processing Meets the Web. James Chong 1998 Charles Schwab & Co., Inc. is a major web trader generating a large proportion of its revenue from the Web. That revenue depends both on a site with many useful facilities and on the speed of execution, the ability to cope with peaks in demand, and the reliability of the site and its underlying services. James Chong, VP Architecture and Planning at Schwab, will talk about the fundamental infrastructure that supports Web trading, and about his plans for its evolution. 
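As a minimal illustration of the equi-height (equi-depth) histograms discussed in the Chaudhuri, Motwani, and Narasayya entry above, the Python sketch below derives bucket boundaries from a uniform random sample of a column. It is not the paper's algorithm: the paper's contributions are the error metric, the bound on how much sampling suffices, and an adaptive page-level sampling scheme. The function name, the synthetic column, and the 1% sampling rate are illustrative assumptions only.

```python
import random

def equi_depth_boundaries(sample, num_buckets):
    """Approximate equi-height histogram: the bucket separators are sample
    quantiles, so each bucket covers roughly the same number of sampled values."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(b * n) // num_buckets] for b in range(1, num_buckets)]

# Illustrative use: approximate a column's distribution from a 1% random sample.
column = [random.gauss(50.0, 15.0) for _ in range(100_000)]  # stand-in for a table column
sample = random.sample(column, len(column) // 100)
print(equi_depth_boundaries(sample, num_buckets=10))  # 9 separators, roughly 10% of rows per bucket
```

How large the sample must be for such boundaries to meet a pre-specified error bound is precisely the question the entry above addresses.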
SIGMOD Conference Java and Relational Databases: SQLJ (Tutorial). Gray Clossman,Phil Shaw,Mark Hapner,Johannes Klein,Richard Pledereder,Brian Becker 1998 Java and Relational Databases: SQLJ (Tutorial). SIGMOD Conference Your Mediators Need Data Conversion! Sophie Cluet,Claude Delobel,Jérôme Siméon,Katarzyna Smaga 1998 Due to the development of the World Wide Web, the integration of heterogeneous data sources has become a major concern of the database community. Appropriate architectures and query languages have been proposed. Yet the problem of data conversion, which is essential for the development of mediator/wrapper architectures, has remained largely unexplored. In this paper, we present the YAT system for data conversion. This system provides tools for the specification and the implementation of data conversions among heterogeneous data sources. It relies on a middleware model, a declarative language, a customization mechanism and a graphical interface. The model is based on named trees with ordered and labeled nodes. Like semistructured data models, it is simple enough to facilitate the representation of any data. Its main originality is that it allows reasoning at various levels of representation. The YAT conversion language (called YATL) is declarative, rule-based and features enhanced pattern matching facilities and powerful restructuring primitives. It allows the order of collections to be preserved or reconstructed. The customization mechanism relies on program instantiations: an existing program may be instantiated into a more specific one, and then easily modified. We also present the architecture, implementation and practical use of the YAT prototype, currently under evaluation within the OPAL* project. SIGMOD Conference Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. William W. Cohen 1998 "Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second." SIGMOD Conference Providing Database-like Access to the Web Using Queries Based on Textual Similarity. William W. Cohen 1998 Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. 
Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. Here we assume instead that the names are given in natural language text. We then propose a logic for database integration called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. An implemented data integration system based on WHIRL has been used to successfully integrate information from several dozen Web sites in two domains. SIGMOD Conference "DTL's DataSpot: Database Exploration as Easy as Browsing the Web ..." Shaul Dar,Gadi Entin,Shai Geva,Eran Palmon 1998 "DTL's DataSpot: Database Exploration as Easy as Browsing the Web ..." SIGMOD Conference Caching Multidimensional Queries Using Chunks. Prasad Deshpande,Karthikeyan Ramasamy,Amit Shukla,Jeffrey F. Naughton 1998 Caching has been proposed (and implemented) by OLAP systems in order to reduce response times for multidimensional queries. Previous work on such caching has considered table level caching and query level caching. Table level caching is more suitable for static schemes. On the other hand, query level caching can be used in dynamic schemes, but is too coarse for “large” query results. For small query results, query level caching has the further drawback that it is only effective when a new query is subsumed by a previously cached query. In this paper, we propose caching small regions of the multidimensional space called “chunks”. Chunk-based caching allows fine granularity caching, and allows queries to partially reuse the results of previous queries with which they overlap. To facilitate the computation of chunks required by a query but missing from the cache, we propose a new organization for relational tables, which we call a “chunked file.” Our experiments show that for workloads that exhibit query locality, chunked caching combined with the chunked file organization performs better than query level caching. An unexpected benefit of the chunked file organization is that, due to its multidimensional clustering properties, it can significantly improve the performance of queries that “miss” the cache entirely as compared to traditional file organizations. SIGMOD Conference Database Systems Management and Oracle8. C. Gregory Doherty 1998 "Oracle's corporate mission is to enable the Information Age through network computing, a vision of broader access to information for all and the empowerment and increased productivity that can result. The technology implications of the network computing vision are ubiquitous access via low-cost appliances to smaller numbers of larger databases, accessed via professionally managed networks compliant with open internetworking protocols. The latest release of the Oracle data server, Oracle8, provides new technology for management of very large databases containing rich and user-defined data types, and is continuing to evolve to make it economically beneficial to store all forms of digital information in a database." SIGMOD Conference Managing Large Systems with DB2 UDB. Chris Eaton 1998 "In this talk, we will describe the usability challenges facing large distributed corporations. We will also discuss what IBM's DB2 Universal Database is doing to address these complex issues." SIGMOD Conference Catching the Boat with Strudel: Experiences with a Web-Site Management System. Mary F. 
Fernández,Daniela Florescu,Jaewoo Kang,Alon Y. Levy,Dan Suciu 1998 "The Strudel system applies concepts from database management systems to the process of building Web sites. Strudel's key idea is separating the management of the site's data, the creation and management of the site's structure, and the visual presentation of the site's pages. First, the site builder creates a uniform model of all data available at the site. Second, the builder uses this model to declaratively define the Web site's structure by applying a “site-definition query” to the underlying data. The result of evaluating this query is a “site graph”, which represents both the site's content and structure. Third, the builder specifies the visual presentation of pages in Strudel's HTML-template language. The data model underlying Strudel is a semi-structured model of labeled directed graphs. We describe Strudel's key characteristics, report on our experiences using Strudel, and present the technical problems that arose from our experience. We describe our experience constructing several Web sites with Strudel and discuss the impact of potential users' requirements on Strudel's design. We address two main questions: (1) when does a declarative specification of site structure provide significant benefits, and (2) what are the main advantages provided by the semi-structured data model." SIGMOD Conference Query Unnesting in Object-Oriented Databases. Leonidas Fegaras 1998 There is already a sizable body of proposals on OODB query optimization. One of the most challenging problems in this area is query unnesting, where the embedded query can take any form, including aggregation and universal quantification. Although there are already a number of proposed techniques for query unnesting, most of these techniques are applicable to only a few cases. We believe that the lack of a general and simple solution to the query unnesting problem is due to the lack of a uniform algebra that treats all operations (including aggregation and quantification) in the same way. This paper presents a new query unnesting algorithm that generalizes many unnesting techniques proposed recently in the literature. Our system is capable of removing any form of query nesting using a very simple and efficient algorithm. The simplicity of the system is due to the use of the monoid comprehension calculus as an intermediate form for OODB queries. The monoid comprehension calculus treats operations over multiple collection types, aggregates, and quantifiers in a similar way, resulting in a uniform way of unnesting queries, regardless of their type of nesting. SIGMOD Conference """Data In Your Face"": Push Technology in Perspective." Michael J. Franklin,Stanley B. Zdonik 1998 """Data In Your Face"": Push Technology in Perspective." SIGMOD Conference The DEDALE System for Complex Spatial Queries. Stéphane Grumbach,Philippe Rigaux,Luc Segoufin 1998 This paper presents DEDALE, a spatial database system intended to overcome some limitations of current systems by providing an abstract and non-specialized data model and query language for the representation and manipulation of spatial objects. DEDALE relies on a logical model based on linear constraints, which generalizes the constraint database model of [KKR90]. While spatial data in the classical constraint model is always decomposed into its convex components, DEDALE allows holes in order to fit the needs of practical applications. 
The logical representation of spatial data, although slightly more costly in memory, has the advantage of simplifying the algorithms. DEDALE relies on nested relations, in which all sorts of data (thematic, spatial, etc.) are stored in a uniform fashion. This new data model supports declarative query languages, which allow an intuitive and efficient manipulation of spatial objects. Their formal foundation constitutes a basis for practical query optimization. We describe several evaluation rules tailored for geometric data and give the specification of an optimizer module for spatial queries. Except for the latter module, the system has been fully implemented upon the O2 DBMS, thus proving the effectiveness of a constraint-based approach for the design of spatial database systems. SIGMOD Conference New Sampling-Based Summary Statistics for Improving Approximate Query Answers. Phillip B. Gibbons,Yossi Matias 1998 In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. Before DBMSs providing highly accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse. SIGMOD Conference CURE: An Efficient Clustering Algorithm for Large Databases. Sudipto Guha,Rajeev Rastogi,Kyuseok Shim 1998 Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality. SIGMOD Conference Secure and Portable Database Extensibility. Michael W. 
Godfrey,Tobias Mayr,Praveen Seshadri,Thorsten von Eicken 1998 "The functionality of extensible database servers can be augmented by user-defined functions (UDFs). However, the server's security and stability are concerns whenever new code is incorporated. Recently, there has been interest in the use of Java for database extensibility. This raises several questions: Does Java solve the security problems? How does it affect efficiency? We explore the tradeoffs involved in extending the PREDATOR object-relational database server using Java. We also describe some interesting details of our implementation. The issues examined in our study are security, efficiency, and portability. Our performance experiments compare Java-based extensibility with traditional alternatives in the native language of the server. We explore a variety of UDFs that differ in the amount of computation involved and in the quantity of data accessed. We also qualitatively compare the security and portability of the different alternatives. Our conclusion is that Java-based UDFs are a viable approach in terms of performance. However, there may be challenging design issues in integrating Java UDFs with existing database systems." SIGMOD Conference Incremental Distance Join Algorithms for Spatial Databases. Gísli R. Hjaltason,Hanan Samet 1998 Two new spatial join operations, distance join and distance semi-join, are introduced where the join output is ordered by the distance between the spatial attribute values of the joined tuples. Incremental algorithms are presented for computing these operations, which can be used in a pipelined fashion, thereby obviating the need to wait for their completion when only a few tuples are needed. The algorithms can be used with a large class of hierarchical spatial data structures and arbitrary spatial data types in any number of dimensions. In addition, any distance metric may be employed. A performance study using R-trees shows that the incremental algorithms outperform non-incremental approaches by an order of magnitude if only a small part of the result is needed, while the penalty, if any, for the incremental processing is modest if the entire join result is required. SIGMOD Conference On Parallel Processing of Aggregate and Scalar Functions in Object-Relational DBMS. Michael Jaedicke,Bernhard Mitschang 1998 Nowadays, parallel object-relational DBMS are envisioned as the next great wave, but there is still a lack of efficient implementation concepts for some parts of the proposed functionality. Thus one of the current goals for parallel object-relational DBMS is to move towards higher performance. In this paper we develop a framework that allows user-defined functions to be processed with data parallelism. We will describe the class of partitionable functions that can be processed in parallel. We will also propose an extension that allows the processing of another large class of functions to be sped up by means of parallel sorting. Functions that can be processed by means of our techniques are often used in decision support queries on large data volumes, for example. Hence a parallel execution is indispensable. SIGMOD Conference Interaction of Query Evaluation and Buffer Management for Information Retrieval. Björn Þór Jónsson,Michael J. Franklin,Divesh Srivastava 1998 The proliferation of the World Wide Web has brought information retrieval (IR) techniques to the forefront of search technology. 
To the average computer user, “searching” now means using IR-based systems for finding information on the WWW or in other document collections. IR query evaluation methods and workloads differ significantly from those found in database systems. In this paper, we focus on three such differences. First, due to the inherent fuzziness of the natural language used in IR queries and documents, an additional degree of flexibility is permitted in evaluating queries. Second, IR query evaluation algorithms tend to have access patterns that cause problems for traditional buffer replacement policies. Third, IR search is often an iterative process, in which a query is repeatedly refined and resubmitted by the user. Based on these differences, we develop two complementary techniques to improve the efficiency of IR queries: 1) Buffer-aware query evaluation, which alters the query evaluation process based on the current contents of buffers; and 2) Ranking-aware buffer replacement, which incorporates knowledge of the query processing strategy into replacement decisions. In a detailed performance study we show that using either of these techniques yields significant performance benefits and that in many cases, combining them produces even further improvements. SIGMOD Conference Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. Navin Kabra,David J. DeWitt 1998 For a number of reasons, even the best query optimizers can very often produce sub-optimal query execution plans, leading to a significant degradation of performance. This is especially true in databases used for complex decision support queries and/or object-relational databases. In this paper, we describe an algorithm that detects sub-optimality of a query execution plan during query execution and attempts to correct the problem. The basic idea is to collect statistics at key points during the execution of a complex query. These statistics are then used to optimize the execution of the query, either by improving the resource allocation for that query, or by changing the execution plan for the remainder of the query. To ensure that this does not significantly slow down the normal execution of a query, the Query Optimizer carefully chooses what statistics to collect, when to collect them, and the circumstances under which to re-optimize the query. We describe an implementation of this algorithm in the Paradise Database System, and we report on performance studies, which indicate that this can result in significant improvements in the performance of complex queries. SIGMOD Conference Dimensionality Reduction for Similarity Searching in Dynamic Databases. Kothuri Venkata Ravi Kanth,Divyakant Agrawal,Ambuj K. Singh 1998 Databases are increasingly being used to store multi-media objects such as maps, images, audio and video. Storage and retrieval of these objects is accomplished using multi-dimensional index structures such as R*-trees and SS-trees. As dimensionality increases, query performance in these index structures degrades. This phenomenon, generally referred to as the dimensionality curse, can be circumvented by reducing the dimensionality of the data. Such a reduction is however accompanied by a loss of precision of query results. Current techniques such as QBIC use SVD transform-based dimensionality reduction to ensure high query precision. The drawback of this approach is that SVD is expensive to compute, and therefore not readily applicable to dynamic databases. 
In this paper, we propose novel techniques for performing SVD-based dimensionality reduction in dynamic databases. When the data distribution changes considerably so as to degrade query precision, we recompute the SVD transform and incorporate it in the existing index structure. For recomputing the SVD-transform, we propose a novel technique that uses aggregate data from the existing index rather than the entire data. This technique reduces the SVD-computation time without compromising query precision. We then explore efficient ways to incorporate the recomputed SVD-transform in the existing index structure without degrading subsequent query response times. These techniques reduce the computation time by a factor of 20 in experiments on color and texture image vectors. The error due to approximate computation of SVD is less than 10%. SIGMOD Conference SAP R/3: A Database Application System (Tutorial). Alfons Kemper,Donald Kossmann,Florian Matthes 1998 SAP R/3: A Database Application System (Tutorial). SIGMOD Conference Microsoft.com: A High-Scale Data Management and Transaction Processing Solution. Sherri Kennamer 1998 "Microsoft.com is the world's largest corporate website, both in terms of site visitors and pages served. Overall, it is the fourth-largest website in total visitors behind America Online, Yahoo and Netscape. We offer 250,000 pages of content, viewable in all major browser versions (yes, we aggressively support Netscape), supported by three server farms internationally and featuring content updated as often as every three hours, seven days a week." SIGMOD Conference FlowBack: Providing Backward Recovery for Workflow Systems. Bartek Kiepuszewski,Ralf Mühlberger,Maria E. Orlowska 1998 FlowBack: Providing Backward Recovery for Workflow Systems. SIGMOD Conference amdb: An Access Method Debugging Tool. Marcel Kornacker,Mehul A. Shah,Joseph M. Hellerstein 1998 amdb: An Access Method Debugging Tool. SIGMOD Conference An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees. Yannis Kotidis,Nick Roussopoulos 1998 Relational On-Line Analytical Processing (ROLAP) is emerging as the dominant approach in data warehousing with decision support applications. In order to enhance query performance, the ROLAP approach relies on selecting and materializing in summary tables appropriate subsets of aggregate views which are then engaged in speeding up OLAP queries. However, a straightforward relational storage implementation of materialized ROLAP views is immensely wasteful on storage and incredibly inadequate on query performance and incremental update speed. In this paper we propose the use of Cubetrees, a collection of packed and compressed R-trees, as an alternative storage and index organization for ROLAP views and provide an efficient algorithm for mapping an arbitrary set of OLAP views to a collection of Cubetrees that achieve excellent performance. Compared to a conventional (relational) storage organization of materialized OLAP views, Cubetrees offer at least a 2-to-1 storage reduction, 10-to-1 better OLAP query performance, and 100-to-1 faster updates. We compare the two alternative approaches with data generated from the TPC-D benchmark and stored in the Informix Universal Server (IUS). The straightforward implementation materializes the ROLAP views using IUS tables and conventional B-tree indexing. The Cubetree implementation materializes the same ROLAP views using a Cubetree Datablade developed for IUS. 
The experiments demonstrate that the Cubetree storage organization is superior in storage, query performance and update speed. SIGMOD Conference User-oriented smart-cache for the Web: What You Seek is What You Get! Zoé Lacroix,Arnaud Sahuguet,Raman Chandrasekar 1998 "Standard database approaches to querying information on the Web focus on the source(s) and provide a query language based on a given predefined organization (schema) of the data: this is the source-driven approach. However, can the Web be seen as a standard database? There is no super-user in charge of monitoring the source(s) (the data is constantly updated), there is no homogeneous structure (thus no common explicit structure), the Web itself never stops growing, etc. For these reasons, we believe that the source-driven standard approach is not suitable for the Web. As an alternative, we propose a user-oriented approach based on the idea that the schema is a posteriori expressed by the user's needs when asking a query. Given a user query, AKIRA (Agentive Knowledge-based Information Retrieval Architecture) [6] extracts a target structure (structure expressed in the query) and uses standard information retrieval and filtering techniques to access potentially relevant documents. The user-oriented paradigm means that the structure through which the data is viewed does not come from the source but is extracted from the user query. When a user asks a query, the relevant information is retrieved from the Web and stored as is in a cache. Then the information is extracted from the raw data using computational linguistic techniques. The AKIRA cache (smart-cache) represents these extracted layers of meta-information on top of the raw data. The smart-cache is an object-oriented database whose schema is inferred from the user's target structure. It is designed on demand through a library of concepts that can be assembled together to match concepts and meta-concepts required in the user's query. The smart-cache can be seen as a view of the Web. To the best of our knowledge, AKIRA is the only system that uses information retrieval and extraction integrated with database techniques to provide maximum flexibility to the user and offer transparent access to the content of Web documents." SIGMOD Conference 50,000 Users on an Oracle8 Universal Server Database. Tirthankar Lahiri,Ashok Joshi,Amit Jasuja,Sumanta Chatterjee 1998 In this paper, we describe the Oracle Large User Population Demonstration and highlight the scalability mechanisms in the Oracle8 Universal Data Server which make it possible to support as many as 50,000 concurrent users on a single Oracle8 database without any middle-tier TP-monitor software. Supporting such large user populations requires many mechanisms for high concurrency and throughput. Algorithms in all areas of the server ranging from process and buffer management to SQL compilation and execution must be designed to be highly scalable. Efficient resource sharing mechanisms are required to prevent server-side resource requirements from growing unboundedly with the number of users. Parallel execution across multiple systems is necessary to allow user population and throughput to scale beyond the restrictions of a single system. In addition to scalability, mechanisms for high availability, ease-of-use, and rich functionality are necessary for supporting complex user applications typical of realistic workloads. 
All mechanisms must be portable to a wide variety of installations ranging from desk-top systems to large scale enterprise servers and to a wide variety of operating systems. SIGMOD Conference Memory Management During Run Generation in External Sorting. Per-Åke Larson,Goetz Graefe 1998 "If replacement selection is used in an external mergesort to generate initial runs, individual records are deleted and inserted in the sort operation's workspace. Variable-length records introduce the need for possibly complex memory management and extra copying of records. As a result, few systems employ replacement selection, even though it produces longer runs than commonly used algorithms. We experimentally compared several algorithms and variants for managing this workspace. We found that the simple best fit algorithm achieves memory utilization of 90% or better and run lengths over 1.8 times workspace size, with no extra copying of records and very little other overhead, for widely varying record sizes and for a wide range of memory sizes. Thus, replacement selection is a viable algorithm for commercial database systems, even for variable-length records. Efficient memory management also enables an external sort algorithm that degrades gracefully when its input is only slightly larger than or a small multiple of the available memory size. This is not the case with the usual implementations of external sorting, which incur I/O for the entire input even if it is as little as one record larger than memory. Thus, in some cases, our techniques may reduce I/O volume by a factor 10 compared to traditional database sort algorithms. Moreover, the gradual rather than abrupt growth in I/O volume for increasing input sizes significantly eases design and implementation of intra-query memory management policies." SIGMOD Conference Olympic Records for Data at the 1998 Nagano Games. Edwin R. Lassettre 1998 "The 1998 Nagano Olympic games had more intensive demands on data management than any previous Olympics in history. This talk will take you behind the scenes to talk about the technical challenges and the architectures that made it possible to handle 4.5 Terabytes of data and sustain a total of almost 650 million web requests, reaching a peak of over 103K per minute. We will discuss the overall structure of the most comprehensive and heavily used Internet technology application in history. Many products were involved, both hardware and software, but this talk will focus in on the database and web challenges, the technology that made it possible to support this tremendous workload. High availability, data integrity, high performance, support of both SMPs and clustered architectures were among the features and functions that were critical. We will cover the Olympic Results System, the Commentator Information System, Info '98, Games Management, and the Olympic web site that made this information available to the Internet community. The speaker will be Ed Lassettre, IBM Fellow, and a key member of IBM's Olympic team." SIGMOD Conference Efficient and Transparent Application Recovery in Client-Server Information Systems. David B. Lomet,Gerhard Weikum 1998 Efficient and Transparent Application Recovery in Client-Server Information Systems. SIGMOD Conference Capability Based Mediation in TSIMMIS. Chen Li,Ramana Yerneni,Vasilis Vassalos,Hector Garcia-Molina,Yannis Papakonstantinou,Jeffrey D. Ullman,Murty Valiveti 1998 Capability Based Mediation in TSIMMIS. SIGMOD Conference CQ: A Personalized Update Monitoring Toolkit. 
Ling Liu,Calton Pu,Wei Tang,David Buttler,John Biggs,Tong Zhou,Paul Benninghoff,Wei Han,Fenghua Yu 1998 The CQ project at OGI, funded by DARPA, aims at developing a scalable toolkit and techniques for update monitoring and event-driven information delivery on the net. The main feature of the CQ project is a “personalized update monitoring” toolkit based on continual queries [3]. Comparing with the pure pull (such as DBMSs, various web search engines) and pure push (such as Pointcast, Marimba, Broadcast disks) technology, the CQ project can be seen as a hybrid approach that combines the pull and push technology by supporting personalized update monitoring through a combined client-pull and server-push paradigm. SIGMOD Conference Wavelet-Based Histograms for Selectivity Estimation. Yossi Matias,Jeffrey Scott Vitter,Min Wang 1998 Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P, we need to estimate the fraction of records in the database that satisfy P. Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations. In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an on-line fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attributes effectively, even when the attributes are correlated. Experiments confirm that our histograms offer substantial improvements in accuracy over random sampling and other previous approaches. SIGMOD Conference The Araneus Web-Base Management System. Giansalvatore Mecca,Paolo Atzeni,Alessandro Masci,Paolo Merialdo,Giuseppe Sindoni 1998 The Araneus Web-Base Management System. SIGMOD Conference Using Schematically Heterogeneous Structures. Renée J. Miller 1998 Schematic heterogeneity arises when information that is represented as data under one schema, is represented within the schema (as metadata) in another. Schematic heterogeneity is an important class of heterogeneity that arises frequently in integrating legacy data in federated or data warehousing applications. Traditional query languages and view mechanisms are insufficient for reconciling and translating data between schematically heterogeneous schemas. Higher order query languages, that permit quantification over schema labels, have been proposed to permit querying and restructuring of data between schematically disparate schemas. We extend this work by considering how these languages can be used in practice. Specifically, we consider a restricted class of higher order views and show the power of these views in integrating legacy structures. Our results provide insights into the properties of restructuring transformations required to resolve schematic discrepancies. In addition, we show how the use of these views permits schema browsing and new forms of data independence that are important for global information systems. 
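To illustrate the Matias, Vitter, and Wang entry above, the core of a wavelet-based histogram for a single attribute can be sketched in a few lines of Python: decompose the cumulative frequency distribution with the Haar wavelet, keep only the largest coefficients as the synopsis, and answer range selectivities from the reconstruction. The power-of-two domain, the magnitude-based thresholding, and the function names are illustrative assumptions rather than the paper's exact construction.

    import numpy as np

    def haar_decompose(values):
        # Full Haar decomposition (averaging/differencing) of a power-of-two
        # length vector; coefficients are laid out coarsest level first.
        out = np.array(values, dtype=float)
        length = len(out)
        while length > 1:
            half = length // 2
            avg = (out[0:length:2] + out[1:length:2]) / 2.0
            diff = (out[0:length:2] - out[1:length:2]) / 2.0
            out[:half], out[half:length] = avg, diff
            length = half
        return out

    def haar_reconstruct(coeffs):
        out = np.array(coeffs, dtype=float)
        length = 1
        while length < len(out):
            avg, diff = out[:length].copy(), out[length:2 * length].copy()
            rebuilt = np.empty(2 * length)
            rebuilt[0::2], rebuilt[1::2] = avg + diff, avg - diff
            out[:2 * length] = rebuilt
            length *= 2
        return out

    def build_wavelet_histogram(freqs, num_coeffs):
        # Keep only the largest-magnitude coefficients of the cumulative
        # frequency distribution as the (lossy) histogram synopsis.
        cdf = np.cumsum(np.asarray(freqs, dtype=float))
        coeffs = haar_decompose(cdf)
        synopsis = np.zeros_like(coeffs)
        keep = np.argsort(np.abs(coeffs))[::-1][:num_coeffs]
        synopsis[keep] = coeffs[keep]
        return synopsis, cdf[-1]

    def estimate_range_selectivity(synopsis, total, lo, hi):
        # Estimated fraction of records with attribute value in [lo, hi].
        approx_cdf = haar_reconstruct(synopsis)
        below = approx_cdf[lo - 1] if lo > 0 else 0.0
        return max(approx_cdf[hi] - below, 0.0) / total

Keeping the cumulative rather than the raw frequencies means a range estimate needs only two reconstructed values, which is one reason the abstract highlights histograms built on cumulative distributions.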
Furthermore, these views provide a framework for integrating semi-structured and unstructured queries, such as keyword searches, into a structured querying environment. We show how these views can be used with minimal extensions to existing query engines. We give conditions under which a higher order view is usable for answering a query and provide query translation algorithms. SIGMOD Conference "Panel on Next Generation Database Systems Won't Work Without Semantics!" John Mylopoulos 1998 "Panel on Next Generation Database Systems Won't Work Without Semantics!" SIGMOD Conference Developing a High Traffic, Read-Only Web Site. John Nauman,Ray Suorsa 1998 In this paper, we describe some of the considerations for designing highly trafficked web sites with read-only or read mostly characteristics. SIGMOD Conference Extracting Schema from Semistructured Data. Svetlozar Nestorov,Serge Abiteboul,Rajeev Motwani 1998 Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results. SIGMOD Conference Exploratory Mining and Pruning Optimizations of Constrained Association Rules. Raymond T. Ng,Laks V. S. Lakshmanan,Jiawei Han,Alex Pang 1998 From the standpoint of supporting human-centered discovery of knowledge, the present-day model of mining association rules suffers from the following serious shortcomings: (i) lack of user exploration and control, (ii) lack of focus, and (iii) rigid notion of relationships. In effect, this model functions as a black-box, admitting little user interaction in between. We propose, in this paper, an architecture that opens up the black-box, and supports constraint-based, human-centered exploratory mining of associations. The foundation of this architecture is a rich set of constraint constructs, including domain, class, and SQL-style aggregate constraints, which enable users to clearly specify what associations are to be mined. We propose constrained association queries as a means of specifying the constraints to be satisfied by the antecedent and consequent of a mined association. In this paper, we mainly focus on the technical challenges in guaranteeing a level of performance that is commensurate with the selectivities of the constraints in an association query. To this end, we introduce and analyze two properties of constraints that are critical to pruning: anti-monotonicity and succinctness. We then develop characterizations of various constraints into four categories, according to these properties. Finally, we describe a mining algorithm called CAP, which achieves a maximized degree of pruning for all categories of constraints. 
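The anti-monotonicity property named in the Ng, Lakshmanan, Han, and Pang entry above can be illustrated with a small sketch: a constraint that fails for an itemset also fails for every superset, so it can be checked at candidate-generation time exactly like minimum support. The Python sketch below pushes one made-up anti-monotone constraint (total item price within a budget) into a plain Apriori loop; it demonstrates the property only and is not the CAP algorithm.

    from itertools import combinations

    def apriori_with_antimonotone(transactions, min_support, price, budget):
        # Frequent-itemset mining with an anti-monotone constraint pushed into
        # candidate generation: sum of item prices must stay within 'budget'.
        # If the constraint fails for an itemset it fails for every superset,
        # so such candidates can be dropped immediately, like minimum support.
        def ok(itemset):
            support = sum(1 for t in transactions if itemset <= t)
            return support >= min_support and sum(price[i] for i in itemset) <= budget

        singles = sorted({i for t in transactions for i in t})
        level = [frozenset([i]) for i in singles if ok(frozenset([i]))]
        result = list(level)
        while level:
            known = set(result)
            candidates = {a | b for a, b in combinations(level, 2)
                          if len(a | b) == len(a) + 1}
            level = [c for c in candidates
                     if all(frozenset(s) in known for s in combinations(c, len(c) - 1))
                     and ok(c)]
            result.extend(level)
        return result

    # Made-up example: items priced so that {bread, beer} busts the budget.
    txns = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
    prices = {"bread": 2, "milk": 3, "beer": 5}
    print(apriori_with_antimonotone(txns, min_support=2, price=prices, budget=6))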
Experimental results indicate that CAP can run much faster than several basic algorithms, in some cases as much as 80 times faster. This demonstrates how important the succinctness and anti-monotonicity properties are in delivering the performance guarantee. SIGMOD Conference A Data Mining Application: Customer Retention at the Port of Singapore Authority (PSA). KianSing Ng,Huan Liu,HweeBong Kwah 1998 "“Customer retention” is an important real-world problem in many sales and services related industries today. This work illustrates how we can integrate the various techniques of data-mining, such as decision-tree induction, deviation analysis and multiple concept-level association rules to form an intuitive and novel approach to gauging customers' loyalty and predicting their likelihood of defection. Immediate action taken against these “early-warnings” is often the key to the eventual retention or loss of the customers involved." SIGMOD Conference DataSplash. Chris Olston,Allison Woodruff,Alexander Aiken,Michael Chu,Vuk Ercegovac,Mark Lin,Mybrid Spalding,Michael Stonebraker 1998 Database visualization is an area of growing importance as database systems become larger and more accessible. DataSplash is an easy-to-use, integrated environment for navigating, creating, and querying visual representations of data. We will demonstrate the three main components which make up the DataSplash environment: a navigation system, a direct-manipulation interface for creating and modifying visualizations, and a direct-manipulation visual query system. SIGMOD Conference Similarity Query Processing Using Disk Arrays. Apostolos Papadopoulos,Yannis Manolopoulos 1998 Similarity queries are fundamental operations that are used extensively in many modern applications, whereas disk arrays are powerful storage media of increasing importance. The basic trade-off in similarity query processing in such a system is that increased parallelism leads to higher resource consumption and low throughput, whereas low parallelism leads to higher response times. Here, we propose a technique which is based on a careful investigation of the currently available data in order to exploit parallelism up to a point, retaining low response times during query processing. The underlying access method is a variation of the R*-tree, which is distributed among the components of a disk array, whereas the system is simulated using event-driven simulation. The performance experiments conducted demonstrate that the proposed approach outperforms by factors a previous branch-and-bound algorithm and a greedy algorithm which maximizes parallelism as much as possible. Moreover, the comparison of the proposed algorithm to a hypothetical (non-existing) optimal one (with respect to the number of disk accesses) shows that the former is on average two times slower than the latter. SIGMOD Conference Xmas: An Extensible Main-Memory Storage System for High-Performance Applications. Jang Ho Park,Yong Sik Kwon,Ki Hong Kim,Sangho Lee,Byoung Dae Park,Sang Kyun Cha 1998 Xmas is an extensible main-memory storage system for high-performance embedded database applications. Xmas not only provides the core functionality of a DBMS, such as data persistence, crash recovery, and concurrency control, but also pursues an extensible architecture to meet the requirements from various application areas. One crucial aspect of such extensibility is that an application developer can compose application-specific, high-level operations with a basic set of operations provided by the system.
Called composite actions in Xmas, these operations are processed by a customized Xmas server with minimum interaction with application processes, thus improving the overall performance. This paper first presents the architecture and functionality of Xmas, and then demonstrates a simulation of mobile communication service. SIGMOD Conference Approximate Medians and other Quantiles in One Pass and with Limited Memory. Gurmeet Singh Manku,Sridhar Rajagopalan,Bruce G. Lindsay 1998 We present new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit, and apply for arbitrary value distributions and arrival distributions of the dataset. The main memory requirements are smaller than those reported earlier by an order of magnitude. We also discuss methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are explicit but probabilistic, i.e. they apply with respect to a (user controlled) confidence parameter. We present the algorithms, their theoretical analysis and simulation results on different datasets. SIGMOD Conference The PointCast Network. Satish Ramakrishnan,Vibha Dayal 1998 The PointCast Network. SIGMOD Conference Reusing Invariants: A New Strategy for Correlated Queries. Jun Rao,Kenneth A. Ross 1998 Correlated queries are very common and important in decision support systems. Traditional nested iteration evaluation methods for such queries can be very time consuming. When they apply, query rewriting techniques have been shown to be much more efficient. But query rewriting is not always possible. When query rewriting does not apply, can we do something better than the traditional nested iteration methods? In this paper, we propose a new invariant technique to evaluate correlated queries efficiently. The basic idea is to recognize the part of the subquery that is not related to the outer references and cache the result of that part after its first execution. Later, we can reuse the result and combine it with the result of the rest of the subquery that is changing for each iteration. Our technique applies to arbitrary correlated subqueries. This paper introduces algorithms to recognize the invariant part of a data flow tree, and to restructure the evaluation plan to reuse the stored intermediate result. We also propose an efficient method to teach an existing join optimizer to understand the invariant feature and thus allow it to be able to generate better join plans in the new context. Some other related optimization techniques are also discussed. The proposed techniques were implemented within three months on an existing real commercial database system. We also experimentally evaluate our proposed technique. Our evaluation indicates that, when query rewriting is not possible, the invariant technique is significantly better than the traditional nested iteration method. Even when query rewriting applies, the invariant technique is sometimes better than the query rewriting technique. Our conclusion is that the invariant technique should be considered as one of the alternatives in evaluating correlated queries since it fills the gap left by rewriting techniques. SIGMOD Conference SQL Open Heterogeneous Data Access. Berthold Reinwald,Hamid Pirahesh 1998 We describe the open, extensible architecture of SQL for accessing data stored in external data sources not managed by the SQL engine. 
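The central idea of the Rao and Ross entry above on reusing invariants, namely computing the outer-independent part of a correlated subquery once and reusing its cached result on every outer iteration, can be sketched directly in Python. The helper names, the EXISTS-style semantics, and the example data are assumptions for illustration; the paper's plan restructuring and optimizer integration are not reproduced here.

    def evaluate_correlated_query(outer_rows, invariant_subquery, variant_filter):
        # Nested-iteration evaluation of a correlated subquery in which the
        # part not depending on the outer row (the 'invariant' part) runs once
        # and is cached; only the correlated predicate runs per outer row.
        cached = list(invariant_subquery())          # executed exactly once
        results = []
        for outer in outer_rows:
            matches = [s for s in cached if variant_filter(s, outer)]
            if matches:                               # EXISTS-style subquery
                results.append(outer)
        return results

    # Hypothetical example: orders that have at least one expensive line item
    # shipped to the order's own region.
    orders = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
    line_items = [{"order_region": "EU", "price": 900},
                  {"order_region": "US", "price": 5}]
    expensive = lambda: (li for li in line_items if li["price"] > 100)   # invariant part
    same_region = lambda li, o: li["order_region"] == o["region"]        # correlated part
    print(evaluate_correlated_query(orders, expensive, same_region))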
In this scenario, SQL engines act as middleware servers providing access to external data using SQL DML statements and joining external data with SQL tables in heterogeneous queries. We describe the state of the art in object-relational systems and their companion products, and provide an outlook on future directions. SIGMOD Conference Optimal Multi-Step k-Nearest Neighbor Search. Thomas Seidl,Hans-Peter Kriegel 1998 For an increasing number of modern database applications, efficient support of similarity search becomes an important task. As the complexity of objects such as images, molecules, and mechanical parts grows, so does the complexity of the similarity models. Whereas algorithms that are directly based on indexes work well for simple medium-dimensional similarity distance functions, they do not meet the efficiency requirements of complex high-dimensional and adaptable distance functions. The use of a multi-step query processing strategy is recommended in these cases, and our investigations substantiate that the number of candidates which are produced in the filter step and exactly evaluated in the refinement step is a fundamental efficiency parameter. After revealing the strong performance shortcomings of the state-of-the-art algorithm for k-nearest neighbor search [Korn et al. 1996], we present a novel multi-step algorithm which is guaranteed to produce the minimum number of candidates. Experimental evaluations demonstrate the significant performance gain over the previous solution, and we observed average improvement factors of up to 120 for the number of candidates and up to 48 for the total runtime. SIGMOD Conference Integrating Mining with Relational Database Systems: Alternatives and Implications. Sunita Sarawagi,Shiby Thomas,Rakesh Agrawal 1998 Integrating Mining with Relational Database Systems: Alternatives and Implications. SIGMOD Conference Parallel Mining Algorithms for Generalized Association Rules with Classification Hierarchy. Takahiko Shintani,Masaru Kitsuregawa 1998 Association rule mining has recently attracted strong attention. Usually, the classification hierarchy over the data items is available. Users are interested in generalized association rules that span different levels of the hierarchy, since sometimes more interesting rules can be derived by taking the hierarchy into account. In this paper, we propose new parallel algorithms for mining association rules with classification hierarchy on a shared-nothing parallel machine to improve performance. Our algorithms partition the candidate itemsets over the processors, which exploits the aggregate memory of the system effectively. If the candidate itemsets are partitioned without considering the classification hierarchy, both the items and all their ancestor items have to be transmitted, which causes a prohibitively large amount of communication. Our method minimizes interprocessor communication by considering the hierarchy. Moreover, in our algorithm, the available memory space is fully utilized by identifying the frequently occurring candidate itemsets and copying them over all the processors, through which frequent itemsets can be processed locally without any communication. Thus it can effectively reduce the load skew among the processors. Several experiments are done by changing the granularity of copied itemsets, from the whole tree to small groups of frequent itemsets along the hierarchy.
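The multi-step strategy in the Seidl and Kriegel entry above rests on a filter distance that lower-bounds the exact distance: candidates are refined in increasing filter distance, and the search stops as soon as the current k-th exact distance is no larger than the next filter distance, since no unseen object can then qualify. The in-memory Python sketch below shows that stopping rule; building the full ranking up front and the example filter function are simplifications, not the paper's index-based algorithm.

    import heapq
    from itertools import count

    def multistep_knn(objects, query, k, exact_dist, filter_dist):
        # Multi-step k-NN: filter_dist must lower-bound exact_dist. Candidates
        # are taken in increasing filter distance; refinement stops once the
        # current k-th exact distance is <= the next filter distance, because
        # no unseen object can then beat the k results already held.
        tie = count()  # tiebreaker so heap entries never compare raw objects
        ranking = [(filter_dist(o, query), next(tie), o) for o in objects]
        heapq.heapify(ranking)
        best = []      # max-heap of (-exact_distance, tiebreak, object), size <= k
        while ranking:
            f_dist, _, candidate = heapq.heappop(ranking)
            if len(best) == k and -best[0][0] <= f_dist:
                break
            heapq.heappush(best, (-exact_dist(candidate, query), next(tie), candidate))
            if len(best) > k:
                heapq.heappop(best)
        return sorted((-neg, o) for neg, _, o in best)

    # Toy usage: Euclidean distance in 2-D, lower-bounded by the distance in
    # the first coordinate alone (any correct lower bound would do).
    pts = [(1.0, 5.0), (2.0, 2.0), (9.0, 1.0), (3.0, 3.0)]
    exact = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    lower = lambda a, b: abs(a[0] - b[0])
    print(multistep_knn(pts, (2.5, 2.5), k=2, exact_dist=exact, filter_dist=lower))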
The coarser the grain, the simpler the control, but it is rather difficult to achieve sufficient load balance. The finer the grain, the more complicated the control required, but the load can be balanced quite well. We implemented the proposed algorithms on an IBM SP-2. Performance evaluations show that our algorithms are effective for handling skew and attain a sufficient speedup ratio. SIGMOD Conference Ubiquitous, Self-tuning, Scalable Servers. Peter M. Spiro 1998 Hardware developments allow wonderful reliability and essentially limitless capabilities in storage, networks, memory, and processing power. Costs have dropped dramatically. PCs are becoming ubiquitous. The features and scalability of DBMS software have advanced to the point where most commercial systems can solve virtually all OLTP and DSS requirements. The Internet and application software packages allow rapid deployment and facilitate a broad range of solutions. SIGMOD Conference Are We Working On the Right Problems? (Panel). Michael Stonebraker 1998 There appears to be a discrepancy between the research topics being pursued by the database research community and the key problems facing information systems decision makers such as Chief Information Officers (CIOs). Panelists will present their view of the key problems that would benefit from a research focus in the database research community and will discuss perceived discrepancies. Based on personal experience, the most commonly discussed information systems problems facing CIOs today include: SIGMOD Conference Cost-Based Optimization of Decision Support Queries Using Transient Views. Subbu N. Subramanian,Shivakumar Venkataraman 1998 Next generation decision support applications, besides being capable of processing huge amounts of data, require the ability to integrate and reason over data from multiple, heterogeneous data sources. Often, these data sources differ in a variety of aspects such as their data models, the query languages they support, and their network protocols. Also, typically they are spread over a wide geographical area. The cost of processing decision support queries in such a setting is quite high. However, processing these queries often involves redundancies such as repeated access of the same data source and multiple execution of similar processing sequences. Minimizing these redundancies would significantly reduce the query processing cost. In this paper, we (1) propose an architecture for processing complex decision support queries involving multiple, heterogeneous data sources; (2) introduce the notion of transient-views — materialized views that exist only in the context of execution of a query — that is useful for minimizing the redundancies involved in the execution of these queries; (3) develop a cost-based algorithm that takes a query plan as input and generates an optimal “covering plan” by minimizing redundancies in the original plan; (4) validate our approach by means of an implementation of the algorithms and a detailed performance study based on TPC-D benchmark queries on a commercial database system; and finally, (5) compare and contrast our approach with work in related areas, in particular, the areas of answering queries using views and optimization using common sub-expressions. Our experiments demonstrate the practicality and usefulness of transient-views in significantly improving the performance of decision support queries. SIGMOD Conference SuperSQL: An Extended SQL for Database Publishing and Presentation.
Motomichi Toyama 1998 SuperSQL is an extension of SQL that allows query results to be presented in various media for publishing and presentations with simple but sophisticated formatting capabilities. A SuperSQL query can generate various kinds of materials, for example, a LaTeX source file to publish query results in a nested table, HTML or Java source files to present the result on WWW browsers, and other media including MS-Excel worksheets, Tcl/Tk, O2C, etc. O2C is a data manipulation language of O2 and thus useful for migrating data in a relational database to an object-oriented database. SuperSQL is meant to provide a theoretical and practical foundation for 4GL-type applications such as report writers and DB/WWW coordinators. In this demonstration, we show how TFE reorganizes the query results into various media in a universal way, first by grouping tuples according to an arbitrary tree-structured schema, and then by translating them with the constructors available in the target media. SIGMOD Conference Query Flocks: A Generalization of Association-Rule Mining. Shalom Tsur,Jeffrey D. Ullman,Serge Abiteboul,Chris Clifton,Rajeev Motwani,Svetlozar Nestorov,Arnon Rosenthal 1998 Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the “a-priori” trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called “query flocks,” is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers. SIGMOD Conference Cost Based Query Scrambling for Initial Delays. Tolga Urhan,Michael J. Franklin,Laurent Amsaleg 1998 Remote data access from disparate sources across a wide-area network such as the Internet is problematic due to the unpredictable nature of the communications medium and the lack of knowledge about the load and potential delays at remote sites. Traditional, static query processing approaches break down in this environment because they are unable to adapt in response to unexpected delays. Query scrambling has been proposed to address this problem. Scrambling modifies query execution plans on-the-fly when delays are encountered during runtime. In its original formulation, scrambling was based on simple heuristics, which, although providing good performance in many cases, were also shown to be susceptible to problems resulting from bad scrambling decisions. In this paper we address these shortcomings by investigating ways to exploit query optimization technology to aid in making intelligent scrambling choices. We propose three different approaches to using query optimization for scrambling. These approaches vary, for example, in whether they optimize for total work or response-time, and whether they construct partial or complete alternative plans. Using a two-phase randomized query optimizer, a distributed query processing simulator, and a workload derived from queries of the TPC-D benchmark, we evaluate these different approaches and compare their ability to cope with initial delays in accessing remote sources.
The results show that cost-based scrambling can effectively hide initial delays, but that in the absence of good predictions of expected delay durations, there are fundamental tradeoffs between risk aversion and effectiveness. SIGMOD Conference FileNet Integrated Document Management Database Usage and Issues. Daniel S. Whelan 1998 "The FileNet Integrated Document Management (IDM) products consists of a family of client applications and Imaging and Electronic Document Management (EDM) services. These services provide robust facilities for document creation, update, and deletion along with document search capabilities. Document properties are stored in an underlying relational database (RDBMS); document content is stored in files or in a specialized optical disk hierarchical storage manager. FileNet Corporation's Visual WorkFlo® and Ensemble® workflow products can be utilized in conjunction with FileNet's IDM technologies to automate production and ad hoc business processes respectively. This talk will discuss how Integrated Document Management requirements affect an IDM system's usage of a RDBMS. Some of the areas to be discussed include:" SIGMOD Conference Enterprise Java Platform Data Access. Seth J. White,R. G. G. Cattell,Sheldon J. Finkelstein 1998 This paper describes alternative methods for data access that are available to developers using the Java™ platform and related technologies to create a new generation of enterprise applications. The paper highlights industry trends and describes Java technologies that are responsible for a new paradigm in data access. Java technology represents a new level of portability, scalability, and ease-of-use for applications that require data access. SIGMOD Conference MultiMediaMiner: A System Prototype for Multimedia Data Mining. Osmar R. Zaïane,Jiawei Han,Ze-Nian Li,Sonny Han Seng Chee,Jenny Chiang 1998 Multimedia data mining is the mining of high-level multimedia information and knowledge from large multimedia databases. A multimedia data mining system prototype, MultiMediaMiner, has been designed and developed. It includes the construction of a multimedia data cube which facilitates multiple dimensional analysis of multimedia data, primarily based on visual content, and the mining of multiple kinds of knowledge, including summarization, comparison, classification, association, and clustering. SIGMOD Conference Simultaneous Optimization and Evaluation of Multiple Dimensional Queries. Yihong Zhao,Prasad Deshpande,Jeffrey F. Naughton,Amit Shukla 1998 Database researchers have made significant progress on several research issues related to multidimensional data analysis, including the development of fast cubing algorithms, efficient schemes for creating and maintaining precomputed group-bys, and the design of efficient storage structures for multidimensional data. However, to date there has been little or no work on multidimensional query optimization. Recently, Microsoft has proposed “OLE DB for OLAP” as a standard multidimensional interface for databases. OLE DB for OLAP defines Multi-Dimensional Expressions (MDX), which have the interesting and challenging feature of allowing clients to ask several related dimensional queries in a single MDX expression. In this paper, we present three algorithms to optimize multiple related dimensional queries. Two of the algorithms focus on how to generate a global plan from several related local plans. The third algorithm focuses on generating a good global plan without first generating local plans. 
We also present three new query evaluation primitives that allow related query plans to share portions of their evaluation. Our initial performance results suggest that the exploitation of common subtask evaluation and global optimization can yield substantial performance improvements when relational database systems are used as data sources for multidimensional analysis. VLDB Improving Adaptable Similarity Query Processing by Using Approximations. Mihael Ankerst,Bernhard Braunmüller,Hans-Peter Kriegel,Thomas Seidl 1998 Improving Adaptable Similarity Query Processing by Using Approximations. VLDB Scalable Sweeping-Based Spatial Join. Lars Arge,Octavian Procopiuc,Sridhar Ramaswamy,Torsten Suel,Jeffrey Scott Vitter 1998 Scalable Sweeping-Based Spatial Join. VLDB KODA - The Architecture And Interface of a Data Model Independent Kernel. Gopalan Arun,Ashok Joshi 1998 KODA - The Architecture And Interface of a Data Model Independent Kernel. VLDB Incremental Maintenance for Materialized Views over Semistructured Data. Serge Abiteboul,Jason McHugh,Michael Rys,Vasilis Vassalos,Janet L. Wiener 1998 Incremental Maintenance for Materialized Views over Semistructured Data. VLDB DataBlitz: A High Performance Main-Memory Storage Manager. Jerry Baulier,Philip Bohannon,S. Gogate,S. Joshi,C. Gupta,A. Khivesera,Henry F. Korth,Peter McIlroy,J. Miller,P. P. S. Narayan,M. Nemeth,Rajeev Rastogi,Abraham Silberschatz,S. Sudarshan 1998 DataBlitz: A High Performance Main-Memory Storage Manager. VLDB Bulk-Loading Techniques for Object Databases and an Application to Relational Data. Sihem Amer-Yahia,Sophie Cluet,Claude Delobel 1998 Bulk-Loading Techniques for Object Databases and an Application to Relational Data. VLDB Architecture of Oracle Parallel Server. Roger Bamford,D. Butler,Boris Klots,N. MacNaughton 1998 Architecture of Oracle Parallel Server. VLDB A Database System for Real-Time Event Aggregation in Telecommunication. Jerry Baulier,Stephen Blott,Henry F. Korth,Abraham Silberschatz 1998 A Database System for Real-Time Event Aggregation in Telecommunication. VLDB Materialized Views in Oracle. Randall G. Bello,Karl Dias,Alan Downing,James J. Feenan Jr.,James L. Finnerty,William D. Norcott,Harry Sun,Andrew Witkowski,Mohamed Ziauddin 1998 Materialized Views in Oracle. VLDB R-Tree Based Indexing of Now-Relative Bitemporal Data. Rasa Bliujute,Christian S. Jensen,Simonas Saltenis,Giedrius Slivinskas 1998 R-Tree Based Indexing of Now-Relative Bitemporal Data. VLDB Information, Communication, and Money: For What Can We Charge and How Can We Meter It? Stephen Blott,Henry F. Korth,Abraham Silberschatz 1998 Information, Communication, and Money: For What Can We Charge and How Can We Meter It? VLDB The Drill Down Benchmark. Peter A. Boncz,Tim Rühl,Fred Kwakkel 1998 The Drill Down Benchmark. VLDB Reducing the Braking Distance of an SQL Query Engine. Michael J. Carey,Donald Kossmann 1998 Reducing the Braking Distance of an SQL Query Engine. VLDB Bank of America Case Study: The Information Currency Advantage. Felipe Cariño,Mark Jahnke 1998 Bank of America Case Study: The Information Currency Advantage. VLDB Plan-Per-Tuple Optimization Solution - Parallel Execution of Expensive User-Defined Functions. "Felipe Cariño,William O'Connell" 1998 Plan-Per-Tuple Optimization Solution - Parallel Execution of Expensive User-Defined Functions. VLDB Evaluating Functional Joins Along Nested Reference Sets in Object-Relational and Object-Oriented Databases. 
Reinhard Braumandl,Jens Claußen,Alfons Kemper 1998 Evaluating Functional Joins Along Nested Reference Sets in Object-Relational and Object-Oriented Databases. VLDB Mining Surprising Patterns Using Temporal Description Length. Soumen Chakrabarti,Sunita Sarawagi,Byron Dom 1998 Mining Surprising Patterns Using Temporal Description Length. VLDB Inferring Function Semantics to Optimize Queries. Mitch Cherniack,Stanley B. Zdonik 1998 Inferring Function Semantics to Optimize Queries. VLDB "DTL's DataSpot: Database Exploration Using Plain Language." Shaul Dar,Gadi Entin,Shai Geva,Eran Palmon 1998 "DTL's DataSpot: Database Exploration Using Plain Language." VLDB Issues in Developing Very Large Data Warehouses. Lyman Do,Pamela Drew,Wei Jin,Vish Jumani,David Van Rossum 1998 Issues in Developing Very Large Data Warehouses. VLDB Incremental Clustering for Mining in a Data Warehousing Environment. Martin Ester,Hans-Peter Kriegel,Jörg Sander,Michael Wimmer,Xiaowei Xu 1998 Incremental Clustering for Mining in a Data Warehousing Environment. VLDB Experiences in Federated Databases: From IRO-DB to MIRO-Web. Peter Fankhauser,Georges Gardarin,M. Lopez,J. Muñoz,Anthony Tomasic 1998 Experiences in Federated Databases: From IRO-DB to MIRO-Web. VLDB Computing Iceberg Queries Efficiently. Min Fang,Narayanan Shivakumar,Hector Garcia-Molina,Rajeev Motwani,Jeffrey D. Ullman 1998 Computing Iceberg Queries Efficiently. VLDB On Optimal Node Splitting for R-trees. Yván J. García,Mario A. Lopez,Scott T. Leutenegger 1998 On Optimal Node Splitting for R-trees. VLDB RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. Johannes Gehrke,Raghu Ramakrishnan,Venkatesh Ganti 1998 RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. VLDB DMS: A Parallel Data Mining Server. Felicity A. W. George 1998 DMS: A Parallel Data Mining Server. VLDB Secure Buffering in Firm Real-Time Database Systems. Binto George,Jayant R. Haritsa 1998 "Many real-time database applications arise in electronic financial services, safety-critical installations and military systems where enforcing security is crucial to the success of the enterprise. We investigate here the performance implications, in terms of killed transactions, of guaranteeing multi-level secrecy in a real-time database system supporting applications with firm deadlines. In particular, we focus on the buffer management aspects of this issue. Our main contributions are the following. First, we identify the importance and difficulties of providing secure buffer management in the real-time database environment. Second, we present SABRE, a novel buffer management algorithm that provides covert-channel-free security. SABRE employs a fully dynamic one-copy allocation policy for efficient usage of buffer resources. It also incorporates several optimizations for reducing the overall number of killed transactions and for decreasing the unfairness in the distribution of killed transactions across security levels. Third, using a detailed simulation model, the real-time performance of SABRE is evaluated against unsecure conventional and real-time buffer management policies for a variety of security-classified transaction workloads and system configurations. Our experiments show that SABRE provides security with only a modest drop in real-time performance. Finally, we evaluate SABRE's performance when augmented with the GUARD adaptive admission control policy.
Our experiments show that this combination provides close to ideal fairness for real-time applications that can tolerate covert-channel bandwidths of up to one bit per second (a limit specified in military standards)." VLDB Is Web-site Management a Database Problem? Daniela Florescu,Alon Y. Levy,Dan Suciu 1998 Is Web-site Management a Database Problem? VLDB Clustering Categorical Data: An Approach Based on Dynamical Systems. David Gibson,Jon M. Kleinberg,Prabhakar Raghavan 1998 We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By “categorical data,” we mean tables with fields that cannot be naturally ordered by a metric – e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems. VLDB nD-SQL: A Multi-Dimensional Language for Interoperability and OLAP. Frédéric Gingras,Laks V. S. Lakshmanan 1998 nD-SQL: A Multi-Dimensional Language for Interoperability and OLAP. VLDB Proximity Search in Databases. Roy Goldman,Narayanan Shivakumar,Suresh Venkatasubramanian,Hector Garcia-Molina 1998 Proximity Search in Databases. VLDB Design and Analysis of Parametric Query Optimization Algorithms. Sumit Ganguly 1998 Design and Analysis of Parametric Query Optimization Algorithms. VLDB Resource Scheduling for Composite Multimedia Objects. Minos N. Garofalakis,Yannis E. Ioannidis,Banu Özden 1998 Resource Scheduling for Composite Multimedia Objects. VLDB Expiring Data in a Warehouse. Hector Garcia-Molina,Wilburt Labio,Jun Yang 1998 Expiring Data in a Warehouse. VLDB Hash Joins and Hash Teams in Microsoft SQL Server. Goetz Graefe,Ross Bunker,Shaun Cooper 1998 Hash Joins and Hash Teams in Microsoft SQL Server. VLDB Binding Propagation in Disjunctive Databases. Sergio Greco 1998 Binding Propagation in Disjunctive Databases. VLDB Low-Cost Compensation-Based Query Processing. Øystein Grøvlen,Svein-Olaf Hvasshovd,Øystein Torbjørnsen 1998 Low-Cost Compensation-Based Query Processing. VLDB Diag-Join: An Opportunistic Join Algorithm for 1:N Relationships. Sven Helmer,Till Westmann,Guido Moerkotte 1998 Diag-Join: An Opportunistic Join Algorithm for 1:N Relationships. VLDB MindReader: Querying Databases Through Multiple Examples. Yoshiharu Ishikawa,Ravishankar Subramanya,Christos Faloutsos 1998 MindReader: Querying Databases Through Multiple Examples. VLDB Optimal Histograms with Quality Guarantees. H. V. Jagadish,Nick Koudas,S. Muthukrishnan,Viswanath Poosala,Kenneth C. Sevcik,Torsten Suel 1998 Optimal Histograms with Quality Guarantees. VLDB Performance Measurements of Tertiary Storage Devices. Theodore Johnson,Ethan L. Miller 1998 Performance Measurements of Tertiary Storage Devices. VLDB Checkpointing in Oracle. Ashok Joshi,William Bridge,Juan Loaiza,Tirthankar Lahiri 1998 Checkpointing in Oracle. VLDB Algorithms for Mining Distance-Based Outliers in Large Datasets. Edwin M. Knorr,Raymond T. Ng 1998 Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining. Flip Korn,Alexandros Labrinidis,Yannis Kotidis,Christos Faloutsos 1998 Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining. 
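One plausible reading of the weight-propagation idea in the Gibson, Kleinberg, and Raghavan entry above is the fixed-point iteration sketched below in Python: each categorical value repeatedly absorbs the current weights of the values it co-occurs with, and the weight vector is then renormalized, so that frequently co-occurring groups of values reinforce one another. The update rule, the symmetry-breaking initialization, and the normalization are illustrative assumptions rather than the paper's exact dynamical system.

    import math
    from collections import defaultdict

    def propagate_weights(rows, iterations=20):
        # rows: list of tuples of categorical values, one tuple per record.
        # Each value repeatedly absorbs the weights of the values it co-occurs
        # with; the vector is renormalized after every sweep.
        values = sorted({v for row in rows for v in row})
        weight = {v: 1.0 + 0.01 * i for i, v in enumerate(values)}  # break symmetry
        for _ in range(iterations):
            new = defaultdict(float)
            for row in rows:
                row_total = sum(weight[v] for v in row)
                for v in row:
                    new[v] += row_total - weight[v]
            norm = math.sqrt(sum(w * w for w in new.values())) or 1.0
            weight = {v: new[v] / norm for v in values}
        return weight

    # Example records over two categorical fields (maker, model). In the
    # paper's setting, cluster structure is read off from how the weights of
    # values separate; this sketch only shows the propagation step itself.
    rows = [("honda", "civic"), ("honda", "accord"), ("bmw", "m3"), ("bmw", "z4")]
    print(propagate_weights(rows))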
VLDB Selectivity Estimation in Extensible Databases - A Neural Network Approach. M. Seetha Lakshmi,Shaoyu Zhou 1998 Selectivity Estimation in Extensible Databases - A Neural Network Approach. VLDB Querying Continuous Time Sequences. Ling Lin,Tore Risch 1998 Querying Continuous Time Sequences. VLDB Determining Text Databases to Search in the Internet. Weiyi Meng,King-Lup Liu,Clement T. Yu,Xiaodong Wang,Yuhsi Chang,Naphtali Rishe 1998 Determining Text Databases to Search in the Internet. VLDB A Single Pass Computing Engine for Interactive Analysis of VLDBs. Ted Mihalisin 1998 A Single Pass Computing Engine for Interactive Analysis of VLDBs. VLDB Using Schema Matching to Simplify Heterogeneous Data Translation. Tova Milo,Sagit Zohar 1998 Using Schema Matching to Simplify Heterogeneous Data Translation. VLDB MapInfo SpatialWare: A Spatial Information Server for RDBMS. Chebel Mina 1998 MapInfo SpatialWare: A Spatial Information Server for RDBMS. VLDB Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. Guido Moerkotte 1998 Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. VLDB Algorithms for Mining Association Rules for Binary Segmentations of Huge Categorical Databases. Yasuhiko Morimoto,Takeshi Fukuda,Hirofumi Matsuzawa,Takeshi Tokuyama,Kunikazu Yoda 1998 Algorithms for Mining Association Rules for Binary Segmentations of Huge Categorical Databases. VLDB Design, Implementation, and Performance of the LHAM Log-Structured History Data Access Method. "Peter Muth,Patrick E. O'Neil,Achim Pick,Gerhard Weikum" 1998 Design, Implementation, and Performance of the LHAM Log-Structured History Data Access Method. VLDB TOPAZ: a Cost-Based, Rule-Driven, Multi-Phase Parallelizer. Clara Nippl,Bernhard Mitschang 1998 TOPAZ: a Cost-Based, Rule-Driven, Multi-Phase Parallelizer. VLDB Objectivity Industrial Exhibit. 1998 Objectivity Industrial Exhibit. VLDB Fast High-Dimensional Data Search in Incomplete Databases. Beng Chin Ooi,Cheng Hian Goh,Kian-Lee Tan 1998 Fast High-Dimensional Data Search in Incomplete Databases. VLDB Starting (and Sometimes Ending) a Database Company. Jack A. Orenstein 1998 Starting (and Sometimes Ending) a Database Company. VLDB An Asynchronous Avoidance-Based Cache Consistency Algorithm for Client Caching DBMSs. M. Tamer Özsu,Kaladhar Voruganti,Ronald C. Unrau 1998 An Asynchronous Avoidance-Based Cache Consistency Algorithm for Client Caching DBMSs. VLDB Algorithms for Querying by Spatial Structure. Dimitris Papadias,Nikos Mamoulis,Vasilis Delis 1998 Algorithms for Querying by Spatial Structure. VLDB Oracle Industrial Exhibit. Amy Pogue 1998 Oracle Industrial Exhibit. VLDB On the Discovery of Interesting Patterns in Association Rules. Sridhar Ramaswamy,Sameer Mahajan,Abraham Silberschatz 1998 On the Discovery of Interesting Patterns in Association Rules. VLDB PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev Rastogi,Kyuseok Shim 1998 "Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each class that can be used to classify subsequent records. A number of popular classifiers construct decision trees to generate class models. 
These classifiers first build a decision tree and then prune subtrees from the decision tree in a subsequent pruning phase to improve accuracy and prevent “overfitting”. Generating the decision tree in two distinct phases could result in a substantial amount of wasted effort since an entire subtree constructed in the first phase may later be pruned in the next phase. In this paper, we propose PUBLIC, an improved decision tree classifier that integrates the second “pruning” phase with the initial “building” phase. In PUBLIC, a node is not expanded during the building phase if it is determined that it will be pruned during the subsequent pruning phase. In order to make this determination for a node, before it is expanded, PUBLIC computes a lower bound on the minimum cost subtree rooted at the node. This estimate is then used by PUBLIC to identify the nodes that are certain to be pruned, and for such nodes, not to expend effort on splitting them. Experimental results with real-life as well as synthetic data sets demonstrate the effectiveness of PUBLIC's integrated approach, which has the ability to deliver substantial performance improvements." VLDB The Heterogeneity Problem and Middleware Technology: Experiences with and Performance of Database Gateways. Fernando de Ferreira Rezende,Klaudia Hergula 1998 The Heterogeneity Problem and Middleware Technology: Experiences with and Performance of Database Gateways. VLDB Bridging Heterogeneity: Research and Practice of Database Middleware Technology. Fernando de Ferreira Rezende,Günter Sauter 1998 Bridging Heterogeneity: Research and Practice of Database Middleware Technology. VLDB Active Storage for Large-Scale Data Mining and Multimedia. Erik Riedel,Garth A. Gibson,Christos Faloutsos 1998 Active Storage for Large-Scale Data Mining and Multimedia. VLDB The Cubetree Storage Organization. Nick Roussopoulos,Yannis Kotidis 1998 The Cubetree Storage Organization. VLDB "IBM's DB2 Universal Database demonstrations at VLDB'98." K. Bernhard Schiefer,Jim Kleewein,Karen Brannon,Guy M. Lohman,Gene Fuh 1998 "IBM's DB2 Universal Database demonstrations at VLDB'98." VLDB The ADABAS Buffer Pool Manager. Harald Schöning 1998 The ADABAS Buffer Pool Manager. VLDB Technology and the Future of Commerce and Finance (Abstract). David Elliot Shaw 1998 Technology and the Future of Commerce and Finance (Abstract). VLDB WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. Gholamhosein Sheikholeslami,Surojit Chatterjee,Aidong Zhang 1998 WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB Filtering with Approximate Predicates. Narayanan Shivakumar,Hector Garcia-Molina,Chandra Chekuri 1998 Filtering with Approximate Predicates. VLDB Materialized View Selection for Multidimensional Datasets. Amit Shukla,Prasad Deshpande,Jeffrey F. Naughton 1998 Materialized View Selection for Multidimensional Datasets. VLDB Scalable Techniques for Mining Causal Structures. Craig Silverstein,Sergey Brin,Rajeev Motwani,Jeffrey D. Ullman 1998 Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form “the existence of item A implies the existence of item B.” However, such rules indicate only a statistical relationship between A and B.
They do not specify the nature of the relationship: whether the presence of A causes the presence of B, or the converse, or some other attribute or phenomenon causes both to appear together. In applications, knowing such causal relationships is extremely useful for enhancing understanding and effecting change. While distinguishing causality from correlation is a truly difficult problem, recent work in statistics and Bayesian learning provides some avenues of attack. In these fields, the goal has generally been to learn complete causal models, which are essentially impossible to learn in large-scale data mining applications with a large number of variables. In this paper, we consider the problem of determining causal relationships, instead of mere associations, when mining market basket data. We identify some problems with the direct application of Bayesian learning ideas to mining large databases, concerning both the scalability of algorithms and the appropriateness of the statistical techniques, and introduce some initial ideas for dealing with these problems. We present experimental results from applying our algorithms on several large, real-world data sets. The results indicate that the approach proposed here is both computationally feasible and successful in identifying interesting causal structures. An interesting outcome is that it is perhaps easier to infer the lack of causality than to infer causality, information that is useful in preventing erroneous decision making. VLDB Massive Stochastic Testing of SQL. Donald R. Slutz 1998 Massive Stochastic Testing of SQL. VLDB The National Medical Knowledge Bank. Warren Sterling 1998 The National Medical Knowledge Bank. VLDB Atomicity versus Anonymity: Distributed Transactions for Electronic Commerce. J. D. Tygar 1998 Atomicity versus Anonymity: Distributed Transactions for Electronic Commerce. VLDB Heterogeneous Database Query Optimization in DB2 Universal DataJoiner. Shivakumar Venkataraman,Tian Zhang 1998 Heterogeneous Database Query Optimization in DB2 Universal DataJoiner. VLDB From Data Independence to Knowledge Independence: An on-going Story. Laurent Vieille 1998 From Data Independence to Knowledge Independence: An on-going Story. VLDB A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. Roger Weber,Hans-Jörg Schek,Stephen Blott 1998 A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB A Raster Approximation For Processing of Spatial Joins. Geraldo Zimbrao,Jano Moreira de Souza 1998 A Raster Approximation For Processing of Spatial Joins. VLDB Safely and Efficiently Updating References During On-line Reorganization. Chendong Zou,Betty Salzberg 1998 Safely and Efficiently Updating References During On-line Reorganization. VLDB Buffering and Read-Ahead Strategies for External Mergesort. Weiye Zhang,Per-Åke Larson 1998 Buffering and Read-Ahead Strategies for External Mergesort. SIGMOD Record Repositories and Object Oriented Databases. Philip A. Bernstein 1998 Repositories and Object Oriented Databases. SIGMOD Record "WebDB '98: International Workshop on the Web and Databases." Paolo Atzeni,Alberto O. Mendelzon 1998 "WebDB '98: International Workshop on the Web and Databases." SIGMOD Record Component-based E-Commerce: Assessment of Current Practices and Future Directions. Martin Bichler,Arie Segev,J.
Leon Zhao 1998 Component-based e-commerce technology is a recent trend towards resolving the e-commerce challenge at both system and application levels. Instead of delivering a system as a prepacked monolith system containing any conceivable feature, component-based systems consist of a lightweight kernel to which new features can be added in the form of components. In order to identify the central problems in component-based e-commerce and ways to deal with them, we investigate prototypes, technologies, and frameworks that will transcend the current state of the practice in Internet commerce. In this paper, we first discuss the current practices and trends in component-based electronic commerce based on the International Workshop on Component-based Electronic Commerce. Then, we investigate a number of research issues and future directions in component-based development for electronic commerce. SIGMOD Record Design and Implementation of RMP - A Virtual Electronic Market Place. Susanne Boll,Wolfgang Klas,Bernard Battaglin 1998 Design and Implementation of RMP - A Virtual Electronic Market Place. SIGMOD Record The Asilomar Report on Database Research. Philip A. Bernstein,Michael L. Brodie,Stefano Ceri,David J. DeWitt,Michael J. Franklin,Hector Garcia-Molina,Jim Gray,Gerald Held,Joseph M. Hellerstein,H. V. Jagadish,Michael Lesk,David Maier,Jeffrey F. Naughton,Hamid Pirahesh,Michael Stonebraker,Jeffrey D. Ullman 1998 The Asilomar Report on Database Research. SIGMOD Record "KRDB '98: The 5th International Workshop on Knowledge Representation Meets Databases." Alexander Borgida,Vinay K. Chaudhri,Martin Staudt 1998 "KRDB '98: The 5th International Workshop on Knowledge Representation Meets Databases." SIGMOD Record Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining. Alex G. Büchner,Maurice D. Mulvenna 1998 This article describes a novel way of combining data mining techniques on Internet data in order to discover actionable marketing intelligence in electronic commerce scenarios. The data that is considered not only covers various types of server and web meta information, but also marketing data and knowledge. Furthermore, heterogeneity resolution thereof and Internet- and electronic commerce-specific pre-processing activities are embedded. A generic web log data hypercube is formally defined and schematic designs for analytical and predictive activities are given. From these materialised views, various online analytical web usage data mining techniques are shown, which include marketing expertise as domain knowledge and are specifically designed for electronic commerce purposes. SIGMOD Record T2: A Customizable Parallel Database for Multi-Dimensional Data. Chialin Chang,Anurag Acharya,Alan Sussman,Joel H. Saltz 1998 In this paper, we present T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for common operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying grids. Most other systems for multi-dimensional data have focused on uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. 
Many real datasets, however, are non-uniform or unstructured. For example, satellite data consists of a two-dimensional strip that is embedded in a three-dimensional space; water contamination studies use unstructured meshes to selectively simulate regions and so on. T2 can handle both uniform and non-uniform datasets. SIGMOD Record Database Research at Columbia University. Shih-Fu Chang,Luis Gravano,Gail E. Kaiser,Kenneth A. Ross,Salvatore J. Stolfo 1998 Database Research at Columbia University. SIGMOD Record Workshop Report on Experiences Using Object Data Management in the Real-World. Akmal B. Chaudhri 1998 "The OOPSLA '97 Workshop on Experiences Using Object Data Management in the Real-World was held at the Cobb Galleria Centre in Atlanta, Georgia on Monday 6 October 1997. This report summarises some of the commercial case-study presentations made by workshop participants." SIGMOD Record Applications of the JAVA Programming Language to Database Management. Bradley F. Burton,V. Wiktor Marek 1998 Applications of the JAVA Programming Language to Database Management. SIGMOD Record Enhanced Nearest Neighbour Search on the R-tree. King Lum Cheung,Ada Wai-Chee Fu 1998 Multimedia databases usually deal with huge amounts of data and it is necessary to have an indexing structure such that efficient retrieval of data can be provided. The R-tree, with its variations, is a commonly cited indexing method. In this paper we propose an improved nearest neighbor search algorithm on the R-tree and its variants. The improvement lies in the removal of two heuristics that have been used in previous R*-tree work, which we prove cannot improve on the pruning power during a search. SIGMOD Record Towards On-Line Analytical Mining in Large Databases. Jiawei Han 1998 Great efforts have been made in the Intelligent Database Systems Research Lab on the research and development of efficient data mining methods and the construction of on-line analytical data mining systems. Our work has been focused on the integration of data mining and OLAP technologies and the development of scalable, integrated, and multiple data mining functions. A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses. The system implements a wide spectrum of data mining functions, including characterization, comparison, association, classification, prediction, and clustering. It also builds up a user-friendly, interactive data mining environment and a set of knowledge visualization tools. In-depth research has been performed on the efficiency and scalability of data mining methods. Moreover, the research has been extended to spatial data mining, multimedia data mining, text mining, and Web mining with several new data mining system prototypes constructed or under construction, including GeoMiner, MultiMediaMiner, and WebLogMiner. This article summarizes our research and development activities in the last several years and shares our experiences and lessons with the readers. SIGMOD Record Electronic Market: The Roadmap for University Libraries and Members to Survive in the Information Jungle. Michael Christoffel,Sebastian Pulkowski,Bethina Schmitt,Peter C. Lockemann 1998 This contribution argues that electronic markets can serve as a powerful mechanism to entice providers to identify their customer base and to offer customer-oriented, high-quality and economical services and to induce customers to a more focused and price-conscious behavior.
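Returning briefly to the Cheung and Fu entry above: their result is that nearest-neighbour search on the R-tree can rely on the MINDIST lower bound alone for pruning. The Python sketch below shows a best-first search in that spirit over a toy R-tree-like structure of nested dictionaries; the node layout and the best-first (rather than depth-first) traversal are assumptions for illustration, not the paper's algorithm.

    import heapq
    from itertools import count

    def mindist(point, mbr):
        # Smallest possible distance from a query point to any point inside a
        # minimum bounding rectangle given as (lower_corner, upper_corner).
        lo, hi = mbr
        return sum(max(l - p, 0.0, p - h) ** 2 for p, l, h in zip(point, lo, hi)) ** 0.5

    def nearest_neighbour(root, query):
        # Best-first search: subtrees are visited in order of MINDIST, and the
        # search stops once the closest unvisited subtree cannot contain
        # anything nearer than the best point found so far.
        tie = count()
        heap = [(0.0, next(tie), root)]
        best = (float("inf"), None)
        while heap:
            bound, _, node = heapq.heappop(heap)
            if bound >= best[0]:
                break
            if node["leaf"]:
                # Leaf entries are (point, record_id) pairs.
                for point, rid in node["entries"]:
                    d = sum((p - q) ** 2 for p, q in zip(point, query)) ** 0.5
                    if d < best[0]:
                        best = (d, rid)
            else:
                # Internal entries are child nodes carrying an 'mbr'.
                for child in node["entries"]:
                    heapq.heappush(heap, (mindist(query, child["mbr"]), next(tie), child))
        return best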
The paper claims that this should be particularly true for the provision of and access to scientific literature, where the tradition so far has been mostly free access by customers and non-transparent cost accounting and service procurement by university libraries. We report on a project for developing a technical network infrastructure that allows for more cost-transparent access to scientific literature by campus users and attempts to add a competitive element to library services. Equally important, it provides added value to the users so that they can orient themselves in the vast expanses of scientific literature much faster and more economically. We cover three major elements of the infrastructure: user agents, traders and source wrappers. SIGMOD Record Building Database-driven Electronic Catalogs. Sherif Danish 1998 This paper describes issues and solutions related to the creation of a product information database in the enterprise, and the use of this database as a foundation for deploying an electronic catalog. Today, product information is typically managed in document composition systems and communicated on paper. In the new wired world, these processes are undergoing fundamental changes to cope with time-to-market pressure and the need for accurate, complete, and structured presentation of product information. Electronic catalogs are the answer. SIGMOD Record "Guest Editor's Introduction." Asuman Dogac 1998 "Guest Editor's Introduction." SIGMOD Record A Workflow-based Electronic Marketplace on the Web. Asuman Dogac,Ilker Durusoy,Sena Nural Arpinar,Nesime Tatbul,Pinar Koksal,Ibrahim Cingil,Nazife Dimililer 1998 In this paper, we describe an architecture for an open marketplace exploiting workflow technology and the currently emerging data exchange and metadata representation standards on the Web. In this market architecture, electronic commerce is realized through the adaptable workflow templates provided by the marketplace to its users. Having workflow templates for electronic commerce processes results in a component-based architecture where components can be agents (both buying and selling) as well as existing applications invoked by the workflows. Other advantages provided by the workflow technology are forward recovery, detailed logging of the processes through the workflow history manager, and the ability to specify data and control flow among the workflow components. In the architecture proposed, the resources expose their metadata using the Resource Description Framework (RDF) to be accessed by the resource discovery agents, and their content through the Extensible Markup Language (XML) to be accessed by the selling agents using the Document Object Model (DOM). A common set of Document Type Definitions (DTDs) is used to eliminate the need for an ontology. The marketplace contains an Intelligent Directory Service (IDS) which makes it possible for agents to find out about each other through a matchmaking mechanism. References to the related Document Type Definitions (DTDs) are stored in the IDS. The IDS also contains the template workflows for buying and selling processes. SIGMOD Record An Anonymous Electronic Commerce Scheme with an Off-Line Authority and Untrusted Agents. Josep Domingo-Ferrer,Jordi Herrera-Joancomartí 1998 "In recent years, the exponential growth of computer networks has created an incredibly large offer of products and services on the net.
Such a huge amount of information makes it impossible for a single person to analyze all existing offers of a product on the net and decide which of them best fits her requirements. This problem is solved with intelligent trade agents (ITA), which are programs that have the ability to roam a network, collect business-related data and use them to make decisions to buy goods on their owners' behalf. Known ITA systems do not provide anonymity in transactions, require an on-line trusted third party and implicitly assume that the user trusts the ITA. We present a new scheme for an intelligent untrusted trade agent system allowing anonymous electronic transactions with an off-line trusted third party." SIGMOD Record Database Techniques for the World-Wide Web: A Survey. Daniela Florescu,Alon Y. Levy,Alberto O. Mendelzon 1998 Database Techniques for the World-Wide Web: A Survey. SIGMOD Record Standards in Practice. Andrew Eisenberg,Jim Melton 1998 Standards in Practice. SIGMOD Record SQLJ Part 0, Now Known as SQL/OLB (Object-Language Bindings). Andrew Eisenberg,Jim Melton 1998 SQLJ Part 0, Now Known as SQL/OLB (Object-Language Bindings). SIGMOD Record "Editor's Notes." Michael J. Franklin 1998 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1998 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1998 "Editor's Notes." SIGMOD Record Unbundling Active Functionality. Stella Gatziu,Arne Koschel,Günter von Bültzingsloewen,Hans Fritschi 1998 "New application areas and technical innovations demand more and more new functionality from database management systems. However, adding functions to the DBMS as an integral part of it tends to create monoliths that are difficult to design, implement, validate, maintain and adapt. Such monoliths can be avoided if one configures the DBMS according to the functionality actually needed. In order to identify the basic functional components for this configuration, the current monoliths should be broken up into smaller units, or in other words "unbundled". In this paper we apply unbundling to active database systems. This results in a new form of active mechanisms where active functionality is no longer an integral part of the DBMS functionality. This allows the use of active capabilities with any arbitrary DBMS and in broader contexts. Furthermore, it allows the adaptation of the active functionality to the application profile. Such aspects are crucial for the wide use of active functionality in real applications (database or not)." SIGMOD Record Algebraic Change Propagation for Semijoin and Outerjoin Queries. Timothy Griffin,Bharat Kumar 1998 Many interesting examples in view maintenance involve semijoin and outerjoin queries. In this paper we develop algebraic change propagation algorithms for the following operators: semijoin, anti-semijoin, left outerjoin, right outerjoin, and full outerjoin. SIGMOD Record ADEPT: An Agent-Based Approach to Business Process Management. Nicholas R. Jennings,Timothy J. Norman,Peyman Faratin 1998 "Successful companies organise and run their business activities in an efficient manner. Core activities are completed on time and within specified resource constraints. However, to stay competitive in today's markets, companies need to continually improve their efficiency — business activities need to be completed more quickly, to higher quality and at lower cost.
To this end, there is an increasing awareness of the benefits and potential competitive advantage that well-designed business process management systems can provide. In this paper we argue the case for an agent-based approach, showing how agent technology can improve efficiency by ensuring that business activities are better scheduled, executed, monitored, and coordinated." SIGMOD Record The TriGS Active Object-Oriented Database System - An Overview. Gerti Kappel,Werner Retschitzegger 1998 The TriGS Active Object-Oriented Database System - An Overview. SIGMOD Record A Case for Intelligent Disks (IDISKs). Kimberly Keeton,David A. Patterson,Joseph M. Hellerstein 1998 Decision support systems (DSS) and data warehousing workloads comprise an increasing fraction of the database market today. I/O capacity and associated processing requirements for DSS workloads are increasing at a rapid rate, doubling roughly every nine to twelve months [38]. In response to this increasing storage and computational demand, we present a computer architecture for decision support database servers that utilizes “intelligent” disks (IDISKs). IDISKs utilize low-cost embedded general-purpose processing, main memory, and high-speed serial communication links on each disk. IDISKs are connected to each other via these serial links and high-speed crossbar switches, overcoming the I/O bus bottleneck of conventional systems. By off-loading computation from expensive desktop processors, IDISK systems may improve cost-performance. More importantly, the IDISK architecture allows the processing of the system to scale with increasing storage demand. SIGMOD Record Workflow History Management. Pinar Koksal,Sena Nural Arpinar,Asuman Dogac 1998 A workflow history manager maintains the information essential for workflow monitoring and data mining as well as for recovery and authorization purposes. Certain characteristics of workflow systems, such as the necessity to run these systems in heterogeneous, autonomous and distributed environments and the nature of the data, prevent history management in workflows from being handled by classical data management techniques like distributed DBMSs. We further demonstrate that multi-database query processing techniques are also not appropriate for the problem at hand. In this paper, we describe history management, i.e., the structure of the history and querying of the history, in a fully distributed workflow architecture realized in conformance with the Object Management Architecture (OMA) of the OMG. By a fully distributed architecture we mean that the scheduler of the workflow system is distributed and, in accordance with this, the history objects related to activities are stored in data repositories (such as DBMSs and files) available at the sites involved. We describe the structure of the history objects determined according to the nature of the data and the processing needs, and the possible query processing strategies on these objects using the Object Query Service of the OMG. We then present a comparison of these strategies according to a cost model we developed. SIGMOD Record Mining Fuzzy Association Rules in Databases. Chan Man Kuok,Ada Wai-Chee Fu,Man Hon Wong 1998 "Data mining is the discovery of previously unknown, potentially useful and hidden knowledge in databases. In this paper, we concentrate on the discovery of association rules. Many algorithms have been proposed to find association rules in databases with binary attributes.
We introduce fuzzy association rules of the form 'If X is A then Y is B' to deal with quantitative attributes. X and Y are sets of attributes, and A and B are fuzzy sets which describe X and Y respectively. Using the fuzzy set concept, the discovered rules are more understandable to humans. Moreover, fuzzy sets handle numerical values better than existing methods because fuzzy sets soften the effect of sharp boundaries." SIGMOD Record B-tree Page Size When Caching is Considered. David B. Lomet 1998 B-tree Page Size When Caching is Considered. SIGMOD Record The Microsoft Database Research Group. David B. Lomet,Roger S. Barga,Surajit Chaudhuri,Per-Åke Larson,Vivek R. Narasayya 1998 The Microsoft Database Research Group. SIGMOD Record Towards a Richer Web Object Model. Frank Manola 1998 "The World Wide Web is becoming an increasingly important factor in planning for enterprise distributed computing environments, both to support external access to enterprise systems and information (e.g., by customers, suppliers, and partners), and to support internal enterprise operations. Organizations perceive a number of advantages in using the Web in enterprise computing, a particular advantage being that it provides an information representation which (1) supports interlinking of all kinds of content, (2) is easy for end-users to access, and (3) supports easy content creation using widely-available tools. However, as organizations have attempted to employ the Web in increasingly sophisticated applications, these applications have begun to overlap in complexity the sorts of distributed applications for which distributed object architectures such as OMG's CORBA, and its surrounding Object Management Architecture (OMA) [Sol95], were originally developed. Since the Web was not originally designed to support such applications, Web application development efforts increasingly run into limitations of the basic Web infrastructure. If the Web is to be used as the basis of complex enterprise applications, it must provide generic capabilities similar to those provided by the OMA (although these may need to be adapted to the more open, flexible nature of the Web, and specific requirements of Web applications). This involves such things as providing database-like services (such as enhanced query and transaction support) and their composition in the Web. However, the basic data structuring capabilities provided by the Web (its "object model") must also be addressed, since the ability to define and apply powerful generic services in the Web, and the ability to generally use the Web to support complex applications, depends crucially on the ability of the Web's underlying data structuring facilities to support these complex applications and services." SIGMOD Record XML and Electronic Commerce: Enabling the Network Economy. Bart Meltzer,Robert J. Glushko 1998 There has been a lot of talk about how the Internet is going to change the world economy. Companies will come together in a “plug and play” fashion to form trading partner networks. Virtual companies will be established and new business models can be created based on access to information and agents that can carry it around the world using computer networks. SIGMOD Record "Information Director's Message." Alberto O. Mendelzon 1998 "Information Director's Message." SIGMOD Record "Report on the Second IEEE Metadata Conference (Metadata '97)."
Ron Musick,Christopher Miller 1998 "On September 15th and 16th, 1997, the Second IEEE Metadata Conference was held at the National Oceanic and Atmospheric Administration (NOAA) complex in Silver Spring, Maryland. The main objectives of this conference series are to provide a forum to address metadata issues faced by various communities, promote the interchange of ideas on common technologies and standards related to metadata, and facilitate the development and usage of metadata. Metadata'97 met these objectives, drawing about 280 registered attendees from ten different countries and over one hundred different institutions. The audience included scientists, information technology specialists, and librarians from communities as widespread as finance, climatology, and mass storage. The technical program included two keynote addresses and two panel presentations, as well as twenty-three papers and thirteen posters selected from over one hundred abstracts. We provide highlights of the conference below. For more details, the proceedings are available electronically from the conference homepage at: http://www.llnl.gov/liv_comp/metadata/md97.html. The keynote addresses were "An Architecture for Metadata: The Dublin Core, and why you don't have to like it" by Stuart Weibel, OCLC, and "The Microsoft Repository" by Philip Bernstein, Microsoft. Weibel's talk described the Dublin Core and the Warwick Framework - a series of workshops whose output has been a core set of metadata elements common to data from most domains, along with a "container"-based mechanism for plugging in larger domain-specific sets of metadata, like the FGDC's standard for geospatial metadata. These efforts represent two of the defining works in this community. Weibel touched on the history of this effort and described his belief that standards such as RDF (Resource Description Framework) for the WWW coming from organizations like the World Wide Web Consortium (W3C) will have a major influence on the metadata community in the near future. Bernstein's talk covered the Microsoft Object Repository, which also has the potential for a large impact on what metadata gets stored and how it is managed. Bernstein described the repository as "a place to persist COM objects" (component object model), and as more than just an object-oriented database. The features of a true repository are 1) objects and properties, 2) rich relational semantics, 3) extensibility, and 4) versioning. Repositories are used to help tools interoperate by storing predefined "information models". The information models are the metadata used to describe the underlying COM objects in a standard way such that the objects can be shared across tool boundaries. The main consumers of this type of technology are tool vendors." SIGMOD Record A Componentized Architecture for Dynamic Electronic Markets. Benny Reich,Israel Ben-Shaul 1998 The emergence and growing popularity of Internet-based electronic marketplaces, in their various forms, has raised the challenge to explore genericity in market design. In this paper we present a domain-specific software architecture that delineates the abstract components of a generic market and specifies control and data-flow constraints between them, and a framework that allows convenient pluggability of components that implement specific market policies. The framework was realized in the GEM system. GEM provides infrastructure services that allow market designers to focus solely on market issues.
In addition, it allows dynamic (re)configuration of components. This functionality can be used to change market policies as the environment or market trends change, adding another level of flexibility for market designers and administrators. SIGMOD Record The Middleware Muddle. David Ritter 1998 A new menagerie of middleware is emerging. These products promise great flexibility in partitioning enterprise applications across the diverse corporate computing landscape. What factors should you consider when choosing a solution, and how do current products stack up? More important to the focus of this article, what role should Web servers play? SIGMOD Record Where Will Object Technology Drive Data Administration? Arnon Rosenthal 1998 "Several unifications that the application development process has long needed are now occurring, due to developments in object technologies and standards. These will (gradually) change the way data-intensive applications are developed, reduce databases' prominence in this process, and change data administration's goals and participants. At the same time, the database community needs to ensure that its experiences are leveraged and its concerns are met within the new methodologies and toolsets. We discuss these issues, and illustrate how they apply to a portion of the Department of Defense (DOD). We also examine things that object technology won't accomplish, and identify research problems whose solution would enable further progress." SIGMOD Record Materialized Views and Data Warehouses. Nick Roussopoulos 1998 Materialized Views and Data Warehouses. SIGMOD Record Predator: A Resource for Database Research. Praveen Seshadri 1998 This paper describes PREDATOR, a freely available object-relational database system that has been developed at Cornell University. A major motivation in developing PREDATOR was to create a modern code base that could act as a research vehicle for the database community. Pursuing this goal, this paper briefly describes several features of the system that should make it attractive for database research and education. SIGMOD Record "Chair's Message." Richard T. Snodgrass 1998 "Chair's Message." SIGMOD Record Reminiscences on Influential Papers. Richard T. Snodgrass 1998 Reminiscences on Influential Papers. SIGMOD Record "Chair's Message." Richard T. Snodgrass 1998 "Chair's Message." SIGMOD Record "Chair's Message." Richard T. Snodgrass 1998 "Chair's Message." SIGMOD Record Reminiscences on Influential Papers. Richard T. Snodgrass,Laura M. Haas,Alberto O. Mendelzon,Z. Meral Özsoyoglu,Jan Paredaens,Krithi Ramamritham,Nick Roussopoulos,Jennifer Widom,Philip S. Yu 1998 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Richard T. Snodgrass,Hector Garcia-Molina,Tomasz Imielinski,David Maier,Patricia G. Selinger,Jeffrey D. Ullman 1998 Reminiscences on Influential Papers. SIGMOD Record An Extensible Notation for Spatiotemporal Index Queries. Vassilis J. Tsotras,Christian S. Jensen,Richard T. Snodgrass 1998 Temporal, spatial and spatiotemporal queries are inherently multidimensional, combining predicates on explicit attributes with predicates on time dimension(s) and spatial dimension(s). Much confusion has prevailed in the literature on access methods because no consistent notation exists for referring to such queries. As a contribution towards eliminating this problem, we propose a new and simple notation for spatiotemporal queries.
The notation aims to address the selection-based spatiotemporal queries commonly studied in the literature of access methods. The notation is extensible and can be applied to more general multidimensional, selection-based queries. ICDE Scalable Web Server Design for Distributed Data Management. Scott M. Baker,Bongki Moon 1999 Scalable Web Server Design for Distributed Data Management. ICDE Systematic Multiresolution and Its Application to the World Wide Web. Swarup Acharya,Henry F. Korth,Viswanath Poosala 1999 Systematic Multiresolution and Its Application to the World Wide Web. ICDE Algorithms for Index-Assisted Selectivity Estimation. Paul M. Aoki 1999 The standard mechanisms for query selectivity estimation used in relational database systems (e.g., histograms and quantile values) rely on properties specific to the attribute types (e.g., the ordering of numeric values). The query optimizer in an object-relational database system will, in general, be unable to exploit these mechanisms for user-defined types, requiring the user to create entirely new estimation mechanisms. Worse, writing the selectivity estimation routines is extremely difficult because the software interfaces provided by vendors are relatively low-level. In this paper, we discuss extensions of the generalized search tree, or GiST, to support user-defined selectivity estimation in a variety of ways. We discuss the computation of selectivity estimates with confidence intervals over arbitrary data types using indices, give methods for combining this technique with random sampling, and present results from an experimental comparison of these methods with several estimators from the literature. ICDE Developing a DataBlade for a New Index. Rasa Bliujute,Simonas Saltenis,Giedrius Slivinskas,Christian S. Jensen 1999 Developing a DataBlade for a New Index. ICDE Constraint-Based Rule Mining in Large, Dense Databases. Roberto J. Bayardo Jr.,Rakesh Agrawal,Dimitrios Gunopulos 1999 Constraint-based rule miners find all rules in a given data-set meeting user-specified constraints such as minimum support and confidence. We describe a new algorithm that directly exploits all user-specified constraints including minimum support, minimum confidence, and a new constraint that ensures every mined rule offers a predictive advantage over any of its simplifications. Our algorithm maintains efficiency even at low supports on data that is dense (e.g. relational tables). Previous approaches such as Apriori and its variants exploit only the minimum support constraint, and as a result are ineffective on dense data due to a combinatorial explosion of “frequent itemsets”. ICDE Using Codewords to Protect Database Data from a Class of Software Errors. Philip Bohannon,Rajeev Rastogi,S. Seshadri,Abraham Silberschatz,S. Sudarshan 1999 Using Codewords to Protect Database Data from a Class of Software Errors. ICDE Using Java and CORBA for Implementing Internet Databases. Athman Bouguettaya,Boualem Benatallah,Mourad Ouzzani,Lily Hendra 1999 Using Java and CORBA for Implementing Internet Databases. ICDE The Bulk Index Join: A Generic Approach to Processing Non-Equijoins. Jochen Van den Bercken,Bernhard Seeger,Peter Widmayer 1999 The Bulk Index Join: A Generic Approach to Processing Non-Equijoins. ICDE Indexing Constraint Databases by Using a Dual Representation. Elisa Bertino,Barbara Catania,Boris Chidlovskii 1999 Indexing Constraint Databases by Using a Dual Representation. 
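The constraint-based rule mining entry above (Bayardo, Agrawal and Gunopulos) lends itself to a small illustration. The sketch below is only a simplified reading of the constraints named in that abstract, not the authors' algorithm: it assumes a hypothetical helper support(itemset) returning the number of rows that contain a given itemset, and checks a candidate rule against minimum support, minimum confidence, and a simplified predictive-advantage test that requires the rule to improve on every simplification of its antecedent.

```python
from itertools import combinations

def passes_constraints(antecedent, consequent, support, n_rows,
                       min_sup=0.01, min_conf=0.5, min_imp=0.0):
    """Check a candidate rule antecedent -> consequent against the three constraints.

    support(itemset) is an assumed helper returning the number of rows that
    contain every item in the given frozenset; the support of the empty set
    is taken to be n_rows.
    """
    def confidence(items):
        denom = support(frozenset(items)) if items else n_rows
        return support(frozenset(items) | {consequent}) / denom

    a = tuple(antecedent)
    if support(frozenset(a) | {consequent}) / n_rows < min_sup:
        return False                      # fails minimum support
    conf = confidence(a)
    if conf < min_conf:
        return False                      # fails minimum confidence
    # Simplified predictive-advantage check: the rule must beat every
    # simplification (proper subset of its antecedent) by at least min_imp.
    for k in range(len(a)):
        for sub in combinations(a, k):
            if conf - confidence(sub) < min_imp:
                return False
    return True
```

The paper itself pushes these constraints into the mining phase rather than filtering finished rules as done here; the sketch is only meant to make the constraints concrete.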
ICDE Que Sera, Sera: The Coincidental Confluence of Economics, Business, and Collaborative Computing. Michael L. Brodie 1999 Que Sera, Sera: The Coincidental Confluence of Economics, Business, and Collaborative Computing. ICDE Index Merging. Surajit Chaudhuri,Vivek R. Narasayya 1999 Index Merging. ICDE Declarative and Procedural Object-Oriented Views. Ralph Busse,Peter Fankhauser 1999 Declarative and Procedural Object-Oriented Views. ICDE TP-Monitor-based Workflow Management System Architecture. Christoph Bussler 1999 TP-Monitor-based Workflow Management System Architecture. ICDE Integrating Heterogeneous OO Schemas. Yangjun Chen,Wolfgang Benn 1999 Integrating Heterogeneous OO Schemas. ICDE Ad Hoc OLAP: Expression and Evaluation. Damianos Chatziantoniou 1999 Ad Hoc OLAP: Expression and Evaluation. ICDE The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. Kaushik Chakrabarti,Sharad Mehrotra 1999 The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. ICDE Scalable Classification over SQL Databases. Surajit Chaudhuri,Usama M. Fayyad,Jeff Bernhardt 1999 Scalable Classification over SQL Databases. ICDE Efficient Time Series Matching by Wavelets. Kin-pong Chan,Ada Wai-Chee Fu 1999 Efficient Time Series Matching by Wavelets. ICDE Raster-Spatial Data Declustering Revisited: An Interactive Navigation Perspective. Chung-Min Chen,Rakesh K. Sinha 1999 Raster-Spatial Data Declustering Revisited: An Interactive Navigation Perspective. ICDE Universal Temporal Extensions for Database Languages. Cindy Xinmin Chen,Carlo Zaniolo 1999 Universal Temporal Extensions for Database Languages. ICDE Data Integration by Describing Sources with Constraint Databases. Xun Cheng,Guozhu Dong,Tzekwan Lau,Jianwen Su 1999 Data Integration by Describing Sources with Constraint Databases. ICDE Database Extensions for Complex Forms of Data (Abstract). Samuel DeFazio 1999 Database Extensions for Complex Forms of Data (Abstract). ICDE A Database Approach for Modeling and Querying Video Data. Cyril Decleir,Mohand-Said Hacid,Jacques Kouloumdjian 1999 Indexing video data is essential for providing content based access. In this paper, we consider how database technology can offer an integrated framework for modeling and querying video data. As many concerns in video (e.g., modeling and querying) are also found in databases, databases provide an interesting angle to attack many of the problems. From a video applications perspective, database systems provide a nice basis for future video systems. More generally, database research will provide solutions to many video issues even if these are partial or fragmented. From a database perspective, video applications provide beautiful challenges. Next generation database systems will need to provide support for multimedia data (e.g., image, video, audio). These data types require new techniques for their management (i.e., storing, modeling, querying, etc.). Hence new solutions are significant. This paper develops a data model and a rule-based query language for video content based indexing and retrieval. The data model is designed around the object and constraint paradigms. A video sequence is split into a set of fragments. Each fragment can be analyzed to extract the information (symbolic descriptions) of interest that can be put into a database. This database can then be searched to find information of interest. 
Two types of information are considered: (1) the entities (objects) of interest in the domain of a video sequence, and (2) the video frames which contain these entities. To represent this information, our data model allows facts as well as objects and constraints. We present a declarative, rule-based, constraint query language that can be used to infer relationships about information represented in the model. The language has a clear declarative and operational semantics. ICDE Data Organization and Access for Efficient Data Mining. Brian Dunkel,Nandit Soparkar 1999 Data Organization and Access for Efficient Data Mining. ICDE Concentric Hyperspaces and Disk Allocation for Fast Parallel Range Searching. Hakan Ferhatosmanoglu,Divyakant Agrawal,Amr El Abbadi 1999 Concentric Hyperspaces and Disk Allocation for Fast Parallel Range Searching. ICDE Storage of Multidimensional Arrays Based on Arbitrary Tiling. Paula Furtado,Peter Baumann 1999 Storage of Multidimensional Arrays Based on Arbitrary Tiling. ICDE Clustering Large Datasets in Arbitrary Metric Spaces. Venkatesh Ganti,Raghu Ramakrishnan,Johannes Gehrke,Allison L. Powell,James C. French 1999 Clustering Large Datasets in Arbitrary Metric Spaces. ICDE Capability-Sensitive Query Processing on Internet Sources. Hector Garcia-Molina,Wilburt Labio,Ramana Yerneni 1999 Capability-Sensitive Query Processing on Internet Sources. ICDE Relative Prefix Sums: An Efficient Approach for Querying Dynamic OLAP Data Cubes. Steven Geffner,Divyakant Agrawal,Amr El Abbadi,Terence R. Smith 1999 "Range sum queries on data cubes are a powerful tool for analysis. A range sum query applies an aggregation operation (e.g., SUM) over all selected cells in a data cube, where the selection is specified by providing ranges of values for numeric dimensions. Many application domains require that information provided by analysis tools be current or "near-current." Existing techniques for range sum queries on data cubes, however, can incur update costs on the order of the size of the data cube. Since the size of a data cube is exponential in the number of its dimensions, rebuilding the entire data cube can be very costly. We present an approach that achieves constant-time range sum queries while constraining update costs. Our method reduces the overall complexity of the range sum problem." ICDE Parallel Algorithms for Computing Temporal Aggregates. Jose Alvin G. Gendrano,Bruce C. Huang,Jim M. Rodrigue,Bongki Moon,Richard T. Snodgrass 1999 Parallel Algorithms for Computing Temporal Aggregates. ICDE ROCK: A Robust Clustering Algorithm for Categorical Attributes. Sudipto Guha,Rajeev Rastogi,Kyuseok Shim 1999 ROCK: A Robust Clustering Algorithm for Categorical Attributes. ICDE Efficient Mining of Partial Periodic Patterns in Time Series Database. Jiawei Han,Guozhu Dong,Yiwen Yin 1999 Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE Scalable Trigger Processing. Eric N. Hanson,Chris Carnes,Lan Huang,Mohan Konyala,Lloyd Noronha,Sashi Parthasarathy,J. B. Park,Albert Vernon 1999 Scalable Trigger Processing. ICDE Scheduling and Data Replication to Improve Tape Jukebox Performance. Bruce Hillyer,Rajeev Rastogi,Abraham Silberschatz 1999 Scheduling and Data Replication to Improve Tape Jukebox Performance. ICDE Maintaining Data Cubes under Dimension Updates. Carlos A. Hurtado,Alberto O. Mendelzon,Alejandro A. Vaisman 1999 Maintaining Data Cubes under Dimension Updates. ICDE Improving RAID Performance Using a Multibuffer Technique. Kien A.
Hua,Khanh Vu,Ta-Hsiung Hu 1999 Improving RAID Performance Using a Multibuffer Technique. ICDE Policies in a Resource Manager of Workflow Systems: Modeling, Enforcement and Management. Yan-Nong Huang,Ming-Chien Shan 1999 Policies in a Resource Manager of Workflow Systems: Modeling, Enforcement and Management. ICDE Document Warehousing Based on a Multimedia Database System. Hiroshi Ishikawa,Kazumi Kubota,Yasuo Noguchi,Koki Kato,Miyuki Ono,Naomi Yoshizawa,Yasuhiko Kanemasa 1999 Document Warehousing Based on a Multimedia Database System. ICDE Design and Evaluation of Disk Scheduling Policies for High-Demand Multimedia Servers. M. Farrukh Khan,Arif Ghafoor,Muhammad Naeem Ayyaz 1999 Design and Evaluation of Disk Scheduling Policies for High-Demand Multimedia Servers. ICDE A Messaging-Based Architecture for Enterprise Application Integration. Tommy Joseph 1999 A Messaging-Based Architecture for Enterprise Application Integration. ICDE An Agent-Based Approach to Extending the Native Active Capability of Relational Database Systems. Lijuan Li,Sharma Chakravarthy 1999 An Agent-Based Approach to Extending the Native Active Capability of Relational Database Systems. ICDE Real-Time Data Access Control on B-Tree Index Structures. Tei-Wei Kuo,Chih-Hung Wei,Kam-yiu Lam 1999 Real-Time Data Access Control on B-Tree Index Structures. ICDE Query Routing in Large-Scale Digital Library Systems. Ling Liu 1999 Query Routing in Large-Scale Digital Library Systems. ICDE Complements for Data Warehouses. Dominique Laurent,Jens Lechtenbörger,Nicolas Spyratos,Gottfried Vossen 1999 Complements for Data Warehouses. ICDE Tape-Disk Join Strategies under Disk Contention. Achim Kraiss,Peter Muth,Michael Gillmann 1999 Tape-Disk Join Strategies under Disk Contention. ICDE Data Warehouse Evolution: Trade-Offs between Quality and Cost of Query Rewritings. Amy J. Lee,Andreas Koeller,Anisoara Nica,Elke A. Rundensteiner 1999 Data Warehouse Evolution: Trade-Offs between Quality and Cost of Query Rewritings. ICDE Efficient Theme and Non-Trivial Repeating Pattern Discovering in Music Databases. Chih-Chin Liu,Jia-Lien Hsu,Arbee L. P. Chen 1999 Efficient Theme and Non-Trivial Repeating Pattern Discovering in Music Databases. ICDE Confirmation: A Solution for Non-Compensatability in Workflow Applications. Chengfei Liu,Maria E. Orlowska,Xiaofang Zhou,Xuemin Lin 1999 Confirmation: A Solution for Non-Compensatability in Workflow Applications. ICDE Using XML in Relational Database Applications (Abstract). Susan Malaika 1999 Using XML in Relational Database Applications (Abstract). ICDE Processing Operations with Restrictions in RDBMS without External Sorting: The Tetris Algorithm. Volker Markl,Martin Zirkel,Rudolf Bayer 1999 Processing Operations with Restrictions in RDBMS without External Sorting: The Tetris Algorithm. ICDE Estimating the Usefulness of Search Engines. Weiyi Meng,King-Lup Liu,Clement T. Yu,Wensheng Wu,Naphtali Rishe 1999 Estimating the Usefulness of Search Engines. ICDE Semantic Brokering over Dynamic Heterogeneous Data Sources in InfoSleuth. Marian H. Nodine,William Bohrer,Anne H. H. Ngu 1999 Semantic Brokering over Dynamic Heterogeneous Data Sources in InfoSleuth. ICDE Optimizer and Parallel Engine Extensions for Handling Expensive Methods Based on Large Objects. "William O'Connell,Felipe Cariño,G. Linderman" 1999 Optimizer and Parallel Engine Extensions for Handling Expensive Methods Based on Large Objects. ICDE Integrating Light-Weight Workflow Management Systems within Existing Business Environments. 
Peter Muth,Jeanine Weißenfels,Michael Gillmann,Gerhard Weikum 1999 Integrating Light-Weight Workflow Management Systems within Existing Business Environments. ICDE Query Processing Issues in Image (Multimedia) Databases. Surya Nepal,M. V. Ramakrishna 1999 Query Processing Issues in Image (Multimedia) Databases. ICDE Enhancing Semistructured Data Mediators with Document Type Definitions. Yannis Papakonstantinou,Pavel Velikhov 1999 Enhancing Semistructured Data Mediators with Document Type Definitions. ICDE Mobile Agents for WWW Distributed Database Access. Stavros Papastavrou,George Samaras,Evaggelia Pitoura 1999 Mobile Agents for WWW Distributed Database Access. ICDE Multidimensional Data Modeling for Complex Data. Torben Bach Pedersen,Christian S. Jensen 1999 Multidimensional Data Modeling for Complex Data. ICDE Multiversion Reconciliation for Mobile Databases. Shirish Hemant Phatak,B. R. Badrinath 1999 Multiversion Reconciliation for Mobile Databases. ICDE Fast Approximate Query Answering Using Precomputed Statistics. Viswanath Poosala,Venkatesh Ganti 1999 Fast Approximate Query Answering Using Precomputed Statistics. ICDE Fast Approximate Search Algorithm for Nearest Neighbor Queries in High Dimensions. Sakti Pramanik,Jinhua Li 1999 Fast Approximate Search Algorithm for Nearest Neighbor Queries in High Dimensions. ICDE I/O Complexity for Range Queries on Region Data Stored Using an R-tree. Guido Proietti,Christos Faloutsos 1999 I/O Complexity for Range Queries on Region Data Stored Using an R-tree. ICDE Data Mining from an AI Perspective (Abstract). J. Ross Quinlan 1999 Data Mining from an AI Perspective (Abstract). ICDE On Similarity-Based Queries for Time Series Data. Davood Rafiei 1999 On Similarity-Based Queries for Time Series Data. ICDE Mining Optimized Support Rules for Numeric Attributes. Rajeev Rastogi,Kyuseok Shim 1999 Mining Optimized Support Rules for Numeric Attributes. ICDE Heterogeneous Query Processing through SQL Table Functions. Berthold Reinwald,Hamid Pirahesh,Ganapathy Krishnamoorthy,George Lapis,Brian T. Tran,Swati Vora 1999 Heterogeneous Query Processing through SQL Table Functions. ICDE Working Together in Harmony - An Implementation of the CORBA Object Query Service and Its Evaluation. Uwe Röhm,Klemens Böhm 1999 Working Together in Harmony - An Implementation of the CORBA Object Query Service and Its Evaluation. ICDE Multiple Index Structures for Efficient Retrieval of 2D Objects. Cyrus Shahabi,Maytham Safar,Hezhi Ai 1999 Multiple Index Structures for Efficient Retrieval of 2D Objects. ICDE Exploiting Data Lineage for Parallel Optimization in Extensible DBMSs. Eddie C. Shek,Richard R. Muntz 1999 Exploiting Data Lineage for Parallel Optimization in Extensible DBMSs. ICDE Improving the Access Time Performance of Serpentine Tape Drives. Olav Sandstå,Roger Midtstraum 1999 Improving the Access Time Performance of Serpentine Tape Drives. ICDE A Graph Query Language and Its Query Processing. Lei Sheng,Z. Meral Özsoyoglu,Gultekin Özsoyoglu 1999 A Graph Query Language and Its Query Processing. ICDE Database as an Application Integration Platform (Abstract). Aahok R. Saxena 1999 Database as an Application Integration Platform (Abstract). ICDE The ECHO Method: Concurrency Control Method for a Large-Scale Distributed Database. Yukari Shirota,Atsushi Iizawa,Hiroko Mano,Takashi Yano 1999 The ECHO Method: Concurrency Control Method for a Large-Scale Distributed Database. ICDE Cooperative Caching in Append-only Databases with Hot Spots. Aman Sinha,Craig M. 
Chase,Munir Cochinwala 1999 Cooperative Caching in Append-only Databases with Hot Spots. ICDE Managing Distributed Memory to Meet Multiclass Workload Response Time Goals. Markus Sinnwell,Arnd Christian König 1999 Managing Distributed Memory to Meet Multiclass Workload Response Time Goals. ICDE On Getting Some Answers Quickly, and Perhaps More Later. Kian-Lee Tan,Cheng Hian Goh,Beng Chin Ooi 1999 On Getting Some Answers Quickly, and Perhaps More Later. ICDE Similarity Searching in Text Databases with Multiple Field Types. Kostas Tzeras,Euripides G. M. Petrakis 1999 Similarity Searching in Text Databases with Multiple Field Types. ICDE A Transparent Replication of HTTP Service. Radek Vingralek,Yuri Breitbart,Mehmet Sayal,Peter Scheuermann 1999 A Transparent Replication of HTTP Service. ICDE STING+: An Approach to Active Spatial Data Mining. Wei Wang,Jiong Yang,Richard R. Muntz 1999 STING+: An Approach to Active Spatial Data Mining. ICDE Parallel Classification for Data Mining on Shared-Memory Multiprocessors. Mohammed Javeed Zaki,Ching-Tien Ho,Rakesh Agrawal 1999 Parallel Classification for Data Mining on Shared-Memory Multiprocessors. ICDE Business Objects and Application Integration (Abstract). Saydean Zeldin 1999 Business Objects and Application Integration (Abstract). ICDE A Hypertext Database for Advanced Sharing of Distributed Web Pages. Takanori Yamakita,Takashi Fuji 1999 A Hypertext Database for Advanced Sharing of Distributed Web Pages. ICDE Speeding up Heterogeneous Data Access by Converting and Pushing Down String Comparisons. Weiye Zhang,Per-Åke Larson 1999 Speeding up Heterogeneous Data Access by Converting and Pushing Down String Comparisons. ICDE Formal Semantics of Composite Events for Distributed Environments. Shuang Yang,Sharma Chakravarthy 1999 Formal Semantics of Composite Events for Distributed Environments. ICDE Atomic Commitment in Database Systems over Active Networks. Zhili Zhang,William Perrizo,Victor T.-S. Shi 1999 Atomic Commitment in Database Systems over Active Networks. ICDE Data Warehouse Maintenance under Concurrent Schema and Data Updates. Xin Zhang,Elke A. Rundensteiner 1999 Data Warehouse Maintenance under Concurrent Schema and Data Updates. ICDE An Index Structure for Spatial Joins in Linear Constraint Databases. Hongjun Zhu,Jianwen Su,Oscar H. Ibarra 1999 An Index Structure for Spatial Joins in Linear Constraint Databases. ICDE Fat-Btree: An Update-Conscious Parallel Directory Structure. Haruo Yokota,Yasuhiko Kanemasa,Jun Miyazaki 1999 Fat-Btree: An Update-Conscious Parallel Directory Structure. ICDE On the Semantics of Complex Events in Active Database Management Systems. Detlef Zimmer,Rainer Unland 1999 On the Semantics of Complex Events in Active Database Management Systems. ICDE Hash in Place with Memory Shifting: Datacube Computation Revisited. Jeffrey Xu Yu,Hongjun Lu 1999 Hash in Place with Memory Shifting: Datacube Computation Revisited. SIGMOD Conference OPTICS: Ordering Points To Identify the Clustering Structure. Mihael Ankerst,Markus M. Breunig,Hans-Peter Kriegel,Jörg Sander 1999 "Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. 
Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only 'traditional' clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data." SIGMOD Conference Self-tuning Histograms: Building Histograms Without Looking at Data. Ashraf Aboulnaga,Surajit Chaudhuri 1999 In this paper, we introduce self-tuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection operators to progressively refine the histogram. Since the cost of building and maintaining self-tuning histograms is independent of the data size, self-tuning histograms provide a remarkably inexpensive way to construct histograms for large data sets with little up-front costs. Self-tuning histograms are particularly attractive as an alternative to multi-dimensional traditional histograms that capture dependencies between attributes but are prohibitively expensive to build and maintain. In this paper, we describe the techniques for initializing and refining self-tuning histograms. Our experimental results show that self-tuning histograms provide a low-cost alternative to traditional multi-dimensional histograms with little loss of accuracy for data distributions with low to moderate skew. SIGMOD Conference The Aqua Approximate Query Answering System. Swarup Acharya,Phillip B. Gibbons,Viswanath Poosala,Sridhar Ramaswamy 1999 Aqua is a system for providing fast, approximate answers to aggregate queries, which are very common in OLAP applications. It has been designed to run on top of any commercial relational DBMS. Aqua precomputes synopses (special statistical summaries) of the original data and stores them in the DBMS. It provides approximate answers along with quality guarantees by rewriting the queries to run on these synopses. Finally, Aqua keeps the synopses up-to-date as the database changes, using fast incremental maintenance techniques. SIGMOD Conference Join Synopses for Approximate Query Answering. Swarup Acharya,Phillip B. Gibbons,Viswanath Poosala,Sridhar Ramaswamy 1999 In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. 
In this paper, we demonstrate the difficulty of providing good approximate answers for join-queries using only statistics (in particular, samples) from the base relations. We propose join synopses as an effective solution for this problem and show how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. We present optimal strategies for allocating the available space among the various join synopses when the query work load is known and identify heuristics for the common case when the work load is not known. We also present efficient algorithms for incrementally maintaining join synopses in the presence of updates to the base relations. Our extensive set of experiments on the TPC-D benchmark database show the effectiveness of join synopses and various other techniques proposed in this paper. SIGMOD Conference Selectivity Estimation in Spatial Databases. Swarup Acharya,Viswanath Poosala,Sridhar Ramaswamy 1999 Selectivity estimation of queries is an important and well-studied problem in relational database systems. In this paper, we examine selectivity estimation in the context of Geographic Information Systems, which manage spatial data such as points, lines, poly-lines and polygons. In particular, we focus on point and range queries over two-dimensional rectangular data. We propose several techniques based on using spatial indices, histograms, binary space partitionings (BSPs), and the novel notion of spatial skew. Our techniques carefully partition the input rectangles into subsets and approximate each partition accurately. We present a detailed experimental study comparing the proposed techniques and the best known sampling and parametric techniques. We evaluate them using synthetic as well as real-life TIGER datasets. Based on our experiments, we identify a BSP based partitioning that we call Min-Skew which consistently provides the most accurate selectivity estimates for spatial queries. The Min-Skew partitioning can be constructed efficiently, occupies very little space, and provides accurate selectivity estimates over a broad range of spatial queries. SIGMOD Conference A Multimedia Presentation Algebra. Sibel Adali,Maria Luisa Sapino,V. S. Subrahmanian 1999 Over the last few years, there has been a tremendous increase in the number of interactive multimedia presentations prepared by different individuals and organizations. In this paper, we present an algebra for querying multimedia presentation databases. In contrast to the relational algebra, an algebra for interactive multimedia presentations must operate on trees whose branches reflect different possible playouts of a family of presentations. The query language supports selection type operations for locating objects and presentation paths that are of interest to the user, join type operations for combining presentations from multiple databases into a single presentation, and finally set theoretic operations for comparing different databases. The algebra operations can be used to locate presentations with specific properties and also for creating new presentations by borrowing different components from existing ones. We prove a host of equivalence results for queries in this algebra which may be used to build query optimizers for interactive presentation databases. SIGMOD Conference Nodose Version 2.0. 
Brad Adelberg,Matthew Denny 1999 This paper describes a tool called Nodose that we have developed to expedite the creation of robust wrappers. Nodose allows non-programmers to build components that can convert data from the source format to XML or another generic format. Further, the generated code performs a set of statistical checks at runtime that attempt to find extraction errors before they are propagated back to users. SIGMOD Conference Fast Algorithms for Projected Clustering. Charu C. Aggarwal,Cecilia Magdalena Procopiuc,Joel L. Wolf,Philip S. Yu,Jong Soo Park 1999 Fast Algorithms for Projected Clustering. SIGMOD Conference A New Method for Similarity Indexing of Market Basket Data. Charu C. Aggarwal,Joel L. Wolf,Philip S. Yu 1999 In recent years, many data mining methods have been proposed for finding useful and structured information from market basket data. The association rule model was recently proposed in order to discover useful patterns and dependencies in such data. This paper discusses a method for indexing market basket data efficiently for similarity search. The technique is likely to be very useful in applications which utilize the similarity in customer buying behavior in order to make peer recommendations. We propose an index called the signature table, which is very flexible in supporting a wide range of similarity functions. The construction of the index structure is independent of the similarity function, which can be specified at query time. The resulting similarity search algorithm shows excellent scalability with increasing memory availability and database size. SIGMOD Conference DBIS-Toolkit: Adaptable Middleware for Large Scale Data Delivery. Mehmet Altinel,Demet Aksoy,Thomas Baby,Michael J. Franklin,William Shapiro,Stanley B. Zdonik 1999 DBIS-Toolkit: Adaptable Middleware for Large Scale Data Delivery. SIGMOD Conference Bottom-Up Computation of Sparse and Iceberg CUBEs. Kevin S. Beyer,Raghu Ramakrishnan 1999 We introduce the Iceberg-CUBE problem as a reformulation of the datacube (CUBE) problem. The Iceberg-CUBE problem is to compute only those group-by partitions with an aggregate value (e.g., count) above some minimum support threshold. The result of Iceberg-CUBE can be used (1) to answer group-by queries with a clause such as HAVING COUNT(*) >= X, where X is greater than the threshold, (2) for mining multidimensional association rules, and (3) to complement existing strategies for identifying interesting subsets of the CUBE for precomputation. We present a new algorithm (BUC) for Iceberg-CUBE computation. BUC builds the CUBE bottom-up; i.e., it builds the CUBE by starting from a group-by on a single attribute, then a group-by on a pair of attributes, then a group-by on three attributes, and so on. This is the opposite of all techniques proposed earlier for computing the CUBE, and has an important practical advantage: BUC avoids computing the larger group-bys that do not meet minimum support. The pruning in BUC is similar to the pruning in the Apriori algorithm for association rules, except that BUC trades some pruning for locality of reference and reduced memory requirements. BUC uses the same pruning strategy when computing sparse, complete CUBEs. We present a thorough performance evaluation over a broad range of workloads. Our evaluation demonstrates that (in contrast to earlier assumptions) minimizing the aggregations or the number of sorts is not the most important aspect of the sparse CUBE problem.
The pruning in BUC, combined with an efficient sort method, enables BUC to outperform all previous algorithms for sparse CUBEs, even for computing entire CUBEs, and to dramatically improve Iceberg-CUBE computation. SIGMOD Conference Phoenix: Making Applications Robust. Roger S. Barga,David B. Lomet 1999 Phoenix: Making Applications Robust. SIGMOD Conference XML-Based Information Mediation with MIX. Chaitanya K. Baru,Amarnath Gupta,Bertram Ludäscher,Richard Marciano,Yannis Papakonstantinou,Pavel Velikhov,Vincent Chu 1999 "The MIX mediator system, MIXm, is developed as part of the MIX Project at the San Diego Supercomputer Center, and the University of California, San Diego. MIXm uses XML as the common model for data exchange. Mediator views are expressed in XMAS (XML Matching And Structuring Language), a declarative XML query language. To facilitate user-friendly query formulation and for optimization purposes, MIXm employs XML DTDs as a structural description (in effect, a “schema”) of the exchanged data. The novel features of the system include: Data exchange and integration rely solely on XML, i.e., instance and schema information is represented by XML documents and XML DTDs, respectively. XML queries are denoted in XMAS, which builds upon ideas of languages like XML-QL, MSL, Yat, and UnQL. Additionally, XMAS features powerful grouping and order constructs for generating new integrated XML “objects” from existing ones. The graphical user interface BBQ (Blended Browsing and Querying) is driven by the mediator view DTD and integrates browsing and querying of XML data. Complex queries can be constructed in an intuitive way, resembling QBE. Due to the nested nature of XML data and DTDs, BBQ provides graphical means to specify the nesting and grouping of query results. Query evaluation can be demand-driven, i.e., by the user's navigation into the mediated view." SIGMOD Conference The Jungle Database Search Engine. Michael H. Böhlen,Linas Bukauskas,Curtis E. Dyreson 1999 Information spread in databases cannot be found by current search engines. A database search engine is able to access and advertise databases on the WWW. Jungle is a database search engine prototype developed at Aalborg University. Operating through JDBC connections to remote databases, Jungle extracts and indexes database data and meta-data, building a data store of database information. This information is used to evaluate and optimize queries in the AQUA query language. AQUA is a natural and intuitive database query language that helps users to search for information without knowing how that information is structured. This paper gives an overview of AQUA and describes the implementation of Jungle. SIGMOD Conference DataBlitz Storage Manager: Main Memory Database Performance for Critical Applications. Jerry Baulier,Philip Bohannon,S. Gogate,C. Gupta,S. Haldar,S. Joshi,A. Khivesera,Henry F. Korth,Peter McIlroy,J. Miller,P. P. S. Narayan,M. Nemeth,Rajeev Rastogi,S. Seshadri,Abraham Silberschatz,S. Sudarshan,M. Wilder,C. Wei 1999 DataBlitz Storage Manager: Main Memory Database Performance for Critical Applications. SIGMOD Conference A Comparison of Selectivity Estimators for Range Queries on Metric Attributes. Björn Blohsfeld,Dieter Korus,Bernhard Seeger 1999 In this paper, we present a comparison of nonparametric estimation methods for computing approximations of the selectivities of queries, in particular range queries.
In contrast to previous studies, the focus of our comparison is on metric attributes with large domains, which occur, for example, in spatial and temporal databases. We also assume that only small sample sets of the required relations are available for estimating the selectivity. In addition to the popular histogram estimators, our comparison includes so-called kernel estimation methods. Although these methods have been proven to be among the most accurate estimators known in statistics, they have not been considered for selectivity estimation of database queries so far. We first show how to generate kernel estimators that deliver accurate approximate selectivities of queries. Thereafter, we reveal that two parameters, the number of samples and the so-called smoothing parameter, are important for the accuracy of both kernel estimators and histogram estimators. For histogram estimators, the smoothing parameter determines the number of bins (histogram classes). We then present the optimal smoothing parameter as a function of the number of samples and show how to compute approximations of the optimal parameter. Moreover, we propose a new selectivity estimator that can be viewed as a hybrid of histogram and kernel estimators. Experimental results show the performance of different estimators in practice. We found in our experiments that kernel estimators are most efficient for continuously distributed data sets, whereas for our real data sets the hybrid technique is most promising. SIGMOD Conference Versions and Workspaces in Microsoft Repository. Thomas Bergstraesser,Philip A. Bernstein,Shankar Pal,David Shutt 1999 This paper describes the version and workspace features of Microsoft Repository, a layer that implements fine-grained objects and relationships on top of Microsoft SQL Server. It supports branching and merging of versions, delta storage, checkout-checkin, and single-version views for version-unaware applications. SIGMOD Conference The Cornell Jaguar System: Adding Mobility to PREDATOR. Philippe Bonnet,Kyle Buza,Zhiyuan Chen,Victor Cheng,Randolph Chung,Takako M. Hickey,Ryan Kennedy,Daniel Mahashin,Tobias Mayr,Ivan Oprencak,Praveen Seshadri,Hubert Siu 1999 The Cornell Jaguar System: Adding Mobility to PREDATOR. SIGMOD Conference World Wide Database - Integrating the Web, CORBA, and Databases. Athman Bouguettaya,Boualem Benatallah,Lily Hendra,James Beard,Kevin Smith,Mourad Ouzzani 1999 World Wide Database - Integrating the Web, CORBA, and Databases. SIGMOD Conference Database Patchwork on the Internet. Reinhard Braumandl,Alfons Kemper,Donald Kossmann 1999 Naturally, data processing requires three kinds of resources: the data itself, the functionality (i.e., database operations) and the machines on which to run the operations. Because of the Internet, we believe that in the long run there will be alternative providers for all of these three resources for any given application. Data providers will bring more and more data and more and more different kinds of data to the net. Likewise, function providers will develop new methods to process and work with the data; e.g., function providers might develop new algorithms to compress data or to produce thumbnails out of large images and try to sell these on the Internet.
It is also conceivable that some people allow other people to use spare cycles of their idle machines on the Internet (as in the Condor system of the University of Wisconsin) or that some companies (cycle providers) even specialize in selling computing time to businesses that occasionally need to carry out very complex operations for which regular hardware is not sufficient. At the University of Passau, we are currently developing a distributed database system to be used on the Internet. The goal is to ultimately have a system which is able to run on any machine, manage any kind of data, import any kind of data from other systems and import any kind of database operations. The system is entirely written in Java. One of the most important features of the system is that it is capable of dynamically loading (external) query operators, written in Java and supplied by any function provider, and executing these query operators in concert with pre-defined and other external operators in order to evaluate a query. Compared to object-relational database systems, which allow external data and functionality to be integrated by means of extensions (datablades, extenders or cartridges), or heterogeneous database systems such as Garlic [MS97] or Tsimmis [GMPQ+97], our approach makes it possible to place external query operators anywhere in a query evaluation plan as opposed to restricting the placement of external operations to the “access level” of plans. It would, for example, be possible to make our system execute a completely new relational join method, if somebody finds a new join method that is worth implementing. Because our system is written in Java, it is highly portable and could be used by data, function and cycle providers with almost no effort. Furthermore, our query engine is, of course, completely distributed, providing all the required infrastructure for server-server communication, name services, etc. SIGMOD Conference Update Propagation Protocols For Replicated Databases. Yuri Breitbart,Raghavan Komondoor,Rajeev Rastogi,S. Seshadri,Abraham Silberschatz 1999 Replication is often used in distributed systems to provide a higher level of performance, reliability and availability. Lazy replica update protocols, which propagate updates to replicas through independent transactions after the original transaction commits, have become popular with database vendors due to their superior performance characteristics. However, if lazy protocols are used indiscriminately, they can result in non-serializable executions. In this paper, we propose two new lazy update protocols that guarantee serializability but impose a much weaker requirement on data placement than earlier protocols. Further, many naturally occurring distributed systems, like distributed data warehouses, satisfy this requirement. We also extend our lazy update protocols to eliminate all requirements on data placement. The extension is a hybrid protocol that propagates as many updates as possible in a lazy fashion. We implemented our protocols on the Datablitz database system product developed at Bell Labs. We also conducted an extensive performance study which shows that our protocols outperform existing protocols over a wide range of workloads. SIGMOD Conference The CCUBE Constraint Object-Oriented Database System. Alexander Brodsky,Victor E. Segal,Jia Chen,Pavel A.
Exarkhopoulo 1999 Constraints provide a flexible and uniform way to represent diverse data capturing spatio-temporal behavior, complex modeling requirements, partial and incomplete information etc, and have been used in a wide variety of application domains. Constraint databases have recently emerged to deeply integrate data captured by constraints in databases. This paper reports on the development of the first constraint object-oriented database system, CCUBE, and describes its specification, design and implementation. The CCUBE system is designed to be used for the implementation and optimization of high-level constraint object-oriented query languages as well as for directly building software systems requiring extensible use of constraint database features. The CCUBE data manipulation language, Constraint Comprehension Calculus, is an integration of a constraint calculus for extensible constraint domains within monoid comprehensions, which serve as an optimization-level language for object-oriented queries. The data model for the constraint calculus is based on constraint spatio-temporal (CST) objects that may hold spatial, temporal or constraint data, conceptually represented by constraints. New CST objects are constructed, manipulated and queried by means of the constraint calculus. The model for the monoid comprehensions, in turn, is based on the notion of monoids, which is a generalization of collection and aggregation types. The focal point of our work is achieving the right balance between the expressiveness, complexity and representation usefulness, without which the practical use of the system would not be possible. To that end, CCUBE constraint calculus guarantees polynomial time data complexity, and, furthermore, is tightly integrated with the monoid comprehensions to allow deeply interleaved global optimization. SIGMOD Conference Implementing the Spirit of SQL-99. Paul Brown 1999 This paper describes the current INFORMIX IDS/UD release (9.2 or Centaur) and compares and contrasts its functionality with the features of the SQL-99 language standard. INFORMIX and Illustra have been shipping DBMSs implementing the spirit of the SQL-99 standard for five years. In this paper, we review our experience working with ORDBMS technology, and argue that while SQL-99 is a huge improvement over SQL-92, substantial further work is necessary to make object-relational DBMSs truly useful. Specifically, we describe several interesting pieces of functionality unique to IDS/UD, and several dilemmas our customers have encountered that the standard does not address. SIGMOD Conference Automatic Discovery of Language Models for Text Databases. James P. Callan,Margaret E. Connell,Aiqun Du 1999 The proliferation of text databases within large organizations and on the Internet makes it difficult for a person to know which databases to search. Given language models that describe the contents of each database, a database selection algorithm such as GIOSS can provide assistance by automatically selecting appropriate databases for an information need. Current practice is that each database provides its language model upon request, but this cooperative approach has important limitations. This paper demonstrates that cooperation is not required. Instead, the database selection service can construct its own language models by sampling database contents via the normal process of running queries and retrieving documents. 
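A rough sketch of this query-based sampling idea follows; the probe-term loop, the generic search(term, limit) interface, and the document cap are illustrative assumptions for exposition, not the paper's actual procedure.

    # Sketch: learn a language model of a text database by probing it with
    # queries and counting terms in the retrieved documents.
    from collections import Counter

    def sample_language_model(search, probe_terms, docs_per_query=4, max_docs=300):
        model, seen = Counter(), set()
        for term in probe_terms:
            for doc_id, text in search(term, limit=docs_per_query):
                if doc_id in seen:
                    continue                         # skip documents sampled earlier
                seen.add(doc_id)
                model.update(text.lower().split())   # accumulate term frequencies
                if len(seen) >= max_docs:
                    return model, len(seen)
        return model, len(seen)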
Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents. SIGMOD Conference "O-O, What's Happening to DB2?" Michael J. Carey,Donald D. Chamberlin,Srinivasa Narayanan,Bennet Vance,Doug Doole,Serge Rielau,Richard Swagerman,Nelson Mendonça Mattos 1999 "O-O, What's Happening to DB2?" SIGMOD Conference Hypertext Databases and Data Mining. Soumen Chakrabarti 1999 The volume of unstructured text and hypertext data far exceeds that of structured data. Text and hypertext are used for digital libraries, product catalogs, reviews, newsgroups, medical reports, customer service reports, and the like. Currently measured in billions of dollars, the worldwide internet activity is expected to reach a trillion dollars by 2002. Database researchers have kept some cautious distance from this action. The goal of this tutorial is to expose database researchers to text and hypertext information retrieval (IR) and mining systems, and to discuss emerging issues in the overlapping areas of databases, hypertext, and data mining. SIGMOD Conference Efficient Concurrency Control in Multidimensional Access Methods. Kaushik Chakrabarti,Sharad Mehrotra 1999 The importance of multidimensional index structures to numerous emerging database applications is well established. However, before these index structures can be supported as access methods (AMs) in a “commercial-strength” database management system (DBMS), efficient techniques to provide transactional access to data via the index structure must be developed. Concurrent accesses to data via index structures introduce the problem of protecting ranges specified in the retrieval from phantom insertions and deletions (the phantom problem). This paper presents a dynamic granular locking approach to phantom protection in Generalized Search Trees(GiSTs), an index structure supporting an extensible set of queries and data types. The granular locking technique offers a high degree of concurrency and has a low lock overhead. Our experiments show that the granular locking technique (1) scales well under various system loads and (2) similar to the B-tree case, provides a significantly more efficient implementation compared to predicate locking for multidimensional AMs as well. Since a wide variety of multidimensional index structures can be implemented using GiST, the developed algorithms provide a general solution to concurrency control in multidimensional AMs. To the best of our knowledge, this paper provides the first such solution based on granular locking. SIGMOD Conference An Efficient Bitmap Encoding Scheme for Selection Queries. Chee Yong Chan,Yannis E. Ioannidis 1999 Bitmap indexes are useful in processing complex queries in decision support systems, and they have been implemented in several commercial database systems. A key design parameter for bitmap indexes is the encoding scheme, which determines the bits that are set to 1 in each bitmap in an index. 
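To make the notion of an encoding scheme concrete, the following minimal sketch shows the two classical schemes, equality encoding and range encoding, over an assumed attribute domain {0, 1, 2, 3}; the data and domain size are illustrative and not taken from the paper.

    # Equality encoding: one bitmap per value v; bit j is set iff row j has value v.
    def equality_encode(values, domain_size):
        return [[1 if x == v else 0 for x in values] for v in range(domain_size)]

    # Range encoding: one bitmap per value v; bit j is set iff row j has a value <= v
    # (the bitmap for the largest value is all ones and is usually omitted).
    def range_encode(values, domain_size):
        return [[1 if x <= v else 0 for x in values] for v in range(domain_size)]

    values = [2, 0, 3, 1, 2]         # attribute A for five rows (assumed data)
    eq = equality_encode(values, 4)   # answers A = v by reading one bitmap
    rng = range_encode(values, 4)     # answers v1 <= A <= v2 from two bitmaps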
While the relative performance of the two existing bitmap encoding schemes for simple selection queries of the form “v1 ≤ A ≤ v2” is known (specifically, one of the encoding schemes is better for processing equality queries; i.e., v1 = v2, while the other is better for processing range queries; i.e., v1 < v2), it remains an open question whether these two encoding schemes are indeed optimal for their respective query classes in the sense that there is no other encoding scheme with a better space-time tradeoff. In this paper, we establish a number of optimality results for the existing encoding schemes; in particular, we prove that neither of the two known schemes is optimal for the class of two-sided range queries. We also propose a new encoding scheme and prove that it is optimal for that class. Finally, we present an experimental study comparing the performance of the new encoding scheme with that of the existing ones as well as four hybrid encoding schemes for both simple selection queries and the more general class of membership queries of the form “A ∈ {v1, v2, …, vk}”. These results demonstrate that the new encoding scheme has an overall better space-time performance than existing schemes. SIGMOD Conference Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Kevin Chen-Chuan Chang,Hector Garcia-Molina 1999 Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. SIGMOD Conference On Random Sampling over Joins. Surajit Chaudhuri,Rajeev Motwani,Vivek R. Narasayya 1999 "A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree completely. We undertake a detailed study of this problem and attempt to analyze it in a variety of settings. We present theoretical results explaining the difficulty of this problem and setting limits on the efficiency that can be achieved. Based on new insights into the interaction between join and sampling, we develop join sampling techniques for the settings where our negative results do not apply. Our new sampling algorithms are significantly more efficient than those known earlier. We present an experimental evaluation of our techniques on Microsoft's SQL Server 7.0." SIGMOD Conference A User-Centered Interface for Querying Distributed Multimedia Databases. Isabel F. Cruz,Kimberly M. James 1999 Facilitating information retrieval in the vastly growing realm of digital media has become increasingly difficult. DelaunayMM seeks to assist all users in finding relevant information through an interactive interface that supports pre- and post-query refinement, and a customizable multimedia information display. This project leverages the strengths of visual query languages with a resourceful framework to provide users with a single intuitive interface. The interface and its supporting framework are described in this paper. SIGMOD Conference A Layered Architecture for Querying Dynamic Web Content. Hasan Davulcu,Juliana Freire,Michael Kifer,I. V. Ramakrishnan 1999 The design of webbases, database systems for supporting Web-based applications, is currently an active area of research. In this paper, we propose a 3-layer architecture for designing and implementing webbases for querying dynamic Web content (i.e., data that can only be extracted by filling out multiple forms).
The lowest layer, the virtual physical layer, provides navigation independence by shielding the user from the complexities associated with retrieving data from raw Web sources. Next, the traditional logical layer supports site independence. The top layer is analogous to the external schema layer in traditional databases. Within this architectural framework we address two problems unique to webbases — retrieving dynamic Web content in the virtual physical layer and querying of the external schema by the end user. The layered architecture makes it possible to automate data extraction to a much greater degree than in existing proposals. Wrappers for the virtual physical schema can be created semi-automatically, by asking the webbase designer to navigate through the sites of interest — we call this approach mapping by example. Thus, the webbase designer need not have expertise in the language that maps the physical schema to the raw Web (this should be contrasted to other approaches, which require expertise in various Web-enabled flavors of SQL). For the external schema layer, we propose a semantic extension of the universal relation interface. This interface provides powerful, yet reasonably simple, ad hoc querying capabilities for the end user compared to the currently prevailing “canned” form-based interfaces on the one hand or complex Web-enabling extensions of SQL on the other. Finally, we discuss the implementation of the proposed architecture. SIGMOD Conference Storing Semistructured Data with STORED. Alin Deutsch,Mary F. Fernández,Dan Suciu 1999 Systems for managing and querying semistructured-data sources often store data in proprietary object repositories or in a tagged-text format. We describe a technique that can use relational database management systems to store and manage semistructured data. Our technique relies on a mapping between the semistructured data model and the relational data model, expressed in a query language called STORED. When a semistructured data instance is given, a STORED mapping can be generated automatically using data-mining techniques. We are interested in applying STORED to XML data, which is an instance of semistructured data. We show how a document-type-descriptor (DTD), when present, can be exploited to further improve performance. SIGMOD Conference The Need for Distributed Asynchronous Transactions. Lyman Do,Prabhu Ram,Pamela Drew 1999 The theme of the paper is to promote research on asynchronous transactions. We discuss our experience of executing synchronous transactions on a large distributed production system at The Boeing Company. The poor performance of synchronous transactions in our environment motivated the exploration of asynchronous transactions as an alternative solution. This paper presents the requirements and benefits/limitations of asynchronous transactions. Open issues related to large-scale deployments of asynchronous transactions are also discussed. SIGMOD Conference Petabyte Databases. Dirk Düllmann 1999 This paper describes the use of Object-Database Management Systems (ODBMS) for the storage of High-Energy Physics (HEP) data. SIGMOD Conference Record-Boundary Discovery in Web Documents. David W. Embley,Y. S. Jiang,Yiu-Kai Ng 1999 Extraction of information from unstructured or semistructured Web documents often requires the recognition and delimitation of records. (By “record” we mean a group of information relevant to some entity.)
Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted). SIGMOD Conference Query Optimization in the Presence of Limited Access Patterns. Daniela Florescu,Alon Y. Levy,Ioana Manolescu,Dan Suciu 1999 We consider the problem of query optimization in the presence of limitations on access patterns to the data (i.e., when one must provide values for one of the attributes of a relation in order to obtain tuples). We show that in the presence of limited access patterns we must search a space of annotated query plans, where the annotations describe the inputs that must be given to the plan. We describe a theoretical and experimental analysis of the resulting search space and a novel query optimization algorithm that is designed to perform well under the different conditions that may arise. The algorithm searches the set of annotated query plans, pruning invalid and non-viable plans as early as possible in the search space, and it also uses a best-first search strategy in order to produce a first complete plan early in the search. We describe experiments to illustrate the performance of our algorithm. SIGMOD Conference Query Optimization for Selections Using Bitmaps. Ming-Chuan Wu 1999 Bitmaps are popular indexes for data warehouse (DW) applications and most database management systems offer them today. This paper proposes query optimization strategies for selections using bitmaps. Both continuous and discrete selection criteria are considered. Query optimization strategies are categorized into static and dynamic. Static optimization strategies discussed are the optimal design of bitmaps, and algorithms based on tree and logical reduction. The dynamic optimization discussed is the approach of inclusion and exclusion for both bit-sliced indexes and encoded bitmap indexes. SIGMOD Conference Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? (Panel). Minos N. Garofalakis,Sridhar Ramaswamy,Rajeev Rastogi,Kyuseok Shim 1999 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? (Panel). SIGMOD Conference Daytona And The Fourth-Generation Language Cymbal. Rick Greer 1999 "The Daytona™ data management system is used by AT&T to solve a wide spectrum of data management problems. For example, Daytona is managing a 4 terabyte data warehouse whose largest table contains over 10 billion rows. Daytona's architecture is based on translating its high-level query language Cymbal (which includes SQL as a subset) completely into C and then compiling that C into object code. The system resulting from this architecture is fast, powerful, easy to use and administer, reliable and open to UNIX™ tools. In particular, two forms of data compression plus robust horizontal partitioning enable Daytona to handle terabytes with ease." SIGMOD Conference BOAT-Optimistic Decision Tree Construction. 
Johannes Gehrke,Venkatesh Ganti,Raghu Ramakrishnan,Wei-Yin Loh 1999 BOAT-Optimistic Decision Tree Construction. SIGMOD Conference Ripple Joins for Online Aggregation. Peter J. Haas,Joseph M. Hellerstein 1999 We present a new family of join algorithms, called ripple joins, for online processing of multi-table aggregation queries in a relational database management system (DBMS). Such queries arise naturally in interactive exploratory decision-support applications. Traditional offline join algorithms are designed to minimize the time to completion of the query. In contrast, ripple joins are designed to minimize the time until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval. Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data. Ripple joins also permit the user to dynamically trade off the two key performance factors of on-line aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update. We show how ripple joins can be implemented in an existing DBMS using iterators, and we give an overview of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters. In experiments with an initial implementation of our algorithms in the POSTGRES DBMS, the time required to produce reasonably precise online estimates was up to two orders of magnitude smaller than the time required for the best offline join algorithms to produce exact answers. SIGMOD Conference "Merge Replication in Microsoft's SQL Server 7.0." Brad Hammond 1999 SQL Server 7.0 offers three different styles of replication that we call Transactional Replication, Snapshot Replication, and Merge Replication. Merge Replication means that data changes can be performed at any replica, and that the changes performed at multiple replicas are later merged together. Because Merge Replication allows updates to disconnected replicas, it is particularly well suited to applications that require a lot of autonomy. A special process called the Merge Agent propagates changes between replicas, filters data as appropriate, and detects and handles conflicts according to user-specified rules. SIGMOD Conference Online Association Rule Mining. Christian Hidber 1999 "We present a novel algorithm to compute large itemsets online. It needs at most 2 scans of the transaction sequence. During the first scan the user is free to change the support threshold. The algorithm maintains a superset of all large itemsets and a deterministic lower and upper bound on the support of each itemset. We continuously display the resulting association rules along with an interval on the rule's support and confidence. The algorithm can compute association rules for a transaction sequence which is read from a network and is too large to be stored locally for a rescan. During the second scan we determine the precise support for each large itemset and prune all small itemsets using a new forward-pruning technique." SIGMOD Conference Clustering Methods for Large Databases: From the Past to the Future. Alexander Hinneburg,Daniel A. Keim 1999 Clustering Methods for Large Databases: From the Past to the Future. SIGMOD Conference An Adaptive Query Execution System for Data Integration. Zachary G. Ives,Daniela Florescu,Marc Friedman,Alon Y. Levy,Daniel S.
Weld 1999 Query processing in data integration occurs over network-bound, autonomous data sources. This requires extensions to traditional optimization and execution techniques for three reasons: there is an absence of quality statistics about the data, data transfer rates are unpredictable and bursty, and slow or unavailable data sources can often be replaced by overlapping or mirrored sources. This paper presents the Tukwila data integration system, designed to support adaptivity at its core using a two-pronged approach. Interleaved planning and execution with partial optimization allows Tukwila to quickly recover from decisions based on inaccurate estimates. During execution, Tukwila uses adaptive query operators such as the double pipelined hash join, which produces answers quickly, and the dynamic collector, which robustly and efficiently computes unions across overlapping data sources. We demonstrate that the Tukwila architecture extends previous innovations in adaptive execution (such as query scrambling, mid-execution re-optimization, and choose nodes), and we present experimental evidence that our techniques result in behavior desirable for a data integration system. SIGMOD Conference Querying Network Directories. H. V. Jagadish,Laks V. S. Lakshmanan,Tova Milo,Divesh Srivastava,Dimitra Vista 1999 Hierarchically structured directories have recently proliferated with the growth of the Internet, and are being used to store not only address books and contact information for people, but also personal profiles, network resource information, and network and service policies. These systems provide a means for managing scale and heterogeneity, while allowing for conceptual unity and autonomy across multiple directory servers in the network, in a way far superior to what conventional relational or object-oriented databases offer. Yet, in deployed systems today, much of the data is modeled in an ad hoc manner, and many of the more sophisticated “queries” involve navigational access. In this paper, we develop the core of a formal data model for network directories, and propose a sequence of efficiently computable query languages with increasing expressive power. The directory data model can naturally represent rich forms of heterogeneity exhibited in the real world. Answers to queries expressible in our query languages can exhibit the same kinds of heterogeneity. We present external memory algorithms for the evaluation of queries posed in our directory query languages, and prove the efficiency of each algorithm in terms of its I/O complexity. Our data model and query languages share the flexibility and utility of the recent proposals for semi-structured data models, while at the same time effectively addressing the specific needs of network directory applications, which we demonstrate by means of a representative real-life example. SIGMOD Conference Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. H. V. Jagadish,Laks V. S. Lakshmanan,Divesh Srivastava 1999 Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month.
In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes. Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in the number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which, when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data. SIGMOD Conference Belief Reasoning in MLS Deductive Databases. Hasan M. Jamil 1999 It is envisaged that the application of the multilevel security (MLS) scheme will enhance flexibility and effectiveness of authorization policies in shared enterprise databases and will replace cumbersome authorization enforcement practices through complicated view definitions on a per user basis. However, as advances in this area are being made and ideas crystallized, the concomitant weaknesses of the MLS databases are also surfacing. We insist that the critical problem with the current model is that the belief at a higher security level is cluttered with irrelevant or inconsistent data as no mechanism for attenuation is supported. Critics also argue that it is imperative for MLS database users to theorize about the belief of others, perhaps at different security levels, an apparatus that is currently missing and the absence of which is seriously felt. The impetus for our current research is this need to provide an adequate framework for belief reasoning in MLS databases. We demonstrate that a prudent application of the concept of inheritance in a deductive database setting will help capture the notion of declarative belief and belief reasoning in MLS databases in an elegant way. To this end, we develop a function to compute belief in multiple modes which can be used to reason about the beliefs of other users. We strive to develop a poised and practical logical characterization of MLS databases for the first time based on the inherently difficult concept of non-monotonic inheritance. We present an extension of the acclaimed Datalog language, called MultiLog, and show that Datalog is a special case of our language. We also suggest an implementation scheme for MultiLog as a front-end for CORAL. SIGMOD Conference Improving OLTP Data Quality Using Data Warehouse Mechanisms. Matthias Jarke,Christoph Quix,Guido Blees,Dirk Lehmann,Gunter Michalk,Stefan Striel 1999 Research and products for the integration of heterogeneous legacy source databases in data warehousing have addressed numerous data quality problems in or between the sources.
Such a solution is marketed by Team4 for the decision support of mobile sales representatives, using advanced view maintenance and replication management techniques in an environment based on relational data warehouse technology and Lotus Notes-based client systems. However, considering total information supply chain management, the capture of poor operational data, to be cleaned later in the data warehouse, appears sub-optimal. Based on the observation that decision support clients are often closely linked to operational data entry, we have addressed the problem of mapping the data warehouse data quality techniques back to data quality measures for improving OLTP data. The solution requires a warehouse-to-OLTP workflow which employs a combination of view maintenance and view update techniques. SIGMOD Conference Indexing Medium-dimensionality Data in Oracle. Kothuri Venkata Ravi Kanth,Siva Ravada,Jayant Sharma,Jay Banerjee 1999 Indexing Medium-dimensionality Data in Oracle. SIGMOD Conference Efficient Geometry-based Similarity Search of 3D Spatial Databases. Daniel A. Keim 1999 Searching a database of 3D-volume objects for objects which are similar to a given 3D search object is an important problem which arises in a number of database applications — for example, in Medicine and CAD. In this paper, we present a new geometry-based solution to the problem of searching for similar 3D-volume objects. The problem is motivated by a real application in the medical domain where volume similarity is used as a basis for surgery decisions. Our solution for an efficient similarity search on large databases of 3D volume objects is based on a new geometric index structure. The basic idea of our new approach is to use the concept of hierarchical approximations of the 3D objects to speed up the search process. We formally show the correctness of our new approach and introduce two instantiations of our general idea, which are based on cuboid and octree approximations. We finally provide a performance evaluation of our new index structure revealing significant performance improvements over existing approaches. SIGMOD Conference EMC Information Sharing: Direct Access to MVS Data from Unix and NT. Walt Kohler 1999 "In this extended abstract we briefly describe EMC's information sharing technology that enables UNIX and NT systems to directly access MVS mainframe datasets and how this technology can be used to directly access an MVS DB2 database." SIGMOD Conference DynaMat: A Dynamic View Management System for Data Warehouses. Yannis Kotidis,Nick Roussopoulos 1999 "Pre-computation and materialization of views with aggregate functions is a common technique in Data Warehouses. Due to the complex structure of the warehouse and the different profiles of the users who submit queries, there is a need for tools that will automate the selection and management of the materialized data. In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel “goodness” measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints.
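The greedy sketch below illustrates selecting views under a space budget; the goodness function is a placeholder supplied by the caller for illustration and is not DynaMat's actual measure.

    # Pick views by goodness per unit of space until the budget is exhausted.
    def select_views(candidates, space_budget, goodness):
        # candidates: list of (view_id, size) pairs
        chosen, used = [], 0
        ranked = sorted(candidates, key=lambda c: goodness(c[0]) / c[1], reverse=True)
        for view_id, size in ranked:
            if used + size <= space_budget:
                chosen.append(view_id)
                used += size
        return chosen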
During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. We compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. The comparison is made based on a new metric, the Detailed Cost Savings Ratio introduced for quantifying the benefits of view materialization against incoming queries. These experiments show that DynaMat's dynamic view selection outperforms the optimal static view selection and thus, any sub-optimal static algorithm that has appeared in the literature." SIGMOD Conference Bringing Object-Relational Technology to Mainstream. Vishu Krishnamurthy,Sandeepan Banerjee,Anil Nori 1999 Bringing Object-Relational Technology to Mainstream. SIGMOD Conference Shrinking the Warehouse Update Window. Wilburt Labio,Ramana Yerneni,Hector Garcia-Molina 1999 Warehouse views need to be updated when source data changes. Due to the constantly increasing size of warehouses and the rapid rates of change, there is increasing pressure to reduce the time taken for updating the warehouse views. In this paper we focus on reducing this “update window” by minimizing the work required to compute and install a batch of updates. Various strategies have been proposed in the literature for updating a single warehouse view. These algorithms typically cannot be extended to come up with good strategies for updating an entire set of views. We develop an efficient algorithm that selects an optimal update strategy for any single warehouse view. Based on this algorithm, we develop an algorithm for selecting strategies to update a set of views. The performance of these algorithms is studied with experiments involving warehouse views based on TPC-D queries. SIGMOD Conference Optimization of Constrained Frequent Set Queries with 2-variable Constraints. Laks V. S. Lakshmanan,Raymond T. Ng,Jiawei Han,Alex Pang 1999 Currently, there is tremendous interest in providing ad-hoc mining capabilities in database management systems. As a first step towards this goal, in [15] we proposed an architecture for supporting constraint-based, human-centered, exploratory mining of various kinds of rules including associations, introduced the notion of constrained frequent set queries (CFQs), and developed effective pruning optimizations for CFQs with 1-variable (1-var) constraints. While 1-var constraints are useful for constraining the antecedent and consequent separately, many natural examples of CFQs illustrate the need for constraining the antecedent and consequent jointly, for which 2-variable (2-var) constraints are indispensable. Developing pruning optimizations for CFQs with 2-var constraints is the subject of this paper. But this is a difficult problem because: (i) in 2-var constraints, both variables keep changing and, unlike 1-var constraints, there is no fixed target for pruning; (ii) as we show, “conventional” monotonicity-based optimization techniques do not apply effectively to 2-var constraints. The contributions are as follows. (1) We introduce a notion of quasi-succinctness, which allows a quasi-succinct 2-var constraint to be reduced to two succinct 1-var constraints for pruning. (2) We characterize the class of 2-var constraints that are quasi-succinct. (3) We develop heuristic techniques for non-quasi-succinct constraints. Experimental results show the effectiveness of all our techniques. 
(4) We propose a query optimizer for CFQs and show that for a large class of constraints, the computation strategy generated by the optimizer is ccc-optimal, i.e., minimizing the effort incurred w.r.t. constraint checking and support counting. SIGMOD Conference Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. Ju-Hong Lee,Deok-Hwan Kim,Chin-Wan Chung 1999 The database query optimizer requires the estimation of the query selectivity to find the most efficient access plan. For queries referencing multiple attributes from the same relation, we need a multi-dimensional selectivity estimation technique when the attributes are dependent on each other because the selectivity is determined by the joint data distribution of the attributes. Additionally, for multimedia databases, there are intrinsic requirements for the multi-dimensional selectivity estimation because feature vectors are stored in multi-dimensional indexing trees. In the 1-dimensional case, a histogram is practically the most preferable method. In the multi-dimensional case, however, a histogram is not adequate because of high storage overhead and high error rates. In this paper, we propose a novel approach for the multi-dimensional selectivity estimation. Compressed information from a large number of small-sized histogram buckets is maintained using the discrete cosine transform. This enables low error rates and low storage overheads even in high dimensions. In addition, this approach has the advantage of supporting dynamic data updates by eliminating the overhead for periodical reconstructions of the compressed information. Extensive experimental results show advantages of the proposed approach. SIGMOD Conference Logical Logging to Extend Recovery to New Domains. David B. Lomet,Mark R. Tuttle 1999 "Recovery can be extended to new domains at reduced logging cost by exploiting “logical” log operations. During recovery, a logical log operation may read data values from any recoverable object, not solely from values on the log or from the updated object. Hence, we needn't log these values, a substantial saving. In [8], we developed a redo recovery theory that deals with general log operations and proved that the stable database remains recoverable when it is explained in terms of an installation graph. This graph was used to derive a write graph that determines a flush order for cached objects that ensures that the database remains recoverable. In this paper, we introduce a refined write graph that permits more flexible cache management that flushes smaller sets of objects. Using this write graph, we show how: (i) the cache manager can inject its own operations to break up atomic flush sets; and (ii) the recovery process can avoid redoing operations whose effects aren't needed by exploiting generalized recovery LSNs. These advances permit more cost-effective recovery for, e.g., files and applications." SIGMOD Conference PowerBookmarks: A System for Personalizable Web Information Organization, Sharing, and Management. Wen-Syan Li,Quoc Vu,Edward Y. Chang,Divyakant Agrawal,Kyoji Hirata,Sougata Mukherjea,Yi-Leh Wu,Corey Bufi,Kevin Chen-Chuan Chang,Yoshinori Hara,Reiko Ito,Yutaka Kimura,Kazuyuki Shimazu,Yukiyoshi Saito 1999 PowerBookmarks: A System for Personalizable Web Information Organization, Sharing, and Management. SIGMOD Conference An XML-based Wrapper Generator for Web Information Extraction. Ling Liu,Wei Han,David Buttler,Calton Pu,Wei Tang 1999 An XML-based Wrapper Generator for Web Information Extraction.
SIGMOD Conference Integration of Spatial Join Algorithms for Processing Multiple Inputs. Nikos Mamoulis,Dimitris Papadias 1999 Several techniques that compute the join between two spatial datasets have been proposed during the last decade. Among these methods, some consider existing indices for the joined inputs, while others treat datasets with no index, providing solutions for the case where at least one input comes as an intermediate result of another database operator. In this paper we analyze previous work on spatial joins and propose a novel algorithm, called slot index spatial join (SISJ), that efficiently computes the spatial join between two inputs, only one of which is indexed by an R-tree. Going one step further, we show how SISJ and other spatial join algorithms can be implemented as operators in a database environment that joins more than two spatial datasets. We study the differences between relational and spatial multiway joins, and propose a dynamic programming algorithm that optimizes the execution of complex spatial queries. SIGMOD Conference Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Gurmeet Singh Manku,Sridhar Rajagopalan,Bruce G. Lindsay 1999 In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median. SIGMOD Conference Query Processing Techniques for Arrays. Arunprasad P. Marathe,Kenneth Salem 1999 Arrays are an appropriate data model for images, gridded output from computational models, and other types of data. This paper describes an approach to array query processing. Queries are expressed in AML, a logical algebra that is easily extended with user-defined functions to support a wide variety of array operations. For example, compression, filtering, and algebraic operations on images can be described. We show how AML expressions involving such operations can be treated declaratively and subjected to useful rewrite optimizations. We also describe a plan generator that produces efficient iterator-based plans from rewritten AML expressions. SIGMOD Conference Client-Site Query Extensions. Tobias Mayr,Praveen Seshadri 1999 "We explore the execution of queries with client-site user-defined functions (UDFs). 
Many UDFs can only be executed at the client site, for reasons of scalability, security, confidentiality, or availability of resources. How should a query with client-site UDFs be executed? We demonstrate that the standard execution technique for server-site UDFs performs poorly. Instead, we adapt well-known distributed database algorithms and apply them to client-site UDFs. The resulting query execution techniques are implemented in the Cornell Predator database system, and we present performance results to demonstrate their effectiveness. We also reconsider the question of query optimization in the context of client-site UDFs. The known techniques for expensive UDFs are inadequate because they do not take the location of the UDF into account. We present an extension of traditional 'System-R' optimizers that suitably optimize queries with client-site operations." SIGMOD Conference A Database Perspective on Lotus Domino/Notes. C. Mohan 1999 In this one-page summary, I introduce the database aspects of Lotus Domino/Notes. These database features are covered in detail in the corresponding SIGMOD99 tutorial available at www.almaden.ibm.com/u/mohan/domino_sigmod99.ps. SIGMOD Conference WALRUS: A Similarity Retrieval Algorithm for Image Databases. Apostol Natsev,Rajeev Rastogi,Kyuseok Shim 1999 "Approaches for content-based image querying typically extract a single signature from each image based on color, texture, or shape features. The images returned as the query result are then the ones whose signatures are closest to the signature of the query image. While efficient for simple images, such methods do not work well for complex scenes since they fail to retrieve images that match the query only partially, that is, only certain regions of the image match. This inefficiency leads to the discarding of images that may be semantically very similar to the query image since they may contain the same objects. The problem becomes even more apparent when we consider scaled or translated versions of the similar objects. In this paper, we propose WALRUS (WAveLet-based Retrieval of User-specified Scenes), a novel similarity retrieval algorithm that is robust to scaling and translation of objects within an image. WALRUS employs a novel similarity model in which each image is first decomposed into its regions and the similarity measure between a pair of images is then defined to be the fraction of the area of the two images covered by matching regions from the images. In order to extract regions for an image, WALRUS considers sliding windows of varying sizes and then clusters them based on the proximity of their signatures. An efficient dynamic programming algorithm is used to compute wavelet-based signatures for the sliding windows. Experimental results on real-life data sets corroborate the effectiveness of WALRUS's similarity model." SIGMOD Conference Exploratory Mining via Constrained Frequent Set Queries. Raymond T. Ng,Laks V. S. Lakshmanan,Jiawei Han,Teresa Mah 1999 Although there have been many studies on data mining, to date there have been few research prototypes or commercial systems supporting comprehensive query-driven mining, which encourages interactive exploration of the data. Our thesis is that constraint constructs and the optimization they induce play a pivotal role in mining queries, thus substantially enhancing the usefulness and performance of the mining system.
This is based on an analogy with declarative query languages like SQL and the query optimization they enable, which have made relational databases so successful. To this end, our proposed demo is not of yet another data mining system, but of a new paradigm in data mining - mining with constraints - as the important first step towards supporting ad-hoc mining in DBMSs. In this demo, we will show a prototype exploratory mining system that implements constraint-based mining query optimization methods proposed in [5]. We will demonstrate how a user can interact with the system for exploratory data mining and how efficiently the system may execute optimized data mining queries. The prototype system will include all the constraint pushing techniques for mining association rules outlined in [5], and will include additional capabilities for mining other kinds of rules for which the computation of constrained frequent sets forms the core first step. SIGMOD Conference Microsoft Site Server (Commerce Edition). Bassel Ojjeh 1999 Microsoft Site Server (Commerce Edition). SIGMOD Conference Data Management Issues in Electronic Commerce (Panel). M. Tamer Özsu 1999 Data Management Issues in Electronic Commerce (Panel). SIGMOD Conference Query Rewriting for Semistructured Data. Yannis Papakonstantinou,Vasilis Vassalos 1999 We address the problem of query rewriting for TSL, a language for querying semistructured data. We develop and present an algorithm that, given a semistructured query q and a set of semistructured views V, finds rewriting queries, i.e., queries that access the views and produce the same result as q. Our algorithm is based on appropriately generalizing containment mappings, the chase, and query composition — techniques that were developed for structured, relational data. We also develop an algorithm for equivalence checking of TSL queries. We show that the algorithm is sound and complete for TSL, i.e., it always finds every non-trivial TSL rewriting query of q, and we discuss its complexity. We extend the rewriting algorithm to use some forms of structural constraints (such as DTDs) and find more opportunities for query rewriting. SIGMOD Conference E-Commerce Database Issues and Experience. Anand Rajaraman 1999 E-Commerce Database Issues and Experience. SIGMOD Conference The Active MultiSync Controller of the Cubetree Storage Organization. Nick Roussopoulos,Yannis Kotidis,Yannis Sismanis 1999 The Cubetree Storage Organization (CSO) logically and physically clusters materialized-view data, multi-dimensional indices on them, and computed aggregate values all in one compact and tight storage structure that uses a fraction of the conventional table-based space. This is a breakthrough technology for storing and accessing multi-dimensional data in terms of storage reduction, query performance and incremental bulk update speed. CSO has been extended with an Active MultiSync controller for synchronizing multiple concurrent access and continuous asynchronous online updates for a non-stop data warehouse. SIGMOD Conference "``Honey, I Shrunk the DBMS'': Footprint, Mobility, and Beyond (Panel)." Praveen Seshadri 1999 "``Honey, I Shrunk the DBMS'': Footprint, Mobility, and Beyond (Panel)." SIGMOD Conference SERF: ODMG-Based Generic Re-structuring Facility. Elke A. Rundensteiner,Kajal T. Claypool,Ming Li,Li Chen,Xin Zhang,Chandrakant Natarajan,Jing Jin,Stacia De Lima,S.
Weiner 1999 The age of information management and with it the advent of increasingly sophisticated technologies have kindled a need in the database community and others to re-structure existing systems and move forward to make use of these new technologies. Legacy application systems are being transformed to newer state-of-the-art systems, information sources are being mapped from one data model to another, and a diversity of data sources are being transformed to load, cleanse and consolidate data into modern data-warehouses [CR99]. Re-structuring is thus a critical task for a variety of applications. For this reason, most object-oriented database systems (OODB) today offer some form of re-structuring support [Tec94, Obj93, BKKK87]. This existing support of current OODBs [BKKK87, Tec94, Obj93] is limited to a pre-defined taxonomy of simple fixed-semantic schema evolution operations. However, such simple changes, typically to individual types only, are not sufficient for many advanced applications [Bré96]. More radical changes, such as combining two types or redefining the relationship between two types, are either very difficult or even impossible to achieve with current commercial database technology [Tec94, Obj93]. In fact, most OODBs would typically require the user to write ad-hoc programs to accomplish such transformations. Research that has begun to look into the issue of complex changes [Bré96, Ler96] is still limited to providing a fixed set of selected (even if now more complex) operations. To address these limitations of the current restructuring technology, we have proposed the SERF framework which aims at providing a rich environment for doing complex user-defined transformations flexibly, easily and correctly [CJR98b]. The goal of our work is to increase the usability and utility of the SERF framework and its applicability to re-structuring problems beyond OODB evolution. Towards that end, we provide re-usable transformations via the notion of SERF Templates that can be packaged into libraries, thereby increasing the portability of these transformations. We also now have a first cut at providing an assurance of consistency for the users of this system, and a semantic optimizer that provides some performance improvements via enhanced query optimization techniques with emphasis on the re-structuring primitives [CNR99]. In this demo we give an overview of the SERF framework, its current status and the enhancements that are planned for the future. We also present an example of the application of SERF to a domain other than schema evolution, i.e., web restructuring. SIGMOD Conference Evolvable View Environment (EVE): Non-Equivalent View Maintenance under Schema Changes. Elke A. Rundensteiner,Andreas Koeller,Xin Zhang,Amber van Wyk,Yong Li,Amy J. Lee,Anisoara Nica 1999 Evolvable View Environment (EVE): Non-Equivalent View Maintenance under Schema Changes. SIGMOD Conference Efficient Concurrency Control for Broadcast Environments. Jayavel Shanmugasundaram,Arvind Nithrakashyap,Rajendran M. Sivasankaran,Krithi Ramamritham 1999 A crucial consideration in environments where data is broadcast to clients is the low bandwidth available for clients to communicate with servers. Advanced applications in such environments do need to read data that is mutually consistent as well as current. However, given the asymmetric communication capabilities and the needs of clients in mobile environments, traditional serializability-based approaches are too restrictive, unnecessary, and impractical.
We thus propose the use of a weaker correctness criterion called update consistency and outline mechanisms based on this criterion that ensure (1) the mutual consistency of data maintained by the server and read by clients, and (2) the currency of data read by clients. Using these mechanisms, clients can obtain data that is current and mutually consistent “off the air”, i.e., without contacting the server to, say, obtain locks. Experimental results show a substantial reduction in response times as compared to existing (serializability-based) approaches. A further attractive feature of the approach is that if caching is possible at a client, weaker forms of currency can be obtained while still satisfying the mutual consistency of data. SIGMOD Conference In-Memory Data Management for Consumer Transactions The Times-Ten Approach. Times-Ten Team 1999 In-Memory Data Management for Consumer Transactions The Times-Ten Approach. SIGMOD Conference Managing Web Data. Dan Suciu 1999 Managing Web Data. SIGMOD Conference Data Integration and Warehousing in Telecom Italia. Stefano Trisolini,Maurizio Lenzerini,Daniele Nardi 1999 We discuss the main methodological and technological issues that have arisen in recent years in the development of the enterprise integrated database of Telecom Italia and, subsequently, in the management of the primary data store for Telecom Italia data warehouse applications. The two efforts, although driven by different needs and requirements, can be regarded as a continuous development of an integrated view of the enterprise data. We review the experience accumulated in the integration of over 50 internal databases, highlighting the benefits and drawbacks of this scenario for data warehousing, and discuss the development of a large dedicated data store to support the analysis of data about customers and phone traffic. SIGMOD Conference Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. Jeffrey Scott Vitter,Min Wang 1999 Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy.
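A minimal one-dimensional Haar-wavelet sketch of the compaction idea appears below; the paper's construction handles sparse multidimensional arrays with I/O-efficient algorithms, so this is only an illustration, and the input array and the number k of retained coefficients are assumptions.

    # Unnormalized Haar decomposition of an array whose length is a power of two.
    def haar_decompose(data):
        coeffs, details = list(data), []
        while len(coeffs) > 1:
            pairs = list(zip(coeffs[0::2], coeffs[1::2]))
            details = [(a - b) / 2 for a, b in pairs] + details   # detail coefficients
            coeffs = [(a + b) / 2 for a, b in pairs]              # averages
        return coeffs + details    # [overall average] followed by details, coarse to fine

    # Keep only the k largest-magnitude coefficients (the compact representation).
    def compact(coefficients, k):
        top = sorted(range(len(coefficients)),
                     key=lambda i: abs(coefficients[i]), reverse=True)[:k]
        return {i: coefficients[i] for i in top}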
Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling. SIGMOD Conference The WASA2 Object-Oriented Workflow Management System. Gottfried Vossen,Mathias Weske 1999 The WASA2 Object-Oriented Workflow Management System. SIGMOD Conference DOMINO: Databases fOr MovINg Objects tracking. Ouri Wolfson,A. Prasad Sistla,Bo Xu,Jutai Zhou,Sam Chamberlain 1999 Consider a database that represents information about moving objects and their location. For example, for a database representing the location of taxi-cabs a typical query may be: retrieve the free cabs that are currently within 1 mile of 33 N. Michigan Ave., Chicago (to pick-up a customer); or for a trucking company database a typical query may be: retrieve the trucks that are currently within 1 mile of truck ABT312 (which needs assistance); or for a database representing the current location of objects in a battlefield a typical query may be: retrieve the friendly helicopters that are in a given region, or, retrieve the friendly helicopters that are expected to enter the region within the next 10 minutes. The queries may originate from the moving objects, or from stationary users. We will refer to applications with the above characteristics as moving-objects-database (MOD) applications, and to queries such as the ones mentioned above as MOD queries. In the military, MOD applications arise in the context of the digital battlefield (see [1]), and in the civilian industry they arise in transportation systems. For example, Omnitracs, developed by Qualcomm (see [2]), is a commercial system used by the transportation industry, which enables MOD functionality. It provides location management by connecting vehicles (e.g. trucks), via satellites, to company databases. The vehicles are equipped with a Global Positioning System (GPS), and they automatically and periodically report their location. SIGMOD Conference Computing Capabilities of Mediators. Ramana Yerneni,Chen Li,Hector Garcia-Molina,Jeffrey D. Ullman 1999 Existing data-integration systems based on the mediation architecture employ a variety of mechanisms to describe the query-processing capabilities of sources. However, these systems do not compute the capabilities of the mediators based on the capabilities of the sources they integrate. In this paper, we propose a framework to capture a rich variety of query-processing capabilities of data sources and mediators. We present algorithms to compute the set of supported queries of a mediator, based on the capability limitations of its sources. Our algorithms take into consideration a variety of query-processing techniques employed by mediators to enhance the set of supported queries. SIGMOD Conference TAM: A System for Dynamic Transactional Activity Management. Tong Zhou,Ling Liu,Calton Pu 1999 TAM: A System for Dynamic Transactional Activity Management. VLDB XML Repository and Active Views Demonstration. Serge Abiteboul,Vincent Aguilera,Sébastien Ailleret,Bernd Amann,Sophie Cluet,Brendan Hills,Frédéric Hubert,Jean-Claude Mamou,Amélie Marian,Laurent Mignet,Tova Milo,Cassio Souza dos Santos,Bruno Tessier,Anne-Marie Vercoustre 1999 XML Repository and Active Views Demonstration. VLDB Active Views for Electronic Commerce. Serge Abiteboul,Bernd Amann,Sophie Cluet,Anat Eyal,Laurent Mignet,Tova Milo 1999 Active Views for Electronic Commerce. VLDB Aqua: A Fast Decision Support Systems Using Approximate Query Answers.
Swarup Acharya,Phillip B. Gibbons,Viswanath Poosala 1999 Aqua: A Fast Decision Support Systems Using Approximate Query Answers. VLDB DBMSs on a Modern Processor: Where Does Time Go? Anastassia Ailamaki,David J. DeWitt,Mark D. Hill,David A. Wood 1999 DBMSs on a Modern Processor: Where Does Time Go? VLDB A Scalable and Highly Available Networked Database Architecture. Roger Bamford,Rafiul Ahad,Angelo Pruscino 1999 A Scalable and Highly Available Networked Database Architecture. VLDB Context-Based Prefetch for Implementing Objects on Relations. Philip A. Bernstein,Shankar Pal,David Shutt 1999 Context-Based Prefetch for Implementing Objects on Relations. VLDB Spatio-Temporal Retrieval with RasDaMan. Peter Baumann,Andreas Dehmel,Paula Furtado,Roland Ritsch,Norbert Widmann 1999 Spatio-Temporal Retrieval with RasDaMan. VLDB Microsoft English Query 7.5: Automatic Extraction of Semantics from Relational Databases and OLAP Cubes. Adam Blum 1999 Microsoft English Query 7.5: Automatic Extraction of Semantics from Relational Databases and OLAP Cubes. VLDB Database Architecture Optimized for the New Bottleneck: Memory Access. Peter A. Boncz,Stefan Manegold,Martin L. Kersten 1999 Database Architecture Optimized for the New Bottleneck: Memory Access. VLDB The New Locking, Logging, and Recovery Architecture of Microsoft SQL Server 7.0. David Campbell 1999 The New Locking, Logging, and Recovery Architecture of Microsoft SQL Server 7.0. VLDB O-O, What Have They Done to DB2? Michael J. Carey,Donald D. Chamberlin,Srinivasa Narayanan,Bennet Vance,Doug Doole,Serge Rielau,Richard Swagerman,Nelson Mendonça Mattos 1999 O-O, What Have They Done to DB2? VLDB Miro Web: Integrating Multiple Data Sources through Semistructured Data Types. Luc Bouganim,Tatiana Chan-Sine-Ying,Tuyet-Tram Dang-Ngoc,Jean-Luc Darroux,Georges Gardarin,Fei Sha 1999 Miro Web: Integrating Multiple Data Sources through Semistructured Data Types. VLDB Active Storage Hierarchy, Database Systems and Applications - Socratic Exegesis. "Felipe Cariño,William O'Connell,John Burgess,Joel H. Saltz" 1999 Active Storage Hierarchy, Database Systems and Applications - Socratic Exegesis. VLDB Issues in Network Management in the Next Millennium. Michael L. Brodie,Surajit Chaudhuri 1999 Issues in Network Management in the Next Millennium. VLDB Data-Driven, One-To-One Web Site Generation for Data-Intensive Applications. Stefano Ceri,Piero Fraternali,Stefano Paraboschi 1999 Data-Driven, One-To-One Web Site Generation for Data-Intensive Applications. VLDB Distributed Hypertext Resource Discovery Through Examples. Soumen Chakrabarti,Martin van den Berg,Byron Dom 1999 Distributed Hypertext Resource Discovery Through Examples. VLDB Hierarchical Prefix Cubes for Range-Sum Queries. Chee Yong Chan,Yannis E. Ioannidis 1999 Hierarchical Prefix Cubes for Range-Sum Queries. VLDB Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Qi Cheng,Jarek Gryz,Fred Koo,T. Y. Cliff Leung,Linqi Liu,Xiaoyan Qian,K. Bernhard Schiefer 1999 Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. VLDB Evaluating Top- Selection Queries. Surajit Chaudhuri,Luis Gravano 1999 Evaluating Top- Selection Queries. VLDB Comparing Hierarchical Data in External Memory. Sudarshan S. Chawathe 1999 Comparing Hierarchical Data in External Memory. VLDB High Level Indexing of User-Defined Types. Weidong Chen,Jyh-Herng Chow,You-Chin Fuh,Jean Grandbois,Michelle Jou,Nelson Mendonça Mattos,Brian T. 
Tran,Yun Wang 1999 High Level Indexing of User-Defined Types. VLDB Physical Data Independence, Constraints, and Optimization with Universal Plans Alin Deutsch,Lucian Popa,Val Tannen 1999 Physical Data Independence, Constraints, and Optimization with Universal Plans VLDB VOODB: A Generic Discrete-Event Random Simulation Model To Evaluate the Performances of OODBs. Jérôme Darmont,Michel Schneider 1999 VOODB: A Generic Discrete-Event Random Simulation Model To Evaluate the Performances of OODBs. VLDB Probabilistic Optimization of Top N Queries. Donko Donjerkovic,Raghu Ramakrishnan 1999 Probabilistic Optimization of Top N Queries. VLDB Curio: A Novel Solution for Efficient Storage and Indexing in Data Warehouses. Anindya Datta,Krithi Ramamritham,Helen M. Thomas 1999 Curio: A Novel Solution for Efficient Storage and Indexing in Data Warehouses. VLDB Industrial Panel on Data Warehousing Technologies: Experiences, Challenges, and Directions. Umeshwar Dayal 1999 Industrial Panel on Data Warehousing Technologies: Experiences, Challenges, and Directions. VLDB Capturing and Querying Multiple Aspects of Semistructured Data. Curtis E. Dyreson,Michael H. Böhlen,Christian S. Jensen 1999 Capturing and Querying Multiple Aspects of Semistructured Data. VLDB What Do Those Weird XML Types Want, Anyway? Steven J. DeRose 1999 What Do Those Weird XML Types Want, Anyway? VLDB SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. Minos N. Garofalakis,Rajeev Rastogi,Kyuseok Shim 1999 SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB Optimization of Run-time Management of Data Intensive Web-sites. Daniela Florescu,Alon Y. Levy,Dan Suciu,Khaled Yagoub 1999 Optimization of Run-time Management of Data Intensive Web-sites. VLDB Implementation of SQL3 Structured Types with Inheritance and Value Substitutability. You-Chin Fuh,Stefan Deßloch,Weidong Chen,Nelson Mendonça Mattos,Brian T. Tran,Bruce G. Lindsay,Linda DeMichel,Serge Rielau,Danko Mannhaupt 1999 Implementation of SQL3 Structured Types with Inheritance and Value Substitutability. VLDB Similarity Search in High Dimensions via Hashing. Aristides Gionis,Piotr Indyk,Rajeev Motwani 1999 Similarity Search in High Dimensions via Hashing. VLDB GHOST: Fine Granularity Buffering of Indexes. Cheng Hian Goh,Beng Chin Ooi,D. Sim,Kian-Lee Tan 1999 GHOST: Fine Granularity Buffering of Indexes. VLDB The Value of Merge-Join and Hash-Join in SQL Server. Goetz Graefe 1999 The Value of Merge-Join and Hash-Join in SQL Server. VLDB Loading a Cache with Query Results. Laura M. Haas,Donald Kossmann,Ioana Ursu 1999 Loading a Cache with Query Results. VLDB Networked Data Management Design Points. James R. Hamilton 1999 Networked Data Management Design Points. VLDB PM3: An Orthogonal Persistent Systems Programming Language - Design, Implementation, Performance. Antony L. Hosking,Jiawan Chen 1999 PM3: An Orthogonal Persistent Systems Programming Language - Design, Implementation, Performance. VLDB Histogram-Based Approximation of Set-Valued Query-Answers. Yannis E. Ioannidis,Viswanath Poosala 1999 Histogram-Based Approximation of Set-Valued Query-Answers. VLDB User-Defined Table Operators: Enhancing Extensibility for ORDBMS. Michael Jaedicke,Bernhard Mitschang 1999 User-Defined Table Operators: Enhancing Extensibility for ORDBMS. VLDB Multi-Dimensional Substring Selectivity Estimation. H. V. Jagadish,Olga Kapitskaia,Raymond T. Ng,Divesh Srivastava 1999 Multi-Dimensional Substring Selectivity Estimation. 
VLDB What can Hierarchies do for Data Warehouses? H. V. Jagadish,Laks V. S. Lakshmanan,Divesh Srivastava 1999 What can Hierarchies do for Data Warehouses? VLDB Semantic Compression and Pattern Extraction with Fascicles. H. V. Jagadish,J. Madar,Raymond T. Ng 1999 Semantic Compression and Pattern Extraction with Fascicles. VLDB A Novel Index Supporting High Volume Data Warehouse Insertion. Chris Jermaine,Anindya Datta,Edward Omiecinski 1999 A Novel Index Supporting High Volume Data Warehouse Insertion. VLDB Performance Measurements of Compressed Bitmap Indices. Theodore Johnson 1999 Performance Measurements of Compressed Bitmap Indices. VLDB Integrating Heterogenous Overlapping Databases through Object-Oriented Transformations. Vanja Josifovski,Tore Risch 1999 Integrating Heterogenous Overlapping Databases through Object-Oriented Transformations. VLDB Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. Alexander Hinneburg,Daniel A. Keim 1999 Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. VLDB Generalised Hash Teams for Join and Group-by. Alfons Kemper,Donald Kossmann,Christian Wiesner 1999 Generalised Hash Teams for Join and Group-by. VLDB Finding Intensional Knowledge of Distance-Based Outliers. Edwin M. Knorr,Raymond T. Ng 1999 Finding Intensional Knowledge of Distance-Based Outliers. VLDB Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-size Estimation. Arnd Christian König,Gerhard Weikum 1999 Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-size Estimation. VLDB High-Performance Extensible Indexing. Marcel Kornacker 1999 High-Performance Extensible Indexing. VLDB Mining Deviants in a Time Series Database. H. V. Jagadish,Nick Koudas,S. Muthukrishnan 1999 Mining Deviants in a Time Series Database. VLDB Extracting Large-Scale Knowledge Bases from the Web. Ravi Kumar,Prabhakar Raghavan,Sridhar Rajagopalan,Andrew Tomkins 1999 Extracting Large-Scale Knowledge Bases from the Web. VLDB On Efficiently Implementing SchemaSQL on an SQL Database System. Laks V. S. Lakshmanan,Fereidoon Sadri,Subbu N. Subramanian 1999 On Efficiently Implementing SchemaSQL on an SQL Database System. VLDB Unrolling Cycles to Decide Trigger Termination. Sin Yeung Lee,Tok Wang Ling 1999 Unrolling Cycles to Decide Trigger Termination. VLDB Aggregation Algorithms for Very Large Compressed Data Warehouses. Jianzhong Li,Doron Rotem,Jaideep Srivastava 1999 Aggregation Algorithms for Very Large Compressed Data Warehouses. VLDB Query Optimization for XML. Jason McHugh,Jennifer Widom 1999 Query Optimization for XML. VLDB Repeating History Beyond ARIES. C. Mohan 1999 Repeating History Beyond ARIES. VLDB Quality-driven Integration of Heterogenous Information Systems. Felix Naumann,Ulf Leser,Johann Christoph Freytag 1999 Quality-driven Integration of Heterogenous Information Systems. VLDB Generating Call-Level Interfaces for Advanced Database Application Programming. Udo Nink,Theo Härder,Norbert Ritter 1999 Generating Call-Level Interfaces for Advanced Database Application Programming. VLDB The Persistent Cache: Improving OID Indexing in Temporal Object-Oriented Database Systems. Kjetil Nørvåg 1999 The Persistent Cache: Improving OID Indexing in Temporal Object-Oriented Database Systems. VLDB Fast Algorithms for Maintaining Replica Consistency in Lazy Master Replicated Databases. 
Esther Pacitti,Pascale Minet,Eric Simon 1999 Fast Algorithms for Maintaining Replica Consistency in Lazy Master Replicated Databases. VLDB Extending Practical Pre-Aggregation in On-Line Analytical Processing. Torben Bach Pedersen,Christian S. Jensen,Curtis E. Dyreson 1999 Extending Practical Pre-Aggregation in On-Line Analytical Processing. VLDB Exploiting Versions for Handling Updates in Broadcast Disks. Evaggelia Pitoura,Panos K. Chrysanthis 1999 Exploiting Versions for Handling Updates in Broadcast Disks. VLDB In Cyber Space No One can Hear You Scream. Chris Pound 1999 In Cyber Space No One can Hear You Scream. VLDB Online Dynamic Reordering for Interactive Data Processing. Vijayshankar Raman,Bhaskaran Raman,Joseph M. Hellerstein 1999 We present a pipelining, dynamically user-controllable reorder operator, for use in data-intensive applications. Allowing the user to reorder the data delivery on the fly increases the interactivity in several contexts such as online aggregation and large-scale spreadsheets; it allows the user to control the processing of data by dynamically specifying preferences for different data items based on prior feedback, so that data of interest is prioritized for early processing. In this paper we describe an efficient, non-blocking mechanism for reordering, which can be used over arbitrary data streams from files, indexes, and continuous data feeds. We also investigate several policies for the reordering based on the performance goals of various typical applications. We present results from an implementation used in Online Aggregation in the Informix Dynamic Server with Universal Data Option, and in sorting and scrolling in a large-scale spreadsheet. Our experiments demonstrate that for a variety of data distributions and applications, reordering is responsive to dynamic preference changes, imposes minimal overheads in overall completion time, and provides dramatic improvements in the quality of the feedback over time. Surprisingly, preliminary experiments indicate that online reordering can also be useful in traditional batch query processing, because it can serve as a form of pipelined, approximate sorting. VLDB Cache Conscious Indexing for Decision-Support in Main Memory. Jun Rao,Kenneth A. Ross 1999 Cache Conscious Indexing for Decision-Support in Main Memory. VLDB Cost Models DO Matter: Providing Cost Information for Diverse Data Sources in a Federated System. Mary Tork Roth,Fatma Ozcan,Laura M. Haas 1999 Cost Models DO Matter: Providing Cost Information for Diverse Data Sources in a Federated System. VLDB Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. Arnaud Sahuguet,Fabien Azavant 1999 Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. VLDB Explaining Differences in Multidimensional Aggregates. Sunita Sarawagi 1999 Explaining Differences in Multidimensional Aggregates. VLDB Relational Databases for Querying XML Documents: Limitations and Opportunities. Jayavel Shanmugasundaram,Kristin Tufte,Chun Zhang,Gang He,David J. DeWitt,Jeffrey F. Naughton 1999 Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB Dynamic Load Balancing for Parallel Association Rule Mining on Heterogenous PC Cluster Systems. Masahisa Tamura,Masaru Kitsuregawa 1999 Dynamic Load Balancing for Parallel Association Rule Mining on Heterogenous PC Cluster Systems. VLDB Online Feedback for Nested Aggregate Queries with Multi-Threading. 
Kian-Lee Tan,Cheng Hian Goh,Beng Chin Ooi 1999 Online Feedback for Nested Aggregate Queries with Multi-Threading. VLDB An Adaptive Hybrid Server Architecture for Client Caching ODBMSs. Kaladhar Voruganti,M. Tamer Özsu,Ronald C. Unrau 1999 An Adaptive Hybrid Server Architecture for Client Caching ODBMSs. VLDB The Mirror MMDBMS Architecture. Arjen P. de Vries,Mark G. L. M. van Doorn,Henk M. Blanken,Peter M. G. Apers 1999 The Mirror MMDBMS Architecture. VLDB Building Hierarchical Classifiers Using Class Proximity. Ke Wang,Senqiang Zhou,Shiang Chen Liew 1999 Building Hierarchical Classifiers Using Class Proximity. VLDB Hyper-Programming in Java. Evangelos Zirintsis,Graham N. C. Kirby,Ronald Morrison 1999 Hyper-Programming in Java. VLDB Datawarehousing Has More Colours Than Just Black & White. Thomas Zurek,Markus Sinnwell 1999 Datawarehousing Has More Colours Than Just Black & White. SIGMOD Record Semantic Integration of Semistructured and Structured Data Sources. Sonia Bergamaschi,Silvana Castano,Maurizio Vincini 1999 Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In this context, two fundamental problems arise. First, how to determine if the sources contain semantically related information, that is, information related to the same or similar real-world concept(s). Second, how to handle semantic heterogeneity to support integration and uniform query interfaces. Complicating factors with respect to conventional view integration techniques are related to the fact that the sources to be integrated already exist and that semantic heterogeneity occurs on a large scale, involving terminology, structure, and context of the involved sources, with respect to geographical, organizational, and functional aspects related to information use. Moreover, to meet the requirements of global, Internet-based information systems, it is important that tools developed for supporting these activities be as semi-automatic and scalable as possible. The goal of this paper is to describe the MOMIS [4, 5] (Mediator envirOnment for Multiple Information Sources) approach to the integration and query of multiple, heterogeneous information sources, containing structured and semistructured data. MOMIS has been conceived as a joint collaboration between the Universities of Milano and Modena in the framework of the INTERDATA national research project, aiming at providing methods and tools for data management in Internet-based information systems. Like other integration projects [1, 10, 14], MOMIS follows a “semantic approach” to information integration based on the conceptual schema, or metadata, of the information sources, and on the following architectural elements: i) a common object-oriented data model, defined according to the ODLI3 language, to describe source schemas for integration purposes. The data model and ODLI3 have been defined in MOMIS as a subset of the ODMG-93 ones, following the proposal for a standard mediator language developed by the I3/POB working group [7]. In addition, ODLI3 introduces new constructors to support the semantic integration process [4, 5]; ii) one or more wrappers, to translate schema descriptions into the common ODLI3 representation; iii) a mediator and a query-processing component, based on two pre-existing tools, namely ARTEMIS [8] and ODB-Tools [3] (available on the Internet at http://sparc20.dsi.unimo.it/), to provide an I3 architecture for integration and query optimization.
In this paper, we focus on capturing and reasoning about semantic aspects of schema descriptions of heterogeneous information sources for supporting integration and query optimization. Both semistructured and structured data sources are taken into account [5]. A Common Thesaurus is constructed, which has the role of a shared ontology for the information sources. The Common Thesaurus is built by analyzing ODLI3 descriptions of the sources, by exploiting the Description Logics OLCD (Object Language with Complements allowing Descriptive cycles) [2, 6], derived from the KL-ONE family [17]. The knowledge in the Common Thesaurus is then exploited for the identification of semantically related information in ODLI3 descriptions of different sources and for their integration at the global level. Mapping rules and integrity constraints are defined at the global level to express the relationships holding between the integrated description and the source descriptions. ODB-Tools, supporting OLCD and description logic inference techniques, allows the analysis of source descriptions for generating a consistent Common Thesaurus and provides support for semantic optimization of queries at the global level, based on defined mapping rules and integrity constraints. SIGMOD Record On Views and XML. Serge Abiteboul 1999 On Views and XML. SIGMOD Record Towards Adaptive Workflow Systems, CSCW-98 Workshop Report. Abraham Bernstein,Chrysanthos Dellarocas,Mark Klein 1999 The workshop Towards Adaptive Workflow Systems was organized by the authors of this report as part of the 1998 Conference on Computer Supported Collaborative Work (CSCW-98), and was held at the Westin Seattle on Saturday, November 14, 1998. The workshop had about 30 attendees and included invited presentations, paper presentations/discussions, and a panel. This report summarizes the goals and topics of the workshop, presents the major activities, and summarizes some of the issues discussed during the workshop. SIGMOD Record Cost Estimation of User-Defined Methods in Object-Relational Database Systems. Jihad Boulos,Kinji Ono 1999 In this paper we present a novel technique for cost estimation of user-defined methods in advanced database systems. This technique is based on multi-dimensional histograms. We explain how the system collects statistics on the method that a database user defines and adds to the system. From these statistics a multi-dimensional histogram is built. Afterwards, this histogram can be used for estimating the cost of the target method whenever this method is referenced in a query. This cost estimate is needed by the optimizer of the database system, since the optimizer needs to know the cost of a method in order to place it at its optimal position in the Query Execution Plan (QEP). We explain here how our technique works and we provide an example to illustrate its functionality. SIGMOD Record The Active Database Central, ER2000, VLDB 2000, WISE 2000, Books, ACM DL. 1999 The Active Database Central, ER2000, VLDB 2000, WISE 2000, Books, ACM DL. SIGMOD Record Semantic Interoperability in Information Services: Experiencing with CoopWARE. Avigdor Gal 1999 Semantic Interoperability in Information Services: Experiencing with CoopWARE. SIGMOD Record NSF Workshop on Industrial/Academic Cooperation in Database Systems. Michael J. Carey,Leonard J. Seligman 1999 NSF Workshop on Industrial/Academic Cooperation in Database Systems. SIGMOD Record Design Principles for Data-Intensive Web Sites.
Stefano Ceri,Piero Fraternali,Stefano Paraboschi 1999 Design Principles for Data-Intensive Web Sites. SIGMOD Record "Engineering Federated Information Systems: Report of EFIS '99 Workshop." Stefan Conrad,Wilhelm Hasselbring,Uwe Hohenstein,Ralf-Detlef Kutsche,Mark Roantree,Gunter Saake,Fèlix Saltor 1999 "After the successful first International Workshop on Engineering Federated Database Systems (EFDBS'97) in Barcelona in June 1997 [CEH+ 97], the goal of this second workshop was to bring together researchers and practitioners interested in various issues in the development of federated information systems, whereby the scope has been extended to cover database and non-database information sources (the change from EFDBS to EFIS reflects this). This report provides details of the workshop content and the conclusions reached in the final discussion." SIGMOD Record An Overview and Classification of Mediated Query Systems. Ruxandra Domenig,Klaus R. Dittrich 1999 Multimedia technology, global information infrastructures and other developments allow users to access more and more information sources of various types. However, the “technical” availability alone (by means of networks, WWW, mail systems, databases, etc.) is not sufficient for making meaningful and advanced use of all information available on-line. Therefore, the problem of effectively and efficiently accessing and querying heterogeneous and distributed data sources is an important research direction. This paper aims at classifying existing approaches which can be used to query heterogeneous data sources. We consider one of the approaches — the mediated query approach — in more detail and provide a classification framework for it as well. SIGMOD Record Semantic and Pedagogic Interoperability Mechanisms in the ARIADNE Educational Repository. Eddy Forte,Florence Haenni,Ken Warkentyne,Erik Duval,Kris Cardinaels,E. Vervaet,Koen Hendrikx,Maria Wentland Forte,Florence Simillion 1999 "This paper reports on the principles underlying the semantic and pedagogic interoperability mechanisms built in the European Knowledge Pool System, developed by the European research project ARIADNE. This system, which is the central feature of ARIADNE, consists of a distributed repository of pedagogical documents (or learning objects) of diverse granularity, origin, content, type, language, etc., which are stored in view of their use (and reuse) in telematics-based training or teaching curricula. The learning objects are indexed, usually by faculty staff, according to the ARIADNE metadata set. The principles embodied in the indexation tool, which interacts directly with the repository, stem from a few theoretical ideas but foremost from empirical, pragmatic considerations, suggested by the context of actual use. They tentatively address the stringent demands for semantic and pedagogic interoperability implied by a context of rather wide cultural and linguistic diversity, as well as those stemming from the very nature of the domain application itself: education and training. Possible extensions to the educational metadata scheme developed by ARIADNE on this basis may accommodate corporate training/information needs. These extensions are briefly discussed as a means for enhancing 'semantic' interoperability between different (kinds of) corporations. Finally, the architecture of the ARIADNE system, which heavily relies on this educational metadata system, is briefly reviewed." SIGMOD Record Agent-Based Semantic Interoperability in InfoSleuth.
Jerry Fowler,Brad Perry,Marian H. Nodine,Bruce Bargmeyer 1999 Agent-Based Semantic Interoperability in InfoSleuth. SIGMOD Record Chorochronos: A Research Network for Spatiotemporal Database Systems. Andrew U. Frank,Stéphane Grumbach,Ralf Hartmut Güting,Christian S. Jensen,Manolis Koubarakis,Nikos A. Lorentzos,Yannis Manolopoulos,Enrico Nardelli,Barbara Pernici,Hans-Jörg Schek,Michel Scholl,Timos K. Sellis,Babis Theodoulidis,Peter Widmayer 1999 Chorochronos: A Research Network for Spatiotemporal Database Systems. SIGMOD Record SQL: 1999, formerly known as SQL 3. Andrew Eisenberg,Jim Melton 1999 SQL: 1999, formerly known as SQL 3. SIGMOD Record SQLJ-Part 1: SQL Routines Using the Java Programming Language. Andrew Eisenberg,Jim Melton 1999 SQLJ-Part 1: SQL Routines Using the Java Programming Language. SIGMOD Record "Editor's Notes." Michael J. Franklin 1999 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1999 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 1999 "Editor's Notes." SIGMOD Record "Report on NGITS'99: The 4th International Workshop on Next Generation Information Technologies and Systems." Opher Etzion 1999 "Report on NGITS'99: The 4th International Workshop on Next Generation Information Technologies and Systems." SIGMOD Record "Design and Management of Data Warehouses: Report on the DMDW'99 Workshop." Stella Gatziu,Manfred A. Jeusfeld,Martin Staudt,Yannis Vassiliou 1999 "Design and Management of Data Warehouses: Report on the DMDW'99 Workshop." SIGMOD Record Statement from the Treasurer. Eric N. Hanson 1999 Statement from the Treasurer. SIGMOD Record Database Research at the University of Oklahoma. Le Gruenwald,Leonard Brown,Ravi A. Dirckze,Sylvain Guinepain,Carlos Sánchez,Brian Summers,Sirirut Vanichayobon 1999 Database Research at the University of Oklahoma. SIGMOD Record FinTime - A Financial Time Series Benchmark. Kaippallimalil J. Jacob,Dennis Shasha 1999 FinTime - A Financial Time Series Benchmark. SIGMOD Record Semantic Video Indexing: Approach and Issue. Arun Hampapur 1999 Providing concept level access to video data requires video management systems tailored to the domain of the data. Effective indexing and retrieval for high-level access mandates the use of domain knowledge. This paper proposes an approach based on the use of knowledge models for building domain specific video information systems. The key issues in such systems are identified and discussed. SIGMOD Record Timer-Driven Database Triggers and Alerters: Semantics and a Challenge. Eric N. Hanson,Lloyd Noronha 1999 This paper proposes a simple model for a timer-driven triggering and alerting system. Such a system can be used with relational and object-relational database systems. Timer-driven trigger systems have a number of advantages over traditional trigger systems that test trigger conditions and run trigger actions in response to update events. They are relatively easy to implement since they can be built using a middleware program that simply runs SQL statements against a DBMS. Also, they can check certain types of conditions, such as “a value did not change” or “a value did not change by more than 10% in six months.” Such conditions may be of interest for a particular application, but cannot be checked correctly by an event-driven trigger system. Also, users may be perfectly happy being notified once a day, once a week, or even less often of certain conditions, depending on their application. Timer triggers are appropriate for these users.
The semantics of timer triggers are defined here using a simple procedure. Timer triggers are meant to complement event-driven triggers, not replace them. We challenge the database research community to develop alternate algorithms and optimizations for processing timer triggers, provided that the semantics are the same as when using the simple procedure presented here. SIGMOD Record Diluting ACID. Tim Kempster,Colin Stirling,Peter Thanisch 1999 Several DBMS vendors have implemented the ANSI standard SQL isolation levels for transaction processing. This has created a gap between database practice and textbook accounts of transaction processing which simply equate isolation with serializability. We extend the notion of conflict to cover lower isolation levels and we present improved characterisations of classes of schedules achieving these levels. SIGMOD Record Some Remarks on Variable Independence, Closure, and Orthographic Dimension in Constraint Databases. Leonid Libkin 1999 "The notion of variable independence was introduced by Chomicki, Goldin, and Kuper in their PODS'96 paper as a means of adding a limited form of aggregation to constraint query languages while retaining the closure property. Later, Grumbach, Rigoux and Segoufin showed in their ICDT'99 paper that variable independence and a related notion of orthographic dimension are useful tools for optimizing constraint queries. However, several results in those papers are incorrect as stated. As the notions of variable independence and orthographic dimension appear to be important for implementing constraint database prototypes, I explain in this short note the problems with the above mentioned papers and outline a solution for aggregate closure." SIGMOD Record Database Principles Column - Introduction. Leonid Libkin 1999 Database Principles Column - Introduction. SIGMOD Record Message from Editor-in-Chief, ACM Transactions on Database Systems. Won Kim 1999 Message from Editor-in-Chief, ACM Transactions on Database Systems. SIGMOD Record A Study on Data Point Search for HG-Trees. Joseph K. P. Kuan,Paul H. Lewis 1999 A point data retrieval algorithm for the HG-tree is introduced which reduces the number of nodes accessed. The HG-tree is a multidimensional indexing tree designed for point data and is a simple modification of the Hilbert R-tree for indexing spatial data. The HG-tree data search method mainly makes use of the Hilbert index values to search for exact data, instead of using conventional point search methods as used in most of the R-tree papers. The use of Hilbert curve values and MBRs can reduce the spatial cover of an MBR. Several R-tree variants have been developed: the R*-tree, S-tree, Hilbert R-tree, and the R*-tree combined with the linear split method by Ang et al. Our search method on the HG-tree gives superior speed performance compared to all other R-tree variants. SIGMOD Record Semantic Integration of Environmental Models for Application to Global Information Systems and Decision-Making. D. Scott Mackay 1999 Global information systems have the potential of providing decision makers with timely spatial information about earth systems. This information will come from diverse sources, including field monitoring, remotely sensed imagery, and environmental models.
Of the three, the latter has the greatest potential for providing regional and global scale information on the behavior of environmental systems, which may be vital for setting multi-governmental policy and for making decisions that are critical to quality of life. However, environmental models have limited protocols for quality control and standardization. They tend to have weak or poorly defined semantics, and so their output is often difficult to interpret outside a very limited range of applications for which they are designed. This paper considers this issue with respect to spatially distributed environmental models. A method of measuring the semantic proximity between components of large, integrated models is presented, along with an example illustrating its application. It is concluded that many of the issues associated with weak model semantics can be resolved with the addition of self-evaluating logic and context-based tools that present the semantic weaknesses to the end-user. SIGMOD Record Practical Lessons in Supporting Large-Scale Computational Science. Ron Musick,Terence Critchlow 1999 Practical Lessons in Supporting Large-Scale Computational Science. SIGMOD Record "Report on the 13th Brazilian Symposium on Database Systems (SBBD'98)." Mario A. Nascimento,Claudia Bauzer Medeiros 1999 "The Brazilian Symposium on Database Systems (SBBD) is a traditional conference in Brazil, and is sponsored by the Brazilian Computer Society. SBBD's technical program comprises the following activities: presentation of peer-reviewed full technical papers, invited talks, tutorials (either invited or selected from submissions), discussion panels, and presentation of tools." SIGMOD Record Semantic Interoperability in Global Information Systems: A Brief Introduction to the Research Area and the Special Section. Aris M. Ouksel,Amit P. Sheth 1999 Semantic Interoperability in Global Information Systems: A Brief Introduction to the Research Area and the Special Section. SIGMOD Record "Vice Chair's Message." Z. Meral Özsoyoglu 1999 "Vice Chair's Message." SIGMOD Record Contextualizing the Information Space in Federated Digital Libraries. Mike P. Papazoglou,Jeroen Hoppenbrouwers 1999 Rapid growth in the volume of documents, their diversity, and terminological variations renders federated digital libraries increasingly difficult to manage. Suitable abstraction mechanisms are required to construct meaningful and scalable document clusters, forming a cross-digital library information space for browsing and semantic searching. This paper addresses the above issues, proposes a distributed semantic framework that achieves a logical partitioning of the information space according to topic areas, and provides facilities to contextualize and landscape the available document sets in subject-specific categories. SIGMOD Record A Distributed Scientific Data Archive Using the Web, XML and SQL/MED. Mark Papiani,Jasmin L. Wason,Alistair N. Dunlop,Denis A. Nicole 1999 We have developed a web-based architecture and user interface for fast storage, searching and retrieval of large, distributed files resulting from scientific simulations. We demonstrate that the new DATALINK type defined in the draft SQL Management of External Data Standard can help to overcome problems associated with limited bandwidth when trying to archive large files using the web. We also show that separating the user interface specification from the user interface processing can provide a number of advantages.
We provide a tool to generate automatically a default user interface specification, in the form of an XML document, for a given database. This facilitates deployment of our system by users with little web or database development experience. The XML document can be customised to change the appearance of the interface. SIGMOD Record Distributed Transactions in Practice. Prabhu Ram,Lyman Do,Pamela Drew 1999 The concept of transactions and its application has found wide and often indiscriminate usage. In large enterprises, the model for distributed database applications has moved away from the client-server model to a multi-tier model with large database application software forming the middle tier. The software philosophy of “buy and not build” in large enterprises has had a major influence by extending functional requirements such as transactions and data consistency throughout the multiple tiers. In this article, we will discuss the effects of applying traditional transaction management techniques to multi-tier architectures in distributed environments. We will show the performance costs associated with distributed transactions and discuss ways by which enterprises really manage their distributed data to circumvent this performance hit. Our intent is to share our experience as an industrial customer with the database research and vendor community to create more usable and scalable designs. SIGMOD Record The OASIS Multidatabase Prototype. Mark Roantree,John Murphy,Wilhelm Hasselbring 1999 The OASIS Prototype is under development at Dublin City University in Ireland. We describe a multi-database architecture which uses the ODMG model as a canonical model and describe an extension for the construction of virtual schemas within the multidatabase system. The OMG model is used to provide a standard distribution layer for data from local databases. This takes the form of CORBA objects representing export schemas from separate data sources. SIGMOD Record First-Class Views: A Key to User-Centered Computing. Arnon Rosenthal,Edward Sciore 1999 Large database systems (e.g., federations, warehouses) are multi-layer — i.e., a combination of base databases and (virtual or physical) view databases. Smaller systems use views for layers that hide detailed physical and conceptual structures. We argue that most databases would be more effective if they were more user-centered — i.e., if they allowed users, administrators, and application developers to work mostly within their native view. To do so, we need first class views — views that support most of the metadata and operations available on source tables. First class views could also make multi-tier object architectures (based on objects in multiple tiers of servers) easier to build and maintain. The views modularize code for data services (e.g., query, security) and for coordinating changes with neighboring tiers. When data in each tier is derived declaratively, one can generate some of these methods semi-automatically. Much of the functionality required to support first class views can be generated semi-automatically, if the derivations between layers are declarative (e.g., SQL, rather than Java). We present a framework where propagation rules can be defined, allowing the flexible and incremental specification of view semantics, even by non-programmers. Finally, we describe research areas opened up by this approach. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Surajit Chaudhuri,Gösta Grahne,H. V. Jagadish,Jan Van den Bussche,Moshe Y.
Vardi 1999 Reminiscences on Influential Papers. SIGMOD Record VideoAnywhere: A System for Searching and Managing Distributed Video Assets. Amit P. Sheth,Clemens Bertram,Kshitij Shah 1999 VideoAnywhere: A System for Searching and Managing Distributed Video Assets. SIGMOD Record Unpacking The Semantics of Source and Usage To Perform Semantic Reconciliation In Large-Scale Information Systems. Ken Smith,Leo Obrst 1999 Semantic interoperability is a growing challenge in the United States Department of Defense (DoD). In this paper, we describe the basis of an infrastructure for the reconciliation of relevant, but semantically heterogeneous attribute values. Three types of information are described which can be used to infer the context of attributes, making explicit hidden semantic conflicts and making it possible to adjust values appropriately. Through an extended example, we show how an automated integration agent can derive the transformations necessary to perform four tasks in a simple semantic reconciliation. SIGMOD Record Dynamic Service Matchmaking Among Agents in Open Information Environments. Katia P. Sycara,Matthias Klusch,Seth Widoff,Jianguo Lu 1999 Dynamic Service Matchmaking Among Agents in Open Information Environments. SIGMOD Record "Chair's Message." Richard T. Snodgrass 1999 "Chair's Message." SIGMOD Record "Chair's Message." Richard T. Snodgrass 1999 "Chair's Message." SIGMOD Record Reminiscences on Influential Papers. Richard T. Snodgrass 1999 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Richard T. Snodgrass,Serge Abiteboul,Sophie Cluet,Michael J. Franklin,Guy M. Lohman,David B. Lomet,Gultekin Özsoyoglu,Raghu Ramakrishnan,Kenneth A. Ross,Timos K. Sellis,Patrick Valduriez 1999 Reminiscences on Influential Papers. SIGMOD Record Efficient Materialization and Use of Views in Data Warehouses. Márcio Farias de Souza,Marcus Costa Sampaio 1999 Given the complexity of many queries over a Data Warehouse (DW), it is interesting to precompute and store in the DW the answer sets of some demanding operations, so-called materialized views. In this paper, we present an algorithm, including its experimental evaluation, which allows the materialization of several views simultaneously without losing sight of processing costs for queries using these materialized views. SIGMOD Record A Survey of Logical Models for OLAP Databases. Panos Vassiliadis,Timos K. Sellis 1999 In this paper, we present different proposals for multidimensional data cubes, which are the basic logical model for OLAP applications. We have grouped the work in the field into two categories: commercial tools (presented along with terminology and standards) and academic efforts. We further divide the academic efforts into two subcategories: the relational model extensions and the cube-oriented approaches. Finally, we attempt a comparative analysis of the various efforts. SIGMOD Record On Multi-Resolution Document Transmission in Mobile Web. Stanley M. T. Yau,Hong Va Leong,Dennis McLeod,Antonio Si 1999 We propose a multi-resolution transmission mechanism that allows various organizational units of a web document to be transferred and browsed according to the amount of information captured. We define the notion of information content for each individual organizational unit of a web document as an indication of its captured information.
The concept of information content is used as a foundation for defining the notion of relative information content, which determines the transmission order of various units. Our mechanism allows a web client to explore the more content-bearing portion of a web document earlier so as to be able to terminate browsing a possibly irrelevant document sooner. This scheme is based on our observation that different organizational units of a document contribute different amounts of information to the document. Such a multi-resolution transmission paradigm is particularly useful in the mobile web, where the wireless bandwidth is a scarce resource and browsing every document in detail would consume the bandwidth unnecessarily. This becomes more serious as web documents grow large, as is the case with technical documents. We then present a prototype of the system in Java and CORBA to illustrate its feasibility. ICDE Optimization of Hypothetical Queries in an OLAP Environment. Andrey Balmin,Yannis Papakonstantinou,Thanos Papadimitriou 2000 Optimization of Hypothetical Queries in an OLAP Environment. ICDE Accurate Estimation of the Cost of Spatial Selections. Ashraf Aboulnaga,Jeffrey F. Naughton 2000 Accurate Estimation of the Cost of Spatial Selections. ICDE Generalized Isolation Level Definitions. "Atul Adya,Barbara Liskov,Patrick E. O'Neil" 2000 Generalized Isolation Level Definitions. ICDE Oracle8i - The XML Enabled Data Management System. Sandeepan Banerjee,Vishu Krishnamurthy,Muralidhar Krishnaprasad,Ravi Murthy 2000 Oracle8i - The XML Enabled Data Management System. ICDE TheaterLoc: Using Information Integration Technology to Rapidly Build Virtual Applications. Greg Barish,Yi-Shin Chen,Dan DiPasquo,Craig A. Knoblock,Steven Minton,Ion Muslea,Cyrus Shahabi 2000 TheaterLoc: Using Information Integration Technology to Rapidly Build Virtual Applications. ICDE Device Database Systems. Philippe Bonnet,Praveen Seshadri 2000 Device Database Systems. ICDE Semiorder Database for Complex Activity Recognition in Multi-Sensory Environments. Shailendra K. Bhonsle,Amarnath Gupta,Simone Santini,Ramesh Jain 2000 Semiorder Database for Complex Activity Recognition in Multi-Sensory Environments. ICDE Dynamic Query Scheduling in Data Integration Systems. Luc Bouganim,Françoise Fabret,C. Mohan,Patrick Valduriez 2000 Dynamic Query Scheduling in Data Integration Systems. ICDE Join Enumeration in a Memory-Constrained Environment. Ivan T. Bowman,G. N. Paulley 2000 Join Enumeration in a Memory-Constrained Environment. ICDE Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces. Stefan Berchtold,Christian Böhm,H. V. Jagadish,Hans-Peter Kriegel,Jörg Sander 2000 Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces. ICDE "Indexing High-Dimensional Spaces: Database Support for Next Decade's Applications." Stefan Berchtold,Daniel A. Keim 2000 "Indexing High-Dimensional Spaces: Database Support for Next Decade's Applications." ICDE Semantic Conditions for Correctness at Different Isolation Levels. Arthur J. Bernstein,Philip M. Lewis,Shiyong Lu 2000 Semantic Conditions for Correctness at Different Isolation Levels. ICDE Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases. Bernhard Braunmüller,Martin Ester,Hans-Peter Kriegel,Jörg Sander 2000 Efficiently Supporting Multiple Similarity Queries for Mining in Metric Databases. ICDE Declustering Using Golden Ratio Sequences. Randeep Bhatia,Rakesh K.
Sinha,Chung-Min Chen 2000 Declustering Using Golden Ratio Sequences. ICDE Automating Statistics Management for Query Optimizers. Surajit Chaudhuri,Vivek R. Narasayya 2000 Statistics play a key role in influencing the quality of plans chosen by a database query optimizer. In this paper, we identify the statistics that are essential for an optimizer. We introduce novel techniques that help significantly reduce the set of statistics that need to be created without sacrificing the quality of query plans generated. We discuss how these techniques can be leveraged to automate statistics management in databases. We have implemented and experimentally evaluated our approach on Microsoft SQL Server 7.0. ICDE Answering Regular Path Queries Using Views. Diego Calvanese,Giuseppe De Giacomo,Maurizio Lenzerini,Moshe Y. Vardi 2000 Answering Regular Path Queries Using Views. ICDE An Algebraic Compression Framework for Query Results. Zhiyuan Chen,Praveen Seshadri 2000 An Algebraic Compression Framework for Query Results. ICDE A Data-Warehouse/OLAP Framework for Scalable Telecommunication Tandem Traffic Analysis. Qiming Chen,Meichun Hsu,Umeshwar Dayal 2000 A Data-Warehouse/OLAP Framework for Scalable Telecommunication Tandem Traffic Analysis. ICDE Self-Adaptive User Profiles for Large-Scale Data Delivery. Ugur Çetintemel,Michael J. Franklin,C. Lee Giles 2000 Self-Adaptive User Profiles for Large-Scale Data Delivery. ICDE Efficient Query Refinement in Multimedia Databases. Kaushik Chakrabarti,Kriengkrai Porkaew,Sharad Mehrotra 2000 Efficient Query Refinement in Multimedia Databases. ICDE Discovering Temporal Association Rules: Algorithms, Language and System. Xiaodong Chen,Ilias Petrounias 2000 Discovering Temporal Association Rules: Algorithms, Language and System. ICDE XML and DB2. Josephine M. Cheng,Jane Xu 2000 XML and DB2. ICDE Mobile and Wireless Database Access for Pervasive Computing. Panos K. Chrysanthis,Evaggelia Pitoura 2000 Mobile and Wireless Database Access for Pervasive Computing. ICDE PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces. Paolo Ciaccia,Marco Patella 2000 PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces. ICDE Finding Interesting Associations without Support Pruning. Edith Cohen,Mayur Datar,Shinji Fujiwara,Aristides Gionis,Piotr Indyk,Rajeev Motwani,Jeffrey D. Ullman,Cheng Yang 2000 Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed and conduct experiments on real and synthetic data to obtain a comparative performance analysis. ICDE Efficient Query Subscription Processing in a Multicast Environment. Arturo Crespo,Orkut Buyukkokten,Hector Garcia-Molina 2000 Efficient Query Subscription Processing in a Multicast Environment. ICDE Practical Lineage Tracing in Data Warehouses. 
Yingwei Cui,Jennifer Widom 2000 Practical Lineage Tracing in Data Warehouses. ICDE Lineage Tracing in a Data Warehousing System. Yingwei Cui,Jennifer Widom 2000 Lineage Tracing in a Data Warehousing System. ICDE Data Redundancy and Duplicate Detection in Spatial Join Processing. Jens-Peter Dittrich,Bernhard Seeger 2000 Data Redundancy and Duplicate Detection in Spatial Join Processing. ICDE The MARIFlow Workflow Management System. Asuman Dogac,M. Ezbiderli,Yusuf Tambag,C. Icdem,Arif Tumer,Nesime Tatbul,N. Hamali,Catriel Beeri 2000 The MARIFlow Workflow Management System. ICDE Dynamic Histograms: Capturing Evolving Data Sets. Donko Donjerkovic,Yannis E. Ioannidis,Raghu Ramakrishnan 2000 Dynamic Histograms: Capturing Evolving Data Sets. ICDE The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses. Martin Ester,Jörn Kohlhammer,Hans-Peter Kriegel 2000 The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses. ICDE MetaComm: A Meta-Directory for Telecommunications. Juliana Freire,Daniel F. Lieuwen,Joann J. Ordille,Lalit Garg,Michael Holder,Hector Urroz,Gavin Michael,Julian Orbach,Luke Tucker,Qian Ye,Robert M. Arlein 2000 MetaComm: A Meta-Directory for Telecommunications. ICDE Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning. Shinji Fujiwara,Jeffrey D. Ullman,Rajeev Motwani 2000 Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning. ICDE An Extensible Framework for Data Cleaning. Helena Galhardas,Daniela Florescu,Dennis Shasha,Eric Simon 2000 An Extensible Framework for Data Cleaning. ICDE DEMON: Mining and Monitoring Evolving Data. Venkatesh Ganti,Johannes Gehrke,Raghu Ramakrishnan 2000 Data mining algorithms have been the focus of much research recently. In practice, the input data to a data mining process resides in a large data warehouse whose data is kept up-to-date through periodic or occasional addition and deletion of blocks of data. Most data mining algorithms have either assumed that the input data is static, or have been designed for arbitrary insertions and deletions of data records. In this paper, we consider a dynamic environment that evolves through systematic addition or deletion of blocks of data. We introduce a new dimension, called the data span dimension, which allows user-defined selections of a temporal subset of the database. Taking this new degree of freedom into account, we describe efficient model maintenance algorithms for frequent itemsets and clusters. We then describe a generic algorithm that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on the data span dimension. We also develop an algorithm for automatically discovering a specific class of interesting block selection sequences. In a detailed experimental study, we examine the validity and performance of our ideas on synthetic and real datasets. ICDE Data Mining with Decision Trees. Johannes Gehrke 2000 Data Mining with Decision Trees. ICDE Managing Escalation of Collaboration Processes in Crisis Mitigation Situations. Dimitrios Georgakopoulos,Hans Schuster,Donald Baker,Andrzej Cichocki 2000 Managing Escalation of Collaboration Processes in Crisis Mitigation Situations. ICDE Squeezing the Most out of Relational Database Systems. Jonathan Goldstein,Raghu Ramakrishnan 2000 Squeezing the Most out of Relational Database Systems. ICDE Efficient Mining of Constrained Correlated Sets. Gösta Grahne,Laks V. S. 
Lakshmanan,Xiaohong Wang 2000 Efficient Mining of Constrained Correlated Sets. ICDE Rules of Thumb in Data Engineering. Jim Gray,Prashant J. Shenoy 2000 Rules of Thumb in Data Engineering. ICDE READY: A High Performance Event Notification Service. Robert E. Gruber,Balachander Krishnamurthy,Euthimios Panagos 2000 READY: A High Performance Event Notification Service. ICDE Distributed Query Processing on the Web. Nalin Gupta,Jayant R. Haritsa,Maya Ramanath 2000 Distributed Query Processing on the Web. ICDE The IDEAL Approach to Internet-Based Negotiation for E-Business. Joachim Hammer,Chunbo Huang,Yihua Huang,Charnyote Pluempitiwiriyawej,Minsoo Lee,Haifei Li,Liu Wang,Youzhong Liu,Stanley Y. W. Su 2000 The IDEAL Approach to Internet-Based Negotiation for E-Business. ICDE Web Information Retrieval. Monika Rauch Henzinger 2000 Web Information Retrieval. ICDE Optimization Techniques for Data-Intensive Decision Flows. Richard Hull,François Llirbat,Bharat Kumar,Gang Zhou,Guozhu Dong,Jianwen Su 2000 Optimization Techniques for Data-Intensive Decision Flows. ICDE Power Conservative Multi-Attribute Queries on Data Broadcast. Qinglong Hu,Wang-Chien Lee,Dik Lun Lee 2000 Power Conservative Multi-Attribute Queries on Data Broadcast. ICDE Speeding up View Maintenance Using Cheap Filters at the Warehouse. Nam Huyn 2000 Speeding up View Maintenance Using Cheap Filters at the Warehouse. ICDE A Novel Deadline Driven Disk Scheduling Algorithm for Multi-Priority Multimedia Objects. Ibrahim Kamel,T. Niranjan,Shahram Ghandeharizadeh 2000 A Novel Deadline Driven Disk Scheduling Algorithm for Multi-Priority Multimedia Objects. ICDE Efficient Storage of XML Data. Carl-Christian Kanne,Guido Moerkotte 2000 Efficient Storage of XML Data. ICDE Query Planning with Limited Source Capabilities. Chen Li,Edward Y. Chang 2000 Query Planning with Limited Source Capabilities. ICDE Analyzing Range Queries on Spatial Data. Ji Jin,Ning An,Anand Sivasubramaniam 2000 Analyzing Range Queries on Spatial Data. ICDE Optimal Index and Data Allocation in Multiple Broadcast Channels. Shou-Chih Lo,Arbee L. P. Chen 2000 Optimal Index and Data Allocation in Multiple Broadcast Channels. ICDE Similarity Search for Multidimensional Data Sequences. Seok-Lyong Lee,Seok-Ju Chun,Deok-Hwan Kim,Ju-Hong Lee,Chin-Wan Chung 2000 Similarity Search for Multidimensional Data Sequences. ICDE XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. Ling Liu,Calton Pu,Wei Han 2000 XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. ICDE Approximate Query Answering with Frequent Sets and Maximum Entropy. Heikki Mannila,Padhraic Smyth 2000 Approximate Query Answering with Frequent Sets and Maximum Entropy. ICDE Scalable Algorithms for Large Temporal Aggregation. Bongki Moon,Inés Fernando Vega López,Vijaykumar Immanuel 2000 Scalable Algorithms for Large Temporal Aggregation. ICDE Database Technology for Internet Applications (Abstract). Anil Nori 2000 Database Technology for Internet Applications (Abstract). ICDE DISIMA: An Object-Oriented Approach to Developing an Image Database System. Vincent Oria,M. Tamer Özsu,Paul Iglinski,Bing Xu,L. Irene Cheng 2000 DISIMA: An Object-Oriented Approach to Developing an Image Database System. ICDE A Multimedia Information Server with Mixed Workload Scheduling. Guido Nerjes 2000 A Multimedia Information Server with Mixed Workload Scheduling. ICDE Deflating the Dimensionality Curse Using Multiple Fractal Dimensions. 
Bernd-Uwe Pagel,Flip Korn,Christos Faloutsos 2000 Deflating the Dimensionality Curse Using Multiple Fractal Dimensions. ICDE Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. Sanghyun Park,Wesley W. Chu,Jeehee Yoon,Chih-Cheng Hsu 2000 Efficient Searches for Similar Subsequences of Different Lengths in Sequence Databases. ICDE Landmarks: a New Model for Similarity-based Pattern Querying in Time Series Databases. Chang-Shing Perng,Haixun Wang,Sylvia R. Zhang,Douglas Stott Parker Jr. 2000 Landmarks: a New Model for Similarity-based Pattern Querying in Time Series Databases. ICDE Multi-Level Multi-Channel Air Cache Designs for Broadcasting in a Mobile Environment. Kiran Prabhakara,Kien A. Hua,Jung-Hwan Oh 2000 Multi-Level Multi-Channel Air Cache Designs for Broadcasting in a Mobile Environment. ICDE Taming the Downtime: High Availability in Sybase ASE 12. S. Raghuram,Sheshadri Ranganath,Steve Olson,Subrata Nandi 2000 Taming the Downtime: High Availability in Sybase ASE 12. ICDE Extracting Delta for Incremental Data Warehouse Maintenance. Prabhu Ram,Lyman Do 2000 Extracting Delta for Incremental Data Warehouse Maintenance. ICDE On-Line Schema Update for a Telecom Database. Mikael Ronström 2000 On-Line Schema Update for a Telecom Database. ICDE Metadata Propagation in Large, Multi-Layer Database Systems. Arnon Rosenthal,Edward Sciore 2000 Metadata Propagation in Large, Multi-Layer Database Systems. ICDE A Semi-Structured Data Cartridge for Relational Databases. Fei Sha,Georges Gardarin,Laurent Némirovski 2000 A Semi-Structured Data Cartridge for Relational Databases. ICDE SQLServer for Windows CE - A Database Engine for Mobile and Embedded Platforms. Praveen Seshadri,Phil Garrett 2000 SQLServer for Windows CE - A Database Engine for Mobile and Embedded Platforms. ICDE The Collaboration Management Infrastructure. Hans Schuster,Donald Baker,Andrzej Cichocki,Dimitrios Georgakopoulos,Marek Rusinkiewicz 2000 The Collaboration Management Infrastructure. ICDE Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering. Giedrius Slivinskas,Christian S. Jensen,Richard T. Snodgrass 2000 Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering. ICDE Extensible Indexing: A Framework for Integrating Domain-Specific Indexing Schemes into Oracle8i. Jagannathan Srinivasan,Ravi Murthy,Seema Sundara,Nipun Agarwal,Samuel DeFazio 2000 Extensible Indexing: A Framework for Integrating Domain-Specific Indexing Schemes into Oracle8i. ICDE Directories: Managing Data for Networked Applications. Divesh Srivastava 2000 Directories: Managing Data for Networked Applications. ICDE ReQueSS: Relational Querying of Semi-Structured Data. Rajshekhar Sunderraman 2000 ReQueSS: Relational Querying of Semi-Structured Data. ICDE Assisting the Integration of Taxonomic Data: The LITCHI Toolkit. Iain Sutherland,John S. Robinson,Sue M. Brandt,Andrew C. Jones,Suzanne M. Embury,W. A. Gray,Richard J. White,Frank A. Bisby 2000 Assisting the Integration of Taxonomic Data: The LITCHI Toolkit. ICDE Mining Bases for Association Rules Using Closed Sets. Rafik Taouil,Nicolas Pasquier,Yves Bastide,Lotfi Lakhal 2000 Mining Bases for Association Rules Using Closed Sets. ICDE In-Memory Data Management in the Application Tier. Times-Ten Team 2000 In-Memory Data Management in the Application Tier. ICDE Creating a Customized Access Method for Blobworld. Megan Thomas,Chad Carson,Joseph M. 
Hellerstein 2000 We present the design and analysis of a customized access method for the content-based image retrieval system, Blobworld. Using the amdb access method analysis tool, we analyzed three existing multidimensional access methods to support nearest neighbor search in the context of the Blobworld application. Based on this analysis, we propose several variants of the R-tree, tailored to address the problems the analysis revealed. We implemented the access methods we propose in the Generalized Search Trees (GiST) framework and analyzed them. We found that two of our access methods have better performance characteristics for the Blobworld application than any of the traditional multi-dimensional access methods we examined. Based on this experience, we draw conclusions for nearest neighbor access method design, and for the task of constructing custom access methods tailored to particular applications. ICDE Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees. Caetano Traina Jr.,Agma J. M. Traina,Christos Faloutsos 2000 Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees. ICDE The Changing Art of Computer Research. Dennis Tsichritzis 2000 The Changing Art of Computer Research. ICDE DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. Gary Valentin,Michael Zuliani,Daniel C. Zilio,Guy M. Lohman,Alan Skelley 2000 DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. ICDE Location Prediction and Queries for Tracking Moving Objects. Ouri Wolfson,Bo Xu,Sam Chamberlain 2000 Location Prediction and Queries for Tracking Moving Objects. ICDE User Defined Aggregates in Object-Relational Systems. Haixun Wang,Carlo Zaniolo 2000 User Defined Aggregates in Object-Relational Systems. ICDE CMP: A Fast Decision Tree Classifier Using Multivariate Predictions. Haixun Wang,Carlo Zaniolo 2000 CMP: A Fast Decision Tree Classifier Using Multivariate Predictions. ICDE Pure Java Databases for Deployed Applications. Nat Wyatt 2000 Pure Java Databases for Deployed Applications. ICDE Interactive-Time Similarity Search for Large Image Collections Using Parallel VA-Files. Roger Weber,Klemens Böhm,Hans-Jörg Schek 2000 Interactive-Time Similarity Search for Large Image Collections Using Parallel VA-Files. ICDE The Mentor-Lite Prototype: A Light-Weight Workflow Management System. Jeanine Weißenfels,Michael Gillmann,Olivier Roth,German Shegalov,Wolfgang Wonner 2000 The Mentor-Lite Prototype: A Light-Weight Workflow Management System. ICDE Web Query Optimizer. Vladimir Zadorozhny,Laura Bright,Louiqa Raschid,Tolga Urhan,Maria-Esther Vidal 2000 Web Query Optimizer. ICDE Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. Osmar R. Zaïane,Jiawei Han,Hua Zhu 2000 Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE Association-Based Multiple Imputation in Multivariate Datasets: A Summary. Weixiong Zhang 2000 Association-Based Multiple Imputation in Multivariate Datasets: A Summary. ICDE Clustering Categorical Data. Zhang Yi,Ada Wai-Chee Fu,Chun Hing Cai,Pheng-Ann Heng 2000 Clustering Categorical Data. ICDE Image Database Retrieval with Multiple-Instance Learning Techniques. Cheng Yang,Tomás Lozano-Pérez 2000 Image Database Retrieval with Multiple-Instance Learning Techniques. ICDE Online Data Mining for Co-Evolving Time Sequences. Byoung-Kee Yi,Nikolaos Sidiropoulos,Theodore Johnson,H. V. Jagadish,Christos Faloutsos,Alexandros Biliris 2000 Online Data Mining for Co-Evolving Time Sequences. 
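The Thomas, Carson, and Hellerstein entry above reports tailoring R-tree variants (built in the GiST framework) to nearest-neighbor search for Blobworld. For orientation only, the sketch below shows the standard best-first nearest-neighbor traversal that such access methods support, driven by a priority queue ordered on a lower-bound distance; the Node layout and helper names are illustrative assumptions, not the paper's implementation.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]   # (low corner, high corner)

@dataclass
class Node:
    # Illustrative R-tree-style node: inner nodes hold children, leaves hold points.
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)

def mindist(q: Point, mbr: Rect) -> float:
    # Squared minimum distance from query point q to a bounding rectangle.
    lo, hi = mbr
    return sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi))

def dist(q: Point, p: Point) -> float:
    # Squared distance between two points.
    return sum((a - b) ** 2 for a, b in zip(q, p))

def nearest_neighbor(root: Node, q: Point) -> Optional[Point]:
    # Best-first traversal: always expand the queue entry with the smallest bound.
    # Because mindist never overestimates, the first data point popped is exact.
    tie = itertools.count()          # tie-breaker so heapq never compares nodes
    heap = [(mindist(q, root.mbr), next(tie), root, None)]
    while heap:
        _, _, node, point = heapq.heappop(heap)
        if point is not None:
            return point
        for child in node.children:
            heapq.heappush(heap, (mindist(q, child.mbr), next(tie), child, None))
        for p in node.points:
            heapq.heappush(heap, (dist(q, p), next(tie), None, p))
    return None

# Tiny usage example: two leaves under one root.
leaf_a = Node(mbr=((0.0, 0.0), (3.0, 3.0)), points=[(1.0, 1.0), (2.5, 2.5)])
leaf_b = Node(mbr=((5.0, 5.0), (9.0, 9.0)), points=[(6.0, 6.0), (8.0, 7.0)])
root = Node(mbr=((0.0, 0.0), (9.0, 9.0)), children=[leaf_a, leaf_b])
print(nearest_neighbor(root, (2.0, 2.0)))   # -> (2.5, 2.5)
```

The customized access methods in the entry above refine exactly this traversal (node layout, splitting, and distance bounds), which is why the quality of the bounding regions dominates nearest-neighbor performance.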
ICDE Developing Cost Models with Qualitative Variables for Dynamic Multidatabase Environments. Qiang Zhu,Yu Sun,Satyanarayana Motheramgari 2000 Developing Cost Models with Qualitative Variables for Dynamic Multidatabase Environments. ICDE Probabilistic Data Consistency for Wide-Area Applications. Hengming Zou,Nandit Soparkar,Farnam Jahanian 2000 Probabilistic Data Consistency for Wide-Area Applications. ICDE ACQ: An Automatic Clustering and Querying Approach for Large Image Databases. Dantong Yu,Aidong Zhang 2000 ACQ: An Automatic Clustering and Querying Approach for Large Image Databases. ICDE Proceedings of the 16th International Conference on Data Engineering, 28 February - 3 March, 2000, San Diego, California, USA 2000 Proceedings of the 16th International Conference on Data Engineering, 28 February - 3 March, 2000, San Diego, California, USA SIGMOD Conference Indexing Images in Oracle8i. Melliyal Annamalai,Rajiv Chopra,Samuel DeFazio 2000 Indexing Images in Oracle8i. SIGMOD Conference Congressional Samples for Approximate Answering of Group-By Queries. Swarup Acharya,Phillip B. Gibbons,Viswanath Poosala 2000 In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex decision support queries using precomputed summary statistics, such as samples. Decision support queries routinely segment the data into groups and then aggregate the information in each group (group-by queries). Depending on the data, there can be a wide disparity between the number of data items in each group. As a result, approximate answers based on uniform random samples of the data can result in poor accuracy for groups with very few data items, since such groups will be represented in the sample by very few (often zero) tuples. In this paper, we propose a general class of techniques for obtaining fast, highly-accurate answers for group-by queries. These techniques rely on precomputed non-uniform (biased) samples of the data. In particular, we propose congressional samples, a hybrid union of uniform and biased samples. Given a fixed amount of space, congressional samples seek to maximize the accuracy for all possible group-by queries on a set of columns. We present a one pass algorithm for constructing a congressional sample and use this technique to also incrementally maintain the sample up-to-date without accessing the base relation. We also evaluate query rewriting strategies for providing approximate answers from congressional samples. Finally, we conduct an extensive set of experiments on the TPC-D database, which demonstrates the efficacy of the techniques proposed. SIGMOD Conference Privacy-Preserving Data Mining. Rakesh Agrawal,Ramakrishnan Srikant 2000 A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. 
While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data. SIGMOD Conference A Framework for Expressing and Combining Preferences. Rakesh Agrawal,Edward L. Wimmers 2000 The advent of the World Wide Web has created an explosion in the available on-line information. As the range of potential choices expands, the time and effort required to sort through them also expands. We propose a formal framework for expressing and combining user preferences to address this problem. Preferences can be used to focus search queries and to order the search results. A preference is expressed by the user for an entity which is described by a set of named fields; each field can take on values from a certain type. The * symbol may be used to match any element of that type. A set of preferences can be combined using a generic combine operator which is instantiated with a value function, thus providing a great deal of flexibility. The same preferences can be combined in more than one way, and a combination of preferences yields another preference, thus providing the closure property. We demonstrate the power of our framework by illustrating how a currently popular personalization system and a real-life application can be realized as special cases of our framework. We also discuss implementation of the framework in a relational setting. SIGMOD Conference Finding Generalized Projected Clusters In High Dimensional Spaces. Charu C. Aggarwal,Philip S. Yu 2000 High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to trade off with better accuracy. SIGMOD Conference Data Management in eCommerce: The Good, the Bad, and the Ugly. Avigdor Gal 2000 Data Management in eCommerce: The Good, the Bad, and the Ugly. SIGMOD Conference javax.XXL: A prototype for a Library of Query processing Algorithms. Jochen Van den Bercken,Jens-Peter Dittrich,Bernhard Seeger 2000 Therefore, index structures can easily be used in queries. A typical example is a join cursor which consumes the outputs of two underlying cursors. Most of our work is however not dedicated to the area of relational databases, but mainly refers to spatial and temporal data.
For spatial databases, for example, we provide several implementations of spatial join algorithms [3]. The cursor-based processing is however the major advantage of XXL in contrast to approaches like LEDA [6] and TPIE [7]. For more information on XXL see http://www.mathematik.uni-marburg.de/DBS/xxl. We will demonstrate the latest version of XXL using examples to show its core functionality. We will concentrate on three key aspects of XXL. Usage: We show how easily state-of-the-art spatial join-algorithms can be implemented in XXL using data from different sources. Reuse: We will demonstrate how to support different joins, e.g. spatial and temporal joins, using the same generic algorithm like Plug&Join [1]. Comparability: We will demonstrate how XXL serves as an ideal testbed to compare query processing algorithms and index structures. SIGMOD Conference TerraServer: A Spatial Data Warehouse. Tom Barclay,Donald R. Slutz,Jim Gray 2000 TerraServer: A Spatial Data Warehouse. SIGMOD Conference Tutorial: Data Access. José A. Blakeley,Anand Deshpande 2000 Tutorial: Data Access. SIGMOD Conference Tutorial: Designing an Ultra Highly Available DBMS. Svein Erik Bratsberg,Øystein Torbjørnsen 2000 Tutorial: Designing an Ultra Highly Available DBMS. SIGMOD Conference Integrating Replacement Policies in StorM: An Extensible Approach. Chong Leng Goh,Beng Chin Ooi,Stéphane Bressan,Kian-Lee Tan 2000 Integrating Replacement Policies in StorM: An Extensible Approach. SIGMOD Conference LOF: Identifying Density-Based Local Outliers. Markus M. Breunig,Hans-Peter Kriegel,Raymond T. Ng,Jörg Sander 2000 For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances, or the outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical. SIGMOD Conference "On XML and Databases: Where's the Beef? (Panel Abstract)." Michael J. Carey 2000 "On XML and Databases: Where's the Beef? (Panel Abstract)." SIGMOD Conference The Onion Technique: Indexing for Linear Optimization Queries. Yuan-Chi Chang,Lawrence D. Bergman,Vittorio Castelli,Chung-Sheng Li,Ming-Ling Lo,John R. Smith 2000 This paper describes the Onion technique, a special indexing structure for linear optimization queries. Linear optimization queries ask for top-N records subject to the maximization or minimization of a linearly weighted sum of record attribute values. Such queries appear in many applications employing linear models and are an effective way to summarize representative cases, such as the top-50 ranked colleges. The Onion indexing is based on a geometric property of the convex hull, which guarantees that the optimal value can always be found at one or more of its vertices.
The Onion indexing makes use of this property to construct convex hulls in layers with outer layers enclosing inner layers geometrically. A data record is indexed by its layer number or equivalently its depth in the layered convex hull. Queries with linear weightings issued at run time are evaluated from the outermost layer inwards. We show experimentally that the Onion indexing achieves orders of magnitude speedup against sequential linear scan when N is small compared to the cardinality of the set. The Onion technique also enables progressive retrieval, which processes and returns ranked results in a progressive manner. Furthermore, the proposed indexing can be extended into a hierarchical organization of data to accommodate both global and local queries. SIGMOD Conference Internet Traffic Warehouse. Chung-Min Chen,Munir Cochinwala,Claudio Petrone,Marc Pucci,Sunil Samtani,Patrizia Santa,Marco Mesiti 2000 Internet Traffic Warehouse. SIGMOD Conference Fact: A Learning Based Web Query Processing System. Songting Chen,Yanlei Diao,Hongjun Lu,Zengping Tian 2000 Though the query is posted in key words, the returned results contain exactly the information that the user is querying for, which may not be explicitly specified in the input query. The required information is often not contained in the Web pages whose URLs are returned by a search engine. FACT is capable of navigating in the neighborhood of these pages to find those that really contain the queried segments. The system does not require prior knowledge about users such as user profiles [1] or preprocessing of Web pages such as wrapper generation [2]. A prototype system has been implemented using the approach. It learns and applies two types of knowledge, navigation knowledge for following hyperlinks and classification knowledge for queried segment identification. For learning, it supports three training strategies, namely sequential training, random training and interleaved training. Yahoo! is currently the external search engine. The URLs of Web pages returned by the external search engine are used in processing. A set of experiments was designed to evaluate the system and to compare different implementations, such as knowledge representations and training strategies. SIGMOD Conference NiagaraCQ: A Scalable Continuous Query System for Internet Databases. Jianjun Chen,David J. DeWitt,Feng Tian,Yuan Wang 2000 "Continuous queries are persistent queries that allow users to receive new results when they become available. While continuous query systems can transform a passive web into an active environment, they need to be able to support millions of queries due to the scale of the Internet. No existing systems have achieved this level of scalability. NiagaraCQ addresses this problem by grouping continuous queries based on the observation that many web queries share similar structures. Grouped queries can share the common computation, tend to fit in memory and can reduce the I/O cost significantly. Furthermore, grouping on selection predicates can eliminate a large number of unnecessary query invocations. Our grouping technique is distinguished from previous group optimization approaches in the following ways. First, we use an incremental group optimization strategy with dynamic re-grouping. New queries are added to existing query groups, without having to regroup already installed queries. Second, we use a query-split scheme that requires minimal changes to a general-purpose query engine.
Third, NiagaraCQ groups both change-based and timer-based queries in a uniform way. To ensure that NiagaraCQ is scalable, we have also employed other techniques including incremental evaluation of continuous queries, use of both pull and push models for detecting heterogeneous data source changes, and memory caching. This paper presents the design of the NiagaraCQ system and gives some experimental results on the system's performance and scalability." SIGMOD Conference Synchronizing a Database to Improve Freshness. Junghoo Cho,Hector Garcia-Molina 2000 In this paper we study how to refresh a local copy of an autonomous data source to maintain the copy up-to-date. As the size of the data grows, it becomes more difficult to maintain the copy “fresh,” making it crucial to synchronize the copy effectively. We define two freshness metrics, change models of the underlying data, and synchronization policies. We analytically study how effective the various policies are. We also experimentally verify our analysis, based on data collected from 270 web sites for more than 4 months, and we show that our new policy improves the “freshness” very significantly compared to current policies in use. SIGMOD Conference Finding Replicated Web Collections. Junghoo Cho,Narayanan Shivakumar,Hector Garcia-Molina 2000 Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web. SIGMOD Conference On Wrapping Query Languages and Efficient XML Integration. Vassilis Christophides,Sophie Cluet,Jérôme Siméon 2000 Modern applications (Web portals, digital libraries, etc.) require integrated access to various information sources (from traditional DBMS to semistructured Web repositories), fast deployment and low maintenance cost in a rapidly evolving environment. Because of its flexibility, there is an increasing interest in using XML as a middleware model for such applications. XML enables fast wrapping and declarative integration. However, query processing in XML-based integration systems is still penalized by the lack of an algebra with adequate optimization properties and the difficulty to understand source query capabilities. In this paper, we propose an algebraic approach to support efficient XML query evaluation. We define a general purpose algebra suitable for semistructured or XML query languages. We show how this algebra can be used, with appropriate type information, to also wrap more structured query languages such as OQL or SQL. Finally, we develop new optimization techniques for XML-based integration systems. SIGMOD Conference Closest Pair Queries in Spatial Databases.
Antonio Corral,Yannis Manolopoulos,Yannis Theodoridis,Michael Vassilakopoulos 2000 This paper addresses the problem of finding the K closest pairs between two spatial data sets, where each set is stored in a structure belonging in the R-tree family. Five different algorithms (four recursive and one iterative) are presented for solving this problem. The case of 1 closest pair is treated as a special case. An extensive study, based on experiments performed with synthetic as well as with real point data sets, is presented. A wide range of values for the basic parameters affecting the performance of the algorithms, especially the effect of overlap between the two data sets, is explored. Moreover, an algorithmic as well as an experimental comparison with existing incremental algorithms addressing the same problem is presented. In most settings, the new algorithms proposed clearly outperform the existing ones. SIGMOD Conference Spatial Join Selectivity Using Power Laws. Christos Faloutsos,Bernhard Seeger,Agma J. M. Traina,Caetano Traina Jr. 2000 We discovered a surprising law governing the spatial join selectivity across two sets of points. An example of such a spatial join is “find the libraries that are within 10 miles of schools”. Our law dictates that the number of such qualifying pairs follows a power law, whose exponent we call “pair-count exponent” (PC). We show that this law also holds for self-spatial-joins (“find schools within 5 miles of other schools”) in addition to the general case that the two point-sets are distinct. Our law holds for many real datasets, including diverse environments (geographic datasets, feature vectors from biology data, galaxy data from astronomy). In addition, we introduce the concept of the Box-Occupancy-Product-Sum (BOPS) plot, and we show that it can compute the pair-count exponent in a timely manner, reducing the run time by orders of magnitude, from quadratic to linear. Due to the pair-count exponent and our analysis (Law 1), we can achieve accurate selectivity estimates in constant time (O(1)) without the need for sampling or other expensive operations. The relative error in selectivity is about 30% with our fast BOPS method, and even better (about 10%), if we use the slower, quadratic method. SIGMOD Conference A Data Model and Data Structures for Moving Objects Databases. Luca Forlizzi,Ralf Hartmut Güting,Enrico Nardelli,Markus Schneider 2000 We consider spatio-temporal databases supporting spatial objects with continuously changing position and extent, termed moving objects databases. We formally define a data model for such databases that includes complex evolving spatial structures such as line networks or multi-component regions with holes. The data model is given as a collection of data types and operations which can be plugged as attribute types into any DBMS data model (e.g. relational, or object-oriented) to obtain a complete model and query language. A particular novel concept is the sliced representation which represents a temporal development as a set of units, where unit types for spatial and other data types represent certain “simple” functions of time. We also show how the model can be mapped into concrete physical data structures in a DBMS environment. SIGMOD Conference lambda-DB: An ODMG-Based Object-Oriented DBMS. 
Leonidas Fegaras,Chandrasekhar Srinivasan,Arvind Rajendran,David Maier 2000 The λ-DB project at the University of Texas at Arlington aims at developing frameworks and prototype systems that address the new query optimization challenges for object-oriented and object-relational databases, such as query nesting, multiple collection types, methods, and arbitrary nesting of collections. We have already developed a theoretical framework for query optimization based on an effective calculus, called the monoid comprehension calculus [4]. The system reported here is a fully operational ODMG 2.0 [2] OODB management system, based on this framework. Our system can handle most ODL declarations and can process most OQL query forms. λ-DB is not ODMG compliant. Instead it supports its own C++ binding that provides a seamless integration between OQL and C++ with low impedance mismatch. It allows C++ variables to be used in queries and results of queries to be passed back to C++ programs. Programs expressed in our C++ binding are compiled by a preprocessor that performs query optimization at compile time, rather than run-time, as proposed by ODMG. In addition to compiled queries, λ-DB provides an interpreter that evaluates ad-hoc OQL queries at run-time. The λ-DB system architecture is shown in Figure 1. The λ-DB evaluation engine is written in SDL (the SHORE Data Language) of the SHORE object management system [1], developed at the University of Wisconsin. ODL schemas are translated into SDL schemas in a straightforward way and are stored in the system catalog. The λ-DB OQL compiler is a C++ preprocessor that accepts a language called λ-OQL, which is C++ code with embedded DML commands to perform transactions, queries, updates, etc. The preprocessor translates λ-OQL programs into C++ code that contains calls to the λ-DB evaluation engine. We also provide a visual query formulation interface, called VOODOO, and a translator from visual queries to OQL text, which can be sent to the λ-DB OQL interpreter for evaluation. Even though a lot of effort has been made to make the implementation of our system simple enough for other database researchers to use and extend, our system is quite sophisticated since it employs current state-of-the-art query optimization technologies as well as new advanced experimental optimization techniques which we have developed through the years, such as query unnesting [3]. The λ-DB OODBMS is available as open source software through the web at http://lambda.uta.edu/lambda-DB.html SIGMOD Conference AJAX: An Extensible Data Cleaning Tool. Helena Galhardas,Daniela Florescu,Dennis Shasha,Eric Simon 2000 @@@@ groups together matching pairs with a high similarity value by applying a given grouping criterion (e.g. by transitive closure). Finally, merging collapses each individual cluster into a tuple of the resulting data source. AJAX provides @@@@ for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express these transformations. AJAX also @@@@. It allows the user to interact with an executing data cleaning program to handle exceptional cases and to inspect intermediate results.
Finally, AJAX provides @@@@ @@@@ that permits users to determine the source and processing of data for debugging purposes.We will present the AJAX system applied to two real world problems: the consolidation of a telecommunication database, and the conversion of a dirty database of bibliographic references into a set of clean, normalized, and redundancy free relational tables maintaining the same data. SIGMOD Conference WSQ/DSQ: A Practical Approach for Combined Querying of Databases and the Web. Roy Goldman,Jennifer Widom 2000 We present WSQ/DSQ (pronounced “wisk-disk”), a new approach for combining the query facilities of traditional databases with existing search engines on the Web. WSQ, for Web-Supported (Database) Queries, leverages results from Web searches to enhance SQL queries over a relational database. DSQ, for Database-Supported (Web) Queries, uses information stored in the database to enhance and explain Web searches. This paper focuses primarily on WSQ, describing a simple, low-overhead way to support WSQ in a relational DBMS, and demonstrating the utility of WSQ with a number of interesting queries and results. The queries supported by WSQ are enabled by two virtual tables, whose tuples represent Web search results generated dynamically during query execution. WSQ query execution may involve many high-latency calls to one or more search engines, during which the query processor is idle. We present a lightweight technique called asynchronous iteration that can be integrated easily into a standard sequential query processor to enable concurrency between query processing and multiple Web search requests. Asynchronous iteration has broader applications than WSQ alone, and it opens up many interesting query optimization issues. We have developed a prototype implementation of WSQ by extending a DBMS with virtual tables and asynchronous iteration; performance results are reported. SIGMOD Conference XTRACT: A System for Extracting Document Type Descriptors from XML Documents. Minos N. Garofalakis,Aristides Gionis,Rajeev Rastogi,S. Seshadri,Kyuseok Shim 2000 "XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate “general” candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases." SIGMOD Conference A Goal-driven Auto-Configuration Tool for the Distributed Workflow Management System Mentor-lite. 
Michael Gillmann,Jeanine Weißenfels,German Shegalov,Wolfgang Wonner,Gerhard Weikum 2000 A Goal-driven Auto-Configuration Tool for the Distributed Workflow Management System Mentor-lite. SIGMOD Conference Approximating Multi-Dimensional Aggregate Range Queries over Real Attributes. Dimitrios Gunopulos,George Kollios,Vassilis J. Tsotras,Carlotta Domeniconi 2000 Finding approximate answers to multi-dimensional range queries over real valued attributes has significant applications in data exploration and database query optimization. In this paper we consider the following problem: given a table of d attributes whose domain is the real numbers, and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. We present a new histogram technique that is designed to approximate the density of multi-dimensional datasets with real attributes. Our technique finds buckets of variable size, and allows the buckets to overlap. Overlapping buckets allow more efficient approximation of the density. The size of the cells is based on the local density of the data. This technique leads to a faster and more compact approximation of the data distribution. We also show how to generalize kernel density estimators, and how to apply them on the multi-dimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with existing techniques using real and synthetic datasets. SIGMOD Conference Mining Frequent Patterns without Candidate Generation. Jiawei Han,Jian Pei,Yiwen Yin 2000 Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods. SIGMOD Conference Index Research: Forest or Trees? (Panel Abstract). Joseph M. Hellerstein 2000 Index Research: Forest or Trees? (Panel Abstract). SIGMOD Conference Eddies: Continuously Adaptive Query Processing Ron Avnur,Joseph M. Hellerstein 2000 In large federated and shared-nothing databases, resources can exhibit widely fluctuating characteristics. 
Assumptions made at the time a query is submitted will rarely hold throughout the duration of query processing. As a result, traditional static query optimization and execution techniques are ineffective in these environments. In this paper we introduce a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs. We characterize the moments of symmetry during which pipelined joins can be easily reordered, and the synchronization barriers that require inputs from different sources to be coordinated. By combining eddies with appropriate join algorithms, we merge the optimization and execution phases of query processing, allowing each tuple to have a flexible ordering of the query operators. This flexibility is controlled by a combination of fluid dynamics and a simple learning algorithm. Our initial implementation demonstrates promising results, with eddies performing nearly as well as a static optimizer/executor in static scenarios, and providing dramatic improvements in dynamic execution environments. SIGMOD Conference DLFM: A Transactional Resource Manager. Hui-I Hsiao,Inderpal Narang 2000 "The DataLinks technology developed at IBM Almaden Research Center and now available in DB2 UDB 5.2 introduces a new data type called DATALINK for a database to reference and manage files stored external to the database. An external file is put under database control by “linking” the file to the database. Control of a file can also be removed by “unlinking” it. The technology provides transactional semantics with respect to linking or unlinking the file when a DATALINK value is stored or updated. Furthermore, it provides the following set of properties: (1) managing access control to linked files, (2) enforcing referential integrity, such that a referenced file cannot be deleted or renamed as long as it is referenced from the RDBMS, and (3) providing coordinated backup and recovery of RDBMS data with the file data. DataLinks File Manager (DLFM) is a key component of the DataLinks technology. DLFM is a sophisticated SQL application with a set of daemon processes residing at a file server node that work cooperatively with the host database server(s) to manage external files. To reduce the number of messages between the database server and DLFM, DLFM maintains a set of meta data on the file system and the files that are under database control. One of the major decisions we made was to build DLFM on top of an existing database manager, such as DB2, instead of implementing a proprietary persistent data store. We have mixed feelings about using the RDBMS to build such a resource manager. One of the major challenges is to support transactional semantics for DLFM operations. To do this, we implemented the two-phase commit protocol in DLFM and designed an innovative scheme to enable rolling back transaction updates after local database commit. Also, a major gotcha is that the RDBMS's cost-based optimizer generates the access plan, which does not take into account the locking costs of a concurrent workload. Using the RDBMS as a black box can cause “havoc” in terms of causing lock timeouts and deadlocks and reducing the throughput of a concurrent workload. To solve the problem, we came up with a simple but effective way of influencing the optimizer to generate access plans matching the needs of the DLFM implementation.
Also, several precautions had to be taken to ensure that lock escalation did not take place; next key locking was disabled to avoid deadlocks on heavily used indexes and SQL tables; and a timeout mechanism was applied to break global deadlocks. We were able to run a 100-client workload for 24 hours without many deadlock/timeout problems in system test. This paper describes the motivation for building the DLFM and the lessons that we have learned from this experience." SIGMOD Conference Image Mining in IRIS: Integrated Retinal Information System. Wynne Hsu,Mong-Li Lee,Kheng Guan Goh 2000 There is an increasing demand for systems that can automatically analyze images and extract semantically meaningful information. IRIS, an Integrated Retinal Information system, has been developed to provide medical professionals easy and unified access to the screening, trend and progression of diabetic-related eye diseases in a diabetic patient database. This paper shows how mining techniques can be used to accurately extract features in the retinal images. In particular, we apply a classification approach to determine the conditions for tortuosity in retinal blood vessels. SIGMOD Conference Self-Organizing Data Sharing Communities with SAGRES. Zachary G. Ives,Alon Y. Levy,Jayant Madhavan,Rachel Pottinger,Stefan Saroiu,Igor Tatarinov,Shiori Betzler,Qiong Chen,Ewa Jaslikowska,Jing Su,Wai Tak Theodora Yeung 2000 Self-Organizing Data Sharing Communities with SAGRES. SIGMOD Conference On Effective Multi-Dimensional Indexing for Strings. H. V. Jagadish,Nick Koudas,Divesh Srivastava 2000 As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching. While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally. SIGMOD Conference Concept Based Design of Data Warehouses: The DWQ Demonstrators. Matthias Jarke,Christoph Quix,Diego Calvanese,Maurizio Lenzerini,Enrico Franconi,Spyros Ligoudistianos,Panos Vassiliadis,Yannis Vassiliou 2000 The ESPRIT Project DWQ (Foundations of Data Warehouse Quality) aimed at improving the quality of DW design and operation through systematic enrichment of the semantic foundations of data warehousing.
Logic-based knowledge representation and reasoning techniques were developed to control accuracy, consistency, and completeness via advanced conceptual modeling techniques for source integration, data reconciliation, and multi-dimensional aggregation. This is complemented by quantitative optimization techniques for view materialization, optimizing timeliness and responsiveness without losing the semantic advantages from the conceptual approach. At the operational level, query rewriting and materialization refreshment algorithms exploit the knowledge developed at design time. The demonstration shows the interplay of these tools under a shared metadata repository, based on an example extracted from an application at Telecom Italia. SIGMOD Conference Anatomy of a Real E-Commerce System. Anant Jhingran 2000 "Today's E-Commerce systems are a complex assembly of databases, web servers, home grown glue code, and networking services for security and scalability. The trend is towards larger pieces of these coming together in bundled offerings from leading software vendors, and the networking/hardware being offered through service delivery companies. In this paper we examine the bundle by looking in detail at IBM's WebSphere, Commerce Edition, and its deployment at a major customer site." SIGMOD Conference Influence Sets Based on Reverse Nearest Neighbor Queries. Flip Korn,S. Muthukrishnan 2000 Inherent in the operation of many decision support and continuous referral systems is the notion of the “influence” of a data point on the database. This notion arises in examples such as finding the set of customers affected by the opening of a new store outlet location, notifying the subset of subscribers to a digital library who will find a newly added document most relevant, etc. Standard approaches to determining the influence set of a data point involve range searching and nearest neighbor queries. In this paper, we formalize a novel notion of influence based on reverse neighbor queries and its variants. Since the nearest neighbor relation is not symmetric, the set of points that are closest to a query point (i.e., the nearest neighbors) differs from the set of points that have the query point as their nearest neighbor (called the reverse nearest neighbors). Influence sets based on reverse nearest neighbor (RNN) queries seem to capture the intuitive notion of influence from our motivating examples. We present a general approach for solving RNN queries and an efficient R-tree based method for large data sets, based on this approach. Although the RNN query appears to be natural, it has not been studied previously. RNN queries are of independent interest, and as such should be part of the suite of available queries for processing spatial and multimedia data. In our experiments with real geographical data, the proposed method appears to scale logarithmically, whereas straightforward sequential scan scales linearly. Our experimental study also shows that approaches based on range searching or nearest neighbors are ineffective at finding influence sets of our interest. SIGMOD Conference Efficient Resumption of Interrupted Warehouse Loads. Wilburt Labio,Janet L. Wiener,Hector Garcia-Molina,Vlad Gorelik 2000 Data warehouses collect large quantities of data from distributed sources into a single repository. 
A typical load to create or maintain a warehouse processes GBs of data, takes hours or even days to execute, and involves many complex and user-defined transformations of the data (e.g., find duplicates, resolve data inconsistencies, and add unique keys). If the load fails, a possible approach is to “redo” the entire load. A better approach is to resume the incomplete load from where it was interrupted. Unfortunately, traditional algorithms for resuming the load either impose unacceptable overhead during normal operation, or rely on the specifics of transformations. We develop a resumption algorithm called DR that imposes no overhead and relies only on the high-level properties of the transformations. We show that DR can lead to a ten-fold reduction in resumption time by performing experiments using commercial software. SIGMOD Conference WebView Materialization. Alexandros Labrinidis,Nick Roussopoulos 2000 A WebView is a web page automatically created from base data typically stored in a DBMS. Given the multi-tiered architecture behind database-backed web servers, we have the option of materializing a WebView inside the DBMS, at the web server, or not at all, always computing it on the fly (virtual). Since WebViews must be up to date, materialized WebViews are immediately refreshed with every update on the base data. In this paper we compare the three materialization policies (materialized inside the DBMS, materialized at the web server and virtual) analytically, through a detailed cost model, and quantitatively, through extensive experiments on an implemented system. Our results indicate that materializing at the web server is a more scalable solution and can facilitate an order of magnitude more users than the virtual and materialized inside the DBMS policies, even under high update workloads. SIGMOD Conference On-line Reorganization in Object Databases. Mohana Krishna Lakhamraju,Rajeev Rastogi,S. Seshadri,S. Sudarshan 2000 Reorganization of objects in an object database is an important component of several operations like compaction, clustering, and schema evolution. The high availability requirements (24 × 7 operation) of certain application domains require reorganization to be performed on-line with minimal interference to concurrently executing transactions. In this paper, we address the problem of on-line reorganization in object databases, where a set of objects has to be migrated from one location to another. Specifically, we consider the case where objects in the database may contain physical references to other objects. Relocating an object in this case involves finding the set of objects (parents) that refer to it, and modifying the references in each parent. We propose an algorithm called the Incremental Reorganization Algorithm (IRA) that achieves the above task with minimal interference to concurrently executing transactions. The IRA algorithm holds locks on at most two distinct objects at any point of time. We have implemented IRA on Brahma, a storage manager developed at IIT Bombay, and conducted an extensive performance study. Our experiments reveal that IRA makes on-line reorganization feasible, with very little impact on the response times of concurrently executing transactions and on overall system throughput. We also describe how the IRA algorithm can handle system failures. SIGMOD Conference Towards Self-Tuning Data Placement in Parallel Database Systems.
Mong-Li Lee,Masaru Kitsuregawa,Beng Chin Ooi,Kian-Lee Tan,Anirban Mondal 2000 Parallel database systems are increasingly being deployed to support the performance demands of end-users. While declustering data across multiple nodes facilitates parallelism, initial data placement may not be optimal due to skewed workloads and changing access patterns. To prevent performance degradation, the placement of data must be reorganized, and this must be done on-line to minimize disruption to the system. In this paper, we consider a dynamic self-tuning approach to reorganization in a shared nothing system. We introduce a new index-based method that facilitates fast and efficient migration of data. Our solution incorporates a globally height-balanced structure and load tracking at different levels of granularity. We conducted an extensive performance study, and implemented the methods on the Fujitsu AP3000 machine. Both the simulation and empirical results demonstrate that our proposed method is indeed scalable and effective in correcting any deterioration in system throughput. SIGMOD Conference Maintenance of Automatic Summary Tables. Wolfgang Lehner,Richard Sidle,Hamid Pirahesh,Roberta Cochrane 2000 Maintenance of Automatic Summary Tables. SIGMOD Conference High Speed On-line Backup When Using Logical Log Operations. David B. Lomet 2000 Media recovery protects a database from failures of the stable medium by maintaining an extra copy of the database, called the backup, and a media recovery log. When a failure occurs, the database is “restored” from the backup, and the media recovery log is used to roll forward the database to the desired time, usually the current time. Backup must be both fast and “on-line”, i.e. concurrent with on-going update activity. Conventional online backup sequentially copies from the stable database, almost independent of the database cache manager, but requires page-oriented log operations. But results of logical operations must be flushed to a stable database (a backup is a stable database) in a constrained order to guarantee recovery. This order is not naturally achieved for the backup by a cache manager concerned only with crash recovery. We describe a “full speed” backup, only loosely coupled to the cache manager, and hence similar to current online backups, but effective for general logical log operations. This requires additional logging of cached objects to guarantee media recoverability. We then show how logging can be greatly reduced when log operations have a constrained form which nonetheless provides very useful additional logging efficiency for database systems. SIGMOD Conference SPIRE: A Progressive Content-Based Spatial Image Retrieval Engine. Chung-Sheng Li,Lawrence D. Bergman,Vittorio Castelli,John R. Smith 2000 SPIRE: A Progressive Content-Based Spatial Image Retrieval Engine. SIGMOD Conference XMILL: An Efficient Compressor for XML Data. Hartmut Liefke,Dan Suciu 2000 We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogeneous XML data: it uses zlib, the library function for gzip, a collection of datatype specific compressors for simple data types, and, possibly, user defined compressors for application specific data types.
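The XMill entry directly above rests on separating XML structure from content and grouping data values into containers (roughly, one per tag path), each compressed on its own with zlib. The following is a minimal sketch of that grouping idea using only Python's standard xml.etree and zlib modules; the function name and path scheme are hypothetical illustrations, not XMill's actual design or API.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict

def compress_by_container(xml_text: str) -> Dict[str, bytes]:
    # Group text values by their enclosing tag path ("container"), then compress
    # each container separately; values under the same tag tend to be similar,
    # so a generic codec like zlib does better than on the mixed document.
    containers = defaultdict(list)

    def walk(elem, path):
        here = path + "/" + elem.tag
        if elem.text and elem.text.strip():
            containers[here].append(elem.text.strip())
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    # XMill additionally encodes the element structure in its own stream; this
    # sketch keeps only the per-container value streams.
    return {p: zlib.compress("\n".join(vals).encode("utf-8"))
            for p, vals in containers.items()}

# Tiny usage example.
doc = ("<orders><order><price>10</price><city>Rome</city></order>"
       "<order><price>12</price><city>Rome</city></order></orders>")
for path, blob in compress_by_container(doc).items():
    print(path, len(blob), "compressed bytes")
```

Grouping values with similar statistics before handing them to a general-purpose codec is what allows compression ratios well beyond running gzip over the raw document, which is the effect the entry above reports.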
SIGMOD Conference LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes. Witold Litwin,Thomas J. E. Schwarz 2000 LH*RS is a new high-availability Scalable Distributed Data Structure (SDDS). The data storage scheme and the search performance of LH*RS are basically those of LH*. LH*RS manages in addition the parity information to tolerate the unavailability of k ≥ 1 server sites. The value of k scales with the file, to prevent the reliability decline. The parity calculus uses Reed-Solomon Codes. The storage and access performance overheads to provide the high availability are about the smallest possible. The scheme should prove attractive to data-intensive applications. SIGMOD Conference AQR-Toolkit: An Adaptive Query Routing Middleware for Distributed Data Intensive Systems. Ling Liu,Calton Pu,David Buttler,Wei Han,Henrique Paques,Wei Tang 2000 "Query routing is an intelligent service that can direct query requests to appropriate servers that are capable of answering the queries. The goal of a query routing system is to provide efficient associative access to a large, heterogeneous, distributed collection of information providers by routing a user query to the most relevant information sources that can provide the best answer. Effective query routing not only minimizes the query response time and the overall processing cost, but also eliminates a lot of unnecessary communication overhead over the global networks and over the individual information sources. The AQR-Toolkit divides the query routing task into two cooperating processes: query refinement and source selection. It is well known that a broadly defined query inevitably produces many false positives. Query refinement provides mechanisms to help the user formulate queries that will return more useful results and that can be processed efficiently. As a complementary process, source selection reduces false negatives by identifying and locating a set of relevant information providers from a large collection of available sources. By pruning irrelevant information sources, source selection also reduces the overhead of contacting the information servers that do not contribute to the answer of the query. The system architecture of AQR-Toolkit consists of a hierarchical network (a directed acyclic graph) with external information providers at the leaves and query routers as mediating nodes. The end-point information providers support query-based access to their documents. At a query router node, a user may browse and query the meta information about information providers registered at that query router or make use of the router's facilities for query refinement and source selection." SIGMOD Conference Homer: a Model-Based CASE Tool for Data-Intensive Web Sites. Paolo Merialdo,Paolo Atzeni,Marco Magnante,Giansalvatore Mecca,Marco Pecorone 2000 Homer: a Model-Based CASE Tool for Data-Intensive Web Sites. SIGMOD Conference Adaptive Multi-Stage Distance Join Processing. Hyoseop Shin,Bongki Moon,Sukho Lee 2000 A spatial distance join is a relatively new type of operation introduced for spatial and multimedia database applications. Additional requirements for ranking and stopping cardinality are often combined with the spatial distance join in on-line query processing or internet search environments. These requirements pose new challenges as well as opportunities for more efficient processing of spatial distance join queries.
In this paper, we first present an efficient k-distance join algorithm that uses spatial indexes such as R-trees. Bi-directional node expansion and plane-sweeping techniques are used for fast pruning of distant pairs, and the plane-sweeping is further optimized by novel strategies for selecting a sweeping axis and direction. Furthermore, we propose adaptive multi-stage algorithms for k-distance join and incremental distance join operations. Our performance study shows that the proposed adaptive multi-stage algorithms outperform previous work by up to an order of magnitude for both k-distance join and incremental distance join queries, under various operational conditions. SIGMOD Conference "Application Architecture: 2Tier or 3Tier? What is DBMS's Role? (Panel Abstract)." Anil Nori 2000 "Application Architecture: 2Tier or 3Tier? What is DBMS's Role? (Panel Abstract)." SIGMOD Conference Efficient and Cost-effective Techniques for Browsing and Indexing Large Video Databases. Jung-Hwan Oh,Kien A. Hua 2000 We present in this paper a fully automatic content-based approach to organizing and indexing video data. Our methodology involves three steps: Step 1: We segment each video into shots using a Camera-Tracking technique. This process also extracts the feature vector for each shot, which consists of two statistical variances VarBA and VarOA. These values capture how much things are changing in the background and foreground areas of the video shot. Step 2: For each video, we apply a fully automatic method to build a browsing hierarchy using the shots identified in Step 1. Step 3: Using the VarBA and VarOA values obtained in Step 1, we build an index table to support a variance-based video similarity model. That is, video scenes/shots are retrieved based on given values of VarBA and VarOA. The above three inter-related techniques offer an integrated framework for modeling, browsing, and searching large video databases. Our experimental results indicate that they have many advantages over existing methods. SIGMOD Conference SQLEM: Fast Clustering in SQL using the EM Algorithm. Carlos Ordonez,Paul Cereghini 2000 Clustering is one of the most important tasks performed in Data Mining applications. This paper presents an efficient SQL implementation of the EM algorithm to perform clustering in very large databases. Our version can effectively handle high dimensional data, a high number of clusters and more importantly, a very large number of data records. We present three strategies to implement EM in SQL: horizontal, vertical and a hybrid one. We expect this work to be useful for data mining programmers and users who want to cluster large data sets inside a relational DBMS. SIGMOD Conference DISIMA: A Distributed and Interoperable Image Database System. Vincent Oria,M. Tamer Özsu,Paul Iglinski,Shu Lin,Benjamin Bin Yao 2000 DISIMA: A Distributed and Interoperable Image Database System. SIGMOD Conference Density Biased Sampling: An Improved Method for Data Mining and Clustering. Christopher R. Palmer,Christos Faloutsos 2000 "Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and also frequently criticized because it will miss small clusters. Many natural phenomena are known to follow Zipf's distribution and the inability of uniform sampling to find small clusters is of practical concern.
Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions, finding up to a factor of six improvement over uniform sampling." SIGMOD Conference Online Index Rebuild. Nagavamsi Ponnekanti,Hanuma Kodavalla 2000 In this paper we present an efficient method to do online rebuild of a B+-tree index. This method has been implemented in Sybase Adaptive Server Enterprise (ASE) Version 12.0. It provides high concurrency, does minimal amount of logging, has good performance and does not deadlock with other index operations. It copies the index rows to newly allocated pages in the key order so that good space utilization and clustering are achieved. The old pages are deallocated during the process. Our algorithm differs from the previously published online index rebuild algorithms in two ways. It rebuilds multiple leaf pages and then propagates the changes to higher levels. Also, while propagating the leaf level changes to higher levels, level 1 pages are reorganized, eliminating the need for a separate pass. Our performance study shows that our approach results in significant reduction in logging and CPU time. Also, our approach uses the same concurrency control mechanism as split and shrink operations, which made it attractive for implementation. SIGMOD Conference A Chase Too Far? Lucian Popa,Alin Deutsch,Arnaud Sahuguet,Val Tannen 2000 In a previous paper we proposed a novel method for generating alternative query plans that uses chasing (and back-chasing) with logical constraints. The method brings together use of indexes, use of materialized views, semantic optimization and join elimination (minimization). Each of these techniques is known separately to be beneficial to query optimization. The novelty of our approach is in allowing these techniques to interact systematically, e.g. non-trivial use of indexes and materialized views may be enabled only by semantic constraints. We have implemented our method for a variety of schemas and queries. We examine how far we can push the method in terms of complexity of both schemas and queries. We propose a technique for reducing the size of the search space by “stratifying” the sets of constraints used in the (back)chase. The experimental results demonstrate that our method is practical (i.e., feasible and worthwhile). SIGMOD Conference From Browsing to Interacting: DBMS Support for Responsive Websites (Abstract). Raghu Ramakrishnan 2000 From Browsing to Interacting: DBMS Support for Responsive Websites (Abstract). SIGMOD Conference Towards Data Mining Benchmarking: A Testbed for Performance Study of Frequent Pattern Mining. Jian Pei,Runying Mao,Kan Hu,Hua Zhu 2000 Towards Data Mining Benchmarking: A Testbed for Performance Study of Frequent Pattern Mining. SIGMOD Conference Efficient Algorithms for Mining Outliers from Large Data Sets. Sridhar Ramaswamy,Rajeev Rastogi,Kyuseok Shim 2000 In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor.
We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality. SIGMOD Conference Making B-Trees Cache Conscious in Main Memory. Jun Rao,Kenneth A. Ross 2000 Previous research has shown that cache behavior is important for main memory index structures. Cache conscious index structures such as Cache Sensitive Search Trees (CSS-Trees) perform lookups much faster than binary search and T-Trees. However, CSS-Trees are designed for decision support workloads with relatively static data. Although B+-Trees are more cache conscious than binary search and T-Trees, their utilization of a cache line is low since half of the space is used to store child pointers. Nevertheless, for applications that require incremental updates, traditional B+-Trees perform well. Our goal is to make B+-Trees as cache conscious as CSS-Trees without increasing their update cost too much. We propose a new indexing technique called “Cache Sensitive B+-Trees” (CSB+-Trees). It is a variant of B+-Trees that stores all the child nodes of any given node contiguously, and keeps only the address of the first child in each node. The rest of the children can be found by adding an offset to that address. Since only one child pointer is stored explicitly, the utilization of a cache line is high. CSB+-Trees support incremental updates in a way similar to B+-Trees. We also introduce two variants of CSB+-Trees. Segmented CSB+-Trees divide the child nodes into segments. Nodes within the same segment are stored contiguously and only pointers to the beginning of each segment are stored explicitly in each node. Segmented CSB+-Trees can reduce the copying cost when there is a split since only one segment needs to be moved. Full CSB+-Trees preallocate space for the full node group and thus reduce the split cost. Our performance studies show that CSB+-Trees are useful for a wide range of applications. SIGMOD Conference The MLPQ/GIS Constraint Database System. Peter Z. Revesz,Rui Chen,Pradip Kanjamala,Yiming Li,Yuguo Liu,Yonghui Wang 2000 MLPQ/GIS [4,6] is a constraint database [5] system like CCUBE [1] and DEDALE [3] but with a special emphasis on spatio-temporal data. Features include data entry tools (first four icons in Fig. 1), icon-based queries such as Intersection, Union, Area, Buffer, Max, and Min, which optimize linear objective functions, and an icon for Datalog queries. For example, in Fig. 1 we loaded and displayed a constraint database that represents the midwest United States and loaded two constraint relations describing the movements of two persons.
The query icon opened a dialog box into which we entered the query that finds (t, i) pairs such that the two people are in the same state i at the same time t. MLPQ/GIS can animate [2] spatio-temporal objects that are linear constraint relations over x, y, and t. Users can also display in discrete color zones (isometric maps) any spatially distributed variable z that is a linear function of x and y. For example, Fig. 2 shows the mean annual air temperature in Nebraska. Animation and isometric map display can be combined. SIGMOD Conference Data Mining on an OLTP System (Nearly) for Free. Erik Riedel,Christos Faloutsos,Gregory R. Ganger,David Nagle 2000 This paper proposes a scheme for scheduling disk requests that takes advantage of the ability of high-level functions to operate directly at individual disk drives. We show that such a scheme makes it possible to support a Data Mining workload on an OLTP system almost for free: there is only a small impact on the throughput and response time of the existing workload. Specifically, we show that an OLTP system has the disk resources to consistently provide one third of its sequential bandwidth to a background Data Mining task with close to zero impact on OLTP throughput and response time at high transaction loads. At low transaction loads, we show much lower impact than observed in previous work. This means that a production OLTP system can be used for Data Mining tasks without the expense of a second dedicated system. Our scheme takes advantage of close interaction with the on-disk scheduler by reading blocks for the Data Mining workload as the disk head “passes over” them while satisfying demand blocks from the OLTP request stream. We show that this scheme provides a consistent level of throughput for the background workload even at very high foreground loads. Such a scheme is of most benefit in combination with an Active Disk environment that allows the background Data Mining application to also take advantage of the processing power and memory available directly on the disk drives. SIGMOD Conference How To Roll a Join: Asynchronous Incremental View Maintenance. Kenneth Salem,Kevin S. Beyer,Roberta Cochrane,Bruce G. Lindsay 2000 How To Roll a Join: Asynchronous Incremental View Maintenance. SIGMOD Conference MOCHA: A Database Middleware System Featuring Automatic Deployment of Application-Specific Functionality. Manuel Rodriguez-Martinez,Nick Roussopoulos,John M. McGann,Stephen Kelley,Vadim Katz,Zhexuan Song,Joseph JáJá 2000 MOCHA: A Database Middleware System Featuring Automatic Deployment of Application-Specific Functionality. SIGMOD Conference MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources. Manuel Rodriguez-Martinez,Nick Roussopoulos 2000 "We present MOCHA, a new self-extensible database middleware system designed to interconnect distributed data sources. MOCHA is designed to scale to large environments and is based on the idea that some of the user-defined functionality in the system should be deployed by the middleware system itself. This is realized by shipping Java code implementing either advanced data types or tailored query operators to remote data sources and have it executed remotely. Optimized query plans push the evaluation of powerful data-reducing operators to the data source sites while executing data-inflating operators near the client's site.
The Volume Reduction Factor is a new and more explicit metric introduced in this paper to select the best site to execute query operators and is shown to be more accurate than the standard selectivity factor alone. MOCHA has been implemented in Java and runs on top of Informix and Oracle. We present the architecture of MOCHA, the ideas behind it, and a performance study using scientific data and queries. The results of this study demonstrate that MOCHA provides a more flexible, scalable and efficient framework for distributed query processing compared to those in existing middleware solutions." SIGMOD Conference Indexing the Positions of Continuously Moving Objects. Simonas Saltenis,Christian S. Jensen,Scott T. Leutenegger,Mario A. Lopez 2000 The coming years will witness dramatic advances in wireless communications as well as positioning technologies. As a result, tracking the changing positions of objects capable of continuous movement is becoming increasingly feasible and necessary. The present paper proposes a novel, R*-tree based indexing technique that supports the efficient querying of the current and projected future positions of such moving objects. The technique is capable of indexing objects moving in one-, two-, and three-dimensional space. Update algorithms enable the index to accommodate a dynamic data set, where objects may appear and disappear, and where changes occur in the anticipated positions of existing objects. A comprehensive performance study is reported. SIGMOD Conference Expressing Business Rules. Ronald G. Ross 2000 Point-and-Click Expression Builders, for instance limits and type consistency. Structured English, for more complex restrictions and logical inferences. Entity Life History or State Transition Diagrams, for both basic and more advanced state transition rules. Data Model or Class Model extensions, for basic property rules. No matter how the rules are captured, there should be a single, unified conceptual representation “inside” of the man-machine boundary. “Inside” here means transparent to the specifiers, but visible to analysis tools (e.g., for conflict analysis) and to rule engines or business logic servers (for run-time processing). Inside, there may still be other representations. For processing and performance reasons, there might be many physical representations of the rules, optimized for particular tools or hardware/software environments. The result is actually three layers of representation: external, conceptual, and internal. This is strongly reminiscent of the old ANSI/SPARC three-schema architecture for data. This should not be surprising since rules simply build on terms and facts, which can be ultimately represented by data. Where is this research now? A new, more concise representation scheme is under development. One focus of this scheme is a formal expression of how non-atomic rule types are derived from atomic ones. This would allow reduction of rules to a common base of fundamental rule types, in order to support automatic analysis of conflict and overlap in systematic fashion. This is opening exciting new avenues of research, and significant opportunities for those interested in getting involved. SIGMOD Conference i: Intelligent, Interactive Investigation of OLAP data cubes. Sunita Sarawagi,Gayatri Sathe 2000 "The goal of the i3(eye cube) project is to enhance multidimensional database products with a suite of advanced operators to automate data analysis tasks that are currently handled through manual exploration.
Most OLAP products are rather simplistic and rely heavily on the user's intuition to manually drive the discovery process. Such ad hoc user-driven exploration gets tedious and error-prone as data dimensionality and size increase. We first investigated how and why analysts currently explore the data cube and then automated these explorations using advanced operators that can be invoked interactively like existing simple operators. Our proposed suite of extensions appears in the form of a toolkit attached to an OLAP product. At this demo we will present three such operators: DIFF, RELAX and INFORM, with illustrations from real-life datasets." SIGMOD Conference Efficient and Extensible Algorithms for Multi Query Optimization. Prasan Roy,S. Seshadri,S. Sudarshan,Siddhesh Bhobe 2000 Complex queries are becoming commonplace, with the growing use of decision support systems. These complex queries often have a lot of common sub-expressions, either within a single query, or across multiple such queries run as a batch. Multi-query optimization aims at exploiting common sub-expressions to reduce evaluation cost. Multi-query optimization has hitherto been viewed as impractical, since earlier algorithms were exhaustive, and explored a doubly exponential search space. In this paper we demonstrate that multi-query optimization using heuristics is practical, and provides significant benefits. We propose three cost-based heuristic algorithms: Volcano-SH and Volcano-RU, which are based on simple modifications to the Volcano search strategy, and a greedy heuristic. Our greedy heuristic incorporates novel optimizations that improve efficiency greatly. Our algorithms are designed to be easily added to existing optimizers. We present a performance study comparing the algorithms, using workloads consisting of queries from the TPC-D benchmark. The study shows that our algorithms provide significant benefits over traditional optimization, at a very acceptable overhead in optimization time. SIGMOD Conference SERFing the Web: Web Site Management Made Easy. Elke A. Rundensteiner,Kajal T. Claypool,Li Chen,Hong Su,Keiji Oenoki 2000 SERFing the Web: Web Site Management Made Easy. SIGMOD Conference Turbo-charging Vertical Mining of Large Databases. Pradeep Shenoy,Jayant R. Haritsa,S. Sudarshan,Gaurav Bhalotia,Mayank Bawa,Devavrat Shah 2000 In a vertical representation of a market-basket database, each item is associated with a column of values representing the transactions in which it is present. The association-rule mining algorithms that have been recently proposed for this representation show performance improvements over their classical horizontal counterparts, but are either efficient only for certain database sizes, or assume particular characteristics of the database contents, or are applicable only to specific kinds of database schemas. We present here a new vertical mining algorithm called VIPER, which is general-purpose, making no special requirements of the underlying database. VIPER stores data in compressed bit-vectors called “snakes” and integrates a number of novel optimizations for efficient snake generation, intersection, counting and storage. We analyze the performance of VIPER for a range of synthetic database workloads. Our experimental results indicate significant performance gains, especially for large databases, over previously proposed vertical and horizontal mining algorithms. In fact, there are even workload regions where VIPER outperforms an optimal, but practically infeasible, horizontal mining algorithm.
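The VIPER abstract above describes the vertical layout in which each item carries the column of transactions containing it, and supports of larger itemsets are obtained by combining those columns; VIPER compresses the columns into bit-vector “snakes”. A minimal sketch of the underlying vertical-mining idea, using plain Python sets of transaction ids in place of compressed bit-vectors (an Eclat-style toy illustration, not the VIPER algorithm itself):

    from itertools import combinations

    def vertical_mine(transactions, min_support):
        """Minimal vertical itemset mining: store, per item, the set of
        transaction ids containing it, and obtain the support of larger
        itemsets by intersecting those sets."""
        tidsets = {}
        for tid, items in enumerate(transactions):
            for item in items:
                tidsets.setdefault(item, set()).add(tid)

        frequent = {(item,): tids for item, tids in tidsets.items()
                    if len(tids) >= min_support}
        level = frequent
        while level:
            next_level = {}
            for (a, ta), (b, tb) in combinations(sorted(level.items()), 2):
                if a[:-1] == b[:-1]:              # join itemsets sharing a prefix
                    candidate, tids = a + (b[-1],), ta & tb
                    if len(tids) >= min_support:
                        next_level[candidate] = tids
            frequent.update(next_level)
            level = next_level
        return {itemset: len(tids) for itemset, tids in frequent.items()}

    data = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
    print(vertical_mine(data, min_support=2))

The set intersection stands in for the snake intersection and counting steps that the paper optimizes; the point of the vertical layout is that support counting never rescans whole transactions.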
SIGMOD Conference Tutorial: LDAP Directory Services - Just Another Database Application? Shridhar Shukla,Anand Deshpande 2000 Tutorial: LDAP Directory Services - Just Another Database Application? SIGMOD Conference Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey. Alexander S. Szalay,Peter Z. Kunszt,Ani Thakar,Jim Gray,Donald R. Slutz,Robert J. Brunner 2000 The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays, through ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS) is creating a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes in a 100+ dimensional space. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial and attribute indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes will accelerate frequent searches. Splitting the data among multiple servers will allow parallel, scalable I/O and parallel data analysis. Hashing techniques will allow efficient clustering, and pair-wise comparison algorithms that should parallelize nicely. Randomly sampled subsets will allow debugging otherwise large queries at the desktop. Central servers will operate a data pump to support sweep searches touching most of the data. The anticipated queries will require special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges. SIGMOD Conference Counting, Enumerating, and Sampling of Execution Plans in a Cost-Based Query Optimizer. Florian Waas,César A. Galindo-Legaria 2000 "Testing an SQL database system by running large sets of deterministic or stochastic SQL statements is common practice in commercial database development. However, code defects often remain undetected as the query optimizer's choice of an execution plan depends not only on the query but is also strongly influenced by a large number of parameters describing the database and the hardware environment. Modifying these parameters in order to steer the optimizer to select other plans is difficult since this means anticipating often complex search strategies implemented in the optimizer. In this paper we devise algorithms for counting, exhaustive generation, and uniform sampling of plans from the complete search space. Our techniques allow extensive validation of both generation of alternatives, and execution algorithms with plans other than the optimized one: if two candidate plans fail to produce the same results, then either the optimizer considered an invalid plan, or the execution code is faulty. When the space of alternatives becomes too large for exhaustive testing, which can occur even with a handful of joins, uniform random sampling provides a mechanism for unbiased testing. The technique is implemented in Microsoft's SQL Server, where it is an integral part of the validation and testing process." SIGMOD Conference An Approximate Search Engine for Structural Databases. Jason Tsong-Li Wang,Xiong Wang,Dennis Shasha,Bruce A.
Shapiro,Kaizhong Zhang,Xinhuan Zheng,Qicheng Ma,Zasha Weinberg 2000 An Approximate Search Engine for Structural Databases. SIGMOD Conference Benchmarking Queries over Trees: Learning the Hard Truth the Hard Way. Fanny Wattez,Sophie Cluet,Véronique Benzaken,Guy Ferran,Christian Fiegel 2000 Benchmarking Queries over Trees: Learning the Hard Truth the Hard Way. SIGMOD Conference Handling Very Large Databases with Informix Extended Parallel Server. Andreas Weininger 2000 In this paper, we investigate which problems exist in very large real databases and describe which mechanisms are provided by Informix Extended Parallel Server (XPS) for dealing with these problems. Currently the largest customer XPS database contains 27 TB of data. A database server that has to handle such an amount of data must provide mechanisms for achieving adequate performance and easing usability. We will present mechanisms which address both of these issues and illustrate them with examples from real customer systems. SIGMOD Conference TIP: A Temporal Extension to Informix. Jun Yang,Huacheng C. Ying,Jennifer Widom 2000 Commercial relational database systems today provide only limited temporal support. To address the needs of applications requiring rich temporal data and queries, we have built TIP (Temporal Information Processor), a temporal extension to the Informix database system based on its DataBlade technology. Our TIP DataBlade extends Informix with a rich set of datatypes and routines that facilitate temporal modeling and querying. TIP provides both C and Java libraries for client applications to access a TIP-enabled database, and provides end-users with a GUI interface for querying and browsing temporal data. SIGMOD Conference Answering Complex SQL Queries Using Automatic Summary Tables. Markos Zaharioudakis,Roberta Cochrane,George Lapis,Hamid Pirahesh,Monica Urata 2000 We investigate the problem of using materialized views to answer SQL queries. We focus on modern decision-support queries, which involve joins, arithmetic operations and other (possibly user-defined) functions, aggregation (often along multiple dimensions), and nested subqueries. Given the complexity of such queries, the vast amounts of data upon which they operate, and the requirement for interactive response times, the use of materialized views (MVs) of similar complexity is often mandatory for acceptable performance. We present a novel algorithm that is able to rewrite a user query so that it will access one or more of the available MVs instead of the base tables. The algorithm extends prior work by addressing the new sources of complexity mentioned above, that is, complex expressions, multidimensional aggregation, and nested subqueries. It does so by relying on a graphical representation of queries and a bottom-up, pair-wise matching of nodes from the query and MV graphs. This approach offers great modularity and extensibility, allowing for the rewriting of a large class of queries. SIGMOD Conference Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA. Weidong Chen,Jeffrey F. Naughton,Philip A. Bernstein 2000 Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA. VLDB Automated Selection of Materialized Views and Indexes in SQL Databases. Sanjay Agrawal,Surajit Chaudhuri,Vivek R. Narasayya 2000 Automated Selection of Materialized Views and Indexes in SQL Databases.
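The Zaharioudakis et al. abstract a few entries above concerns rewriting aggregate queries to read materialized views instead of base tables. A tiny worked example of the re-aggregation idea that makes such rewrites valid, assuming a hypothetical summary table of sales pre-aggregated per store and day; because SUM is distributive, a query grouped by the coarser (store, month) can be answered from the daily summary without touching the base table:

    from collections import defaultdict

    # Hypothetical automatic summary table: sales pre-aggregated per (store, day).
    daily_summary = [
        ("s1", "2000-05-01", 120.0), ("s1", "2000-05-02", 80.0),
        ("s1", "2000-06-01", 200.0), ("s2", "2000-05-01", 50.0),
    ]

    def monthly_sales_from_summary(summary):
        """Answer 'total sales per (store, month)' by re-aggregating the
        daily summary: summing the pre-computed daily sums is valid
        because SUM is distributive and the grouping is strictly coarser."""
        totals = defaultdict(float)
        for store, day, total in summary:
            month = day[:7]                  # roll day up to month
            totals[(store, month)] += total
        return dict(totals)

    print(monthly_sales_from_summary(daily_summary))
    # {('s1', '2000-05'): 200.0, ('s1', '2000-06'): 200.0, ('s2', '2000-05'): 50.0}

The paper's algorithm automates checks of this kind (and far more involved ones with expressions and nested subqueries) when deciding whether an MV can replace the base tables in a rewritten plan.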
VLDB Push Technology Personalization through Event Correlation. Asaf Adi,David Botzer,Opher Etzion,Tali Yatzkar-Haham 2000 Push Technology Personalization through Event Correlation. VLDB Efficient Filtering of XML Documents for Selective Dissemination of Information. Mehmet Altinel,Michael J. Franklin 2000 Efficient Filtering of XML Documents for Selective Dissemination of Information. VLDB Hypothetical Queries in an OLAP Environment. Andrey Balmin,Thanos Papadimitriou,Yannis Papakonstantinou 2000 Hypothetical Queries in an OLAP Environment. VLDB A Database Platform for Bioinformatics. Sandeepan Banerjee 2000 A Database Platform for Bioinformatics. VLDB Approximating Aggregate Queries about Web Pages via Random Walks. Ziv Bar-Yossef,Alexander C. Berg,Steve Chien,Jittat Fakcharoenphol,Dror Weitz 2000 Approximating Aggregate Queries about Web Pages via Random Walks. VLDB Building Scalable Internet Applications with Oracle8i Server. Julie Basu,José Alberto Fernández,Olga Peschansky 2000 Building Scalable Internet Applications with Oracle8i Server. VLDB Panel: Is Generic Metadata Management Feasible? Philip A. Bernstein,Laura M. Haas,Matthias Jarke,Erhard Rahm,Gio Wiederhold 2000 Panel: Is Generic Metadata Management Feasible? VLDB Information Integration: The MOMIS Project Demonstration. Domenico Beneventano,Sonia Bergamaschi,Silvana Castano,Alberto Corni,R. Guidetti,G. Malvezzi,Michele Melchiori,Maurizio Vincini 2000 Information Integration: The MOMIS Project Demonstration. VLDB Media360 Workflow-Implementing a Workflow Engine Inside a Database. Carsten Blecken 2000 Media360 Workflow-Implementing a Workflow Engine Inside a Database. VLDB "Toto, We're Not in Kansas Anymore: On Transitioning from Research to the Real (Invited Industrial Talk)." Michael J. Carey 2000 "Toto, We're Not in Kansas Anymore: On Transitioning from Research to the Real (Invited Industrial Talk)." VLDB XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents. Michael J. Carey,Jerry Kiernan,Jayavel Shanmugasundaram,Eugene J. Shekita,Subbu N. Subramanian 2000 XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents. VLDB Work and Information Practices in the Sciences of Biodiversity. Geoffrey C. Bowker 2000 Work and Information Practices in the Sciences of Biodiversity. VLDB Decision Tables: Scalable Classification Exploring RDBMS Capabilities. Hongjun Lu,Hongyan Liu 2000 Decision Tables: Scalable Classification Exploring RDBMS Capabilities. VLDB "Telcordia's Database Reconciliation and Data Quality Analysis Tool." Francesco Caruso,Munir Cochinwala,Uma Ganapathy,Gail Lalk,Paolo Missier 2000 "Telcordia's Database Reconciliation and Data Quality Analysis Tool." VLDB Process Automation as the Foundation for E-Business. Fabio Casati,Ming-Chien Shan 2000 Process Automation as the Foundation for E-Business. VLDB Panel: Future Directions of Database Research - The VLDB Broadening Strategy, Part 2. Michael L. Brodie 2000 Panel: Future Directions of Database Research - The VLDB Broadening Strategy, Part 2. VLDB Practical Applications of Triggers and Constraints: Success and Lingering Issues (10-Year Award). Stefano Ceri,Roberta Cochrane,Jennifer Widom 2000 Practical Applications of Triggers and Constraints: Success and Lingering Issues (10-Year Award). VLDB Approximate Query Processing Using Wavelets. Kaushik Chakrabarti,Minos N. 
Garofalakis,Rajeev Rastogi,Kyuseok Shim 2000 "Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on the wavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can do the bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as real-life data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques: (1) provide approximate answers of better quality than either sampling or histograms; (2) offer query execution-time speedups of more than two orders of magnitude; and (3) guarantee extremely fast synopsis construction times that scale linearly with the size of the data." VLDB Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. Kaushik Chakrabarti,Sharad Mehrotra 2000 Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails. Soumen Chakrabarti,Sandeep Srivastava,Mallela Subramanyam,Mitul Tiwari 2000 Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails. VLDB The Evolution of the Web and Implications for an Incremental Crawler. Junghoo Cho,Hector Garcia-Molina 2000 The Evolution of the Web and Implications for an Incremental Crawler. VLDB Approximate Query Translation Across Heterogeneous Information Sources. Kevin Chen-Chuan Chang,Hector Garcia-Molina 2000 Approximate Query Translation Across Heterogeneous Information Sources. VLDB Design and Implementation of a Genetic-Based Algorithm for Data Mining. Sunil Choenni 2000 Design and Implementation of a Genetic-Based Algorithm for Data Mining. VLDB Ordering Information, Conference Organizers, Program Committees, Additional Reviewers, Additional Demonstrations Reviewers, Sponsors, VLDB Endowment, Preface, Foreword. 2000 Ordering Information, Conference Organizers, Program Committees, Additional Reviewers, Additional Demonstrations Reviewers, Sponsors, VLDB Endowment, Preface, Foreword. VLDB Author Index. 2000 The paper describes a queuing network model for a multiprocessor system running a static Web workload such as SPECweb96. 
The model includes architectural details of the Web server in terms of multilevel cache hierarchy, processor bus, memory pipeline, PCI bus based I/O subsystem, and bypass I/O-memory path for DMA transfers. The model is based on detailed measurements from a baseline system and a few of its variants. The model operates at the Web transaction level, and does not explicitly model the CPU core or the caching hierarchy. Yet, the model predicts the performance impact of low level features such as number of processors, processor speeds, cache sizes and latencies, memory latencies, higher level caches, sector prefetching, etc. The model shows an excellent match with measured results. Because of many features that are difficult to handle analytically, the default solution technique is simulation. However, the paper also proposes a simple hybrid approach that can significantly speed up the solution without affecting the accuracy appreciably. The model has also been extended to handle clusters of symmetric multiprocessor systems with both centralized and distributed memories. VLDB PicoDBMS: Scaling Down Database Techniques for the Smartcard. Christophe Bobineau,Luc Bouganim,Philippe Pucheral,Patrick Valduriez 2000 PicoDBMS: Scaling Down Database Techniques for the Smartcard. VLDB Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. Surajit Chaudhuri,Gerhard Weikum 2000 Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. VLDB Biodiversity Informatics Infrastructure: An Information Commons for the Biodiversity Community. Gladys A. Cotter,Barbara T. Bauldock 2000 Biodiversity Informatics Infrastructure: An Information Commons for the Biodiversity Community. VLDB Temporal Integrity Constraints with Indeterminacy. Wes Cowley,Dimitris Plexousakis 2000 Temporal Integrity Constraints with Indeterminacy. VLDB Linking Business to Deliver Value: A Data Management Challenge. Anand Deshpande 2000 Linking Business to Deliver Value: A Data Management Challenge. VLDB Toward Learning Based Web Query Processing. Yanlei Diao,Hongjun Lu,Songting Chen,Zengping Tian 2000 Toward Learning Based Web Query Processing. VLDB Focused Crawling Using Context Graphs. Michelangelo Diligenti,Frans Coetzee,Steve Lawrence,C. Lee Giles,Marco Gori 2000 Focused Crawling Using Context Graphs. VLDB Computing Geographical Scopes of Web Resources. Junyan Ding,Luis Gravano,Narayanan Shivakumar 2000 Computing Geographical Scopes of Web Resources. VLDB Demonstration: Enabling Scalable Online Personalization on the Web. Kaushik Dutta,Anindya Datta,Debra E. VanderMeer,Krithi Ramamritham,Helen M. Thomas 2000 Demonstration: Enabling Scalable Online Personalization on the Web. VLDB A 20/20 Vision of the VLDB-2020? S. Misbah Deen,Anant Jhingran,Shamkant B. Navathe,Erich J. Neuhold,Gio Wiederhold 2000 A 20/20 Vision of the VLDB-2020? VLDB Fast Time Sequence Indexing for Arbitrary Lp Norms. Byoung-Kee Yi,Christos Faloutsos 2000 Fast Time Sequence Indexing for Arbitrary Lp Norms. VLDB Efficient Numerical Error Bounding for Replicated Network Services. Haifeng Yu,Amin Vahdat 2000 Efficient Numerical Error Bounding for Replicated Network Services. VLDB Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches.
Jonathan Goldstein,Raghu Ramakrishnan 2000 Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches. VLDB ICICLES: Self-Tuning Samples for Approximate Query Answering. Venkatesh Ganti,Mong-Li Lee,Raghu Ramakrishnan 2000 ICICLES: Self-Tuning Samples for Approximate Query Answering. VLDB Manipulating Interpolated Data is Easier than You Thought. Stéphane Grumbach,Philippe Rigaux,Luc Segoufin 2000 Manipulating Interpolated Data is Easier than You Thought. VLDB OLAP++: Powerful and Easy-to-Use Federations of OLAP and Object Databases. Junmin Gu,Torben Bach Pedersen,Arie Shoshani 2000 OLAP++: Powerful and Easy-to-Use Federations of OLAP and Object Databases. VLDB Optimizing Multi-Feature Queries for Image Databases. Ulrich Güntzer,Wolf-Tilo Balke,Werner Kießling 2000 Optimizing Multi-Feature Queries for Image Databases. VLDB Rainbow: Distributed Database System for Classroom Education and Experimental Research. Abdelsalam Helal,Hua Li 2000 Rainbow: Distributed Database System for Classroom Education and Experimental Research. VLDB What Is the Nearest Neighbor in High Dimensional Spaces? Alexander Hinneburg,Charu C. Aggarwal,Daniel A. Keim 2000 What Is the Nearest Neighbor in High Dimensional Spaces? VLDB An Ultra Highly Available DBMS. Svein-Olaf Hvasshovd,Svein Erik Bratsberg,Øystein Torbjørnsen 2000 An Ultra Highly Available DBMS. VLDB The Challenge of Process Data Warehousing. Matthias Jarke,Thomas List,Jörg Köller 2000 The Challenge of Process Data Warehousing. VLDB The BT-tree: A Branched and Temporal Access Method. Linan Jiang,Betty Salzberg,David B. Lomet,Manuel Barrena García 2000 The BT-tree: A Branched and Temporal Access Method. VLDB Optimizing Queries on Compressed Bitmaps. Sihem Amer-Yahia,Theodore Johnson 2000 Optimizing Queries on Compressed Bitmaps. VLDB "Don't Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication." Bettina Kemme,Gustavo Alonso 2000 "Don't Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication." VLDB Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. Piotr Indyk,Nick Koudas,S. Muthukrishnan 2000 Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB Managing Intervals Efficiently in Object-Relational Databases. Hans-Peter Kriegel,Marco Pötke,Thomas Seidl 2000 Managing Intervals Efficiently in Object-Relational Databases. VLDB Performance Issues in Incremental Warehouse Maintenance. Wilburt Labio,Jun Yang,Yingwei Cui,Hector Garcia-Molina,Jennifer Widom 2000 Performance Issues in Incremental Warehouse Maintenance. VLDB The Zero Latency Enterprise. Dave Liles 2000 The Zero Latency Enterprise. VLDB The 3W Model and Algebra for Unified Data Mining. Theodore Johnson,Laks V. S. Lakshmanan,Raymond T. Ng 2000 The 3W Model and Algebra for Unified Data Mining. VLDB Biodiversity Informatics: The Challenge of Rapid Development, Large Databases, and Complex Data (Keynote). Meridith A. Lane,James L. Edwards,Ebbe Nielsen 2000 Biodiversity Informatics: The Challenge of Rapid Development, Large Databases, and Complex Data (Keynote). VLDB Hierarchical Compact Cube for Range-Max Queries. Sin Yeung Lee,Tok Wang Ling,Hua-Gang Li 2000 Hierarchical Compact Cube for Range-Max Queries. VLDB Model-Based Information Integration in a Neuroscience Mediator System. Bertram Ludäscher,Amarnath Gupta,Maryann E. Martone 2000 Model-Based Information Integration in a Neuroscience Mediator System. VLDB What Happens During a Join? 
Dissecting CPU and Memory Optimization Effects. Stefan Manegold,Peter A. Boncz,Martin L. Kersten 2000 What Happens During a Join? Dissecting CPU and Memory Optimization Effects. VLDB Agora: Living with XML and Relational. Ioana Manolescu,Daniela Florescu,Donald Kossmann,Florian Xhumari,Dan Olteanu 2000 Agora: Living with XML and Relational. VLDB Dynamic Maintenance of Wavelet-Based Histograms. Yossi Matias,Jeffrey Scott Vitter,Min Wang 2000 Dynamic Maintenance of Wavelet-Based Histograms. VLDB Temporal Queries in OLAP. Alberto O. Mendelzon,Alejandro A. Vaisman 2000 Temporal Queries in OLAP. VLDB Schema Mapping as Query Discovery. Renée J. Miller,Laura M. Haas,Mauricio A. Hernández 2000 Schema Mapping as Query Discovery. VLDB Evolution of Groupware for Business Applications: A Database Perspective on Lotus Domino/Notes. C. Mohan,Ron Barber,S. Watts,Amit Somani,Markos Zaharioudakis 2000 Evolution of Groupware for Business Applications: A Database Perspective on Lotus Domino/Notes. VLDB Integration of Data Mining with Database Technology. Amir Netz,Surajit Chaudhuri,Jeff Bernhardt,Usama M. Fayyad 2000 Integration of Data Mining with Database Technology. VLDB Asera: Extranet Architecture for B2B Solutions. Anil Nori 2000 Asera: Extranet Architecture for B2B Solutions. VLDB Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data. Chris Olston,Jennifer Widom 2000 Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data. VLDB A Case-Based Approach to Information Integration. Maurizio Panti,Luca Spalazzi,Alberto Giretti 2000 A Case-Based Approach to Information Integration. VLDB Design and Development of a Stream Service in a Heterogeneous Client Environment. Nikos Pappas,Stavros Christodoulakis 2000 Design and Development of a Stream Service in a Heterogeneous Client Environment. VLDB CheeTah: a Lightweight Transaction Server for Plug-and-Play Internet Data Management. Guy Pardon,Gustavo Alonso 2000 CheeTah: a Lightweight Transaction Server for Plug-and-Play Internet Data Management. VLDB The TreeScape System: Reuse of Pre-Computed Aggregates over Irregular OLAP Hierarchies. Torben Bach Pedersen,Christian S. Jensen,Curtis E. Dyreson 2000 The TreeScape System: Reuse of Pre-Computed Aggregates over Irregular OLAP Hierarchies. VLDB Publish/Subscribe on the Web at Extreme Speed. João Pereira,Françoise Fabret,François Llirbat,Radu Preotiuc-Pietro,Kenneth A. Ross,Dennis Shasha 2000 Publish/Subscribe on the Web at Extreme Speed. VLDB Novel Approaches in Query Processing for Moving Object Trajectories. Dieter Pfoser,Christian S. Jensen,Yannis Theodoridis 2000 Novel Approaches in Query Processing for Moving Object Trajectories. VLDB A Scalable Algorithm for Answering Queries Using Views. Rachel Pottinger,Alon Y. Levy 2000 A Scalable Algorithm for Answering Queries Using Views. VLDB Social, Educational, and Governmental Change Enabled through Information Technology. Krithi Ramamritham,Yeha El Atfi,Carlo Batini,Michael Eitan,Valerie Gregg,D. B. Phatak 2000 Social, Educational, and Governmental Change Enabled through Information Technology. VLDB Set Containment Joins: The Good, The Bad and The Ugly. Karthikeyan Ramasamy,Jignesh M. Patel,Jeffrey F. Naughton,Raghav Kaushik 2000 Set Containment Joins: The Good, The Bad and The Ugly. VLDB E.piphany Epicenter Technology Overview. Sridhar Ramaswamy 2000 E.piphany Epicenter Technology Overview. VLDB Integrating the UB-Tree into a Database System Kernel.
Frank Ramsak,Volker Markl,Robert Fenk,Martin Zirkel,Klaus Elhardt,Rudolf Bayer 2000 Integrating the UB-Tree into a Database System Kernel. VLDB Semantic Access: Semantic Interface for Querying Databases. Naphtali Rishe,Jun Yuan,Rukshan Athauda,Shu-Ching Chen,Xiaoling Lu,Xiaobin Ma,Alexander Vaschillo,Artyom Shaposhnikov,Dmitry Vasilevsky 2000 Semantic Access: Semantic Interface for Querying Databases. VLDB The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. Yasushi Sakurai,Masatoshi Yoshikawa,Shunsuke Uemura,Haruhiko Kojima 2000 The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. VLDB User-Adaptive Exploration of Multidimensional Data. Sunita Sarawagi 2000 User-Adaptive Exploration of Multidimensional Data. VLDB Controlling Data Warehouses with Knowledge Networks. Elvira Schaefer,Jan-Dirk Becker,Andreas Boehmer,Matthias Jarke 2000 Controlling Data Warehouses with Knowledge Networks. VLDB Panel: Future Directions of Database Research - The VLDB Broadening Strategy, Part 1. Hans-Jörg Schek 2000 Panel: Future Directions of Database Research - The VLDB Broadening Strategy, Part 1. VLDB Research Directions in Biodiversity Informatics. John L. Schnase 2000 Research Directions in Biodiversity Informatics. VLDB INSITE: A Tool for Interpreting Users' Interaction with a Web Space. Cyrus Shahabi,Adil Faisal,Farnoush Banaei Kashani,Jabed Faruque 2000 INSITE: A Tool for Interpreting Users' Interaction with a Web Space. VLDB Efficiently Publishing Relational Data as XML Documents. Jayavel Shanmugasundaram,Eugene J. Shekita,Rimon Barr,Michael J. Carey,Bruce G. Lindsay,Hamid Pirahesh,Berthold Reinwald 2000 XML is rapidly emerging as a standard for exchanging business data on the World Wide Web. For the foreseeable future, however, most business data will continue to be stored in relational database systems. Consequently, if XML is to fulfill its potential, some mechanism is needed to publish relational data as XML documents. Towards that goal, one of the major challenges is finding a way to efficiently structure and tag data from one or more tables as a hierarchical XML document. Different alternatives are possible depending on when this processing takes place and how much of it is done inside the relational engine. In this paper, we characterize and study the performance of these alternatives. Among other things, we explore the use of new scalar and aggregate functions in SQL for constructing complex XML documents directly in the relational engine. We also explore different execution plans for generating the content of an XML document. The results of an experimental study show that constructing XML documents inside the relational engine can have a significant performance benefit. Our results also show the superiority of having the relational engine use what we call an “outer union plan” to generate the content of an XML document. VLDB Oracle8i Index-Organized Table and Its Application to New Domains. Jagannathan Srinivasan,Souripriya Das,Chuck Freiwald,Eugene Inseok Chong,Mahesh Jagannath,Aravind Yalamanchi,Ramkumar Krishnan,Anh-Tuan Tran,Samuel DeFazio,Jayanta Banerjee 2000 Oracle8i Index-Organized Table and Its Application to New Domains. VLDB Multi-Dimensional Database Allocation for Parallel Data Warehouses. Thomas Stöhr,Holger Märtens,Erhard Rahm 2000 Multi-Dimensional Database Allocation for Parallel Data Warehouses. VLDB Concurrency in the Data Warehouse. Richard Taylor 2000 Concurrency in the Data Warehouse.
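The Shanmugasundaram et al. abstract above is about structuring and tagging rows from one or more tables as a hierarchical XML document. A client-side toy sketch of that tagging step in Python, using hypothetical customer and order tables; the paper itself studies pushing this construction inside the relational engine, for example via an outer union plan, which is not modeled here:

    from xml.sax.saxutils import escape

    # Hypothetical relational data: customers and their orders.
    customers = [(1, "Ada"), (2, "Bob")]
    orders = [(10, 1, "book"), (11, 1, "pen"), (12, 2, "lamp")]

    def publish_as_xml(customers, orders):
        """Tag flat relational rows as one hierarchical XML document by
        nesting each customer's orders under its customer element."""
        by_customer = {}
        for oid, cid, item in orders:
            by_customer.setdefault(cid, []).append((oid, item))
        parts = ["<customers>"]
        for cid, name in customers:
            parts.append(f'  <customer id="{cid}"><name>{escape(name)}</name>')
            for oid, item in by_customer.get(cid, []):
                parts.append(f'    <order id="{oid}"><item>{escape(item)}</item></order>')
            parts.append("  </customer>")
        parts.append("</customers>")
        return "\n".join(parts)

    print(publish_as_xml(customers, orders))

The performance question studied in the paper is where this nesting and tagging happens: a naive client-side version like the sketch ships and re-sorts flat rows, whereas constructing the document inside the engine lets the optimizer choose a single sorted outer-union plan.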
VLDB High-Performance and Scalability through Application Tier, In-Memory Data Management. 2000 High-Performance and Scalability through Application Tier, In-Memory Data Management. VLDB Data Mining in the Bioinformatics Domain. Shalom Tsur 2000 Data Mining in the Bioinformatics Domain. VLDB Mining Frequent Itemsets Using Support Constraints. Ke Wang,Yu He,Jiawei Han 2000 Mining Frequent Itemsets Using Support Constraints. VLDB Using SQL to Build New Aggregates and Extenders for Object-Relational Systems. Haixun Wang,Carlo Zaniolo 2000 Using SQL to Build New Aggregates and Extenders for Object-Relational Systems. VLDB FALCON: Feedback Adaptive Loop for Content-Based Retrieval. Leejay Wu,Christos Faloutsos,Katia P. Sycara,Terry R. Payne 2000 FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB Caching Strategies for Data-Intensive Web Sites. Khaled Yagoub,Daniela Florescu,Valérie Issarny,Patrick Valduriez 2000 Caching Strategies for Data-Intensive Web Sites. VLDB Building and Customizing Data-Intensive Web Sites Using Weave. Khaled Yagoub,Daniela Florescu,Valérie Issarny,Patrick Valduriez 2000 Building and Customizing Data-Intensive Web Sites Using Weave. SIGMOD Record A Vision of Management of Complex Models. Philip A. Bernstein,Alon Y. Halevy,Rachel Pottinger 2000 A Vision of Management of Complex Models. SIGMOD Record Provision of Market Services for eCo Compliant Electronic Market Places. Sena Nural Arpinar,Asuman Dogac 2000 Provision of Market Services for eCo Compliant Electronic Market Places. SIGMOD Record Comparative Analysis of Five XML Query Languages. Angela Bonifati,Stefano Ceri 2000 XML is becoming the most relevant new standard for data representation and exchange on the WWW. Novel languages for extracting and restructuring the XML content have been proposed, some in the tradition of database query languages (i.e. SQL, OQL), others more closely inspired by XML. No standard for XML query language has yet been decided, but the discussion is ongoing within the World Wide Web Consortium and within many academic institutions and Internet-related major companies. We present a comparison of five representative query languages for XML, highlighting their common features and differences. SIGMOD Record Constraint databases: A tutorial introduction. Jan Van den Bussche 2000 We give a tutorial introduction to the basic definitions surrounding the idea of constraint databases, and survey and indicate some of the achieved research results on this subject. This paper is not written as a scholarly piece, nor as polished course notes, but rather as something like the transcript of an invited talk I gave at a meeting bringing together researchers from finite model theory, database theory, and computer-aided verification, which was held at Schloss Dagstuhl in October 1999. Very recently the first book on the subject appeared [20]. It covers the state of the art in constraint databases up to, say, mid 1999 [20]. You should see this paper merely as an appetizer for the book. I will also not try to be complete in my bibliographical references. Again, see the book for that. SIGMOD Record Spatial Operators. Eliseo Clementini,Paolino Di Felice 2000 "This paper discusses issues related to the integration of spatial operators into the new generation of SQL-like query languages. Starting from spatial data models, current spatial extensions of query languages are briefly reviewed and research directions are highlighted.
A taxonomy of requirements to be satisfied by spatial operators is proposed with emphasis on users' needs and on the introduction of data uncertainty support. Further, spatial operators are classified into the three important categories of topological, projective, and metric operators and for each of them the state of the art is outlined." SIGMOD Record VLDB Workshop on Technologies in E-Services (TES). Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2000 VLDB Workshop on Technologies in E-Services (TES). SIGMOD Record SIGMOD Sister Societies. Stefano Ceri,Leonid A. Kalinichenko,Masaru Kitsuregawa,Hongjun Lu,Z. Meral Özsoyoglu,Richard T. Snodgrass,Victor Vianu 2000 SIGMOD Sister Societies. SIGMOD Record "SIGMOD Digital Symposium Collection (DiSC) Editor's Message." Isabel F. Cruz 2000 "SIGMOD Digital Symposium Collection (DiSC) Editor's Message." SIGMOD Record Incremental Maintenance of Recursive Views Using Relational Calculus/SQL. Guozhu Dong,Jianwen Su 2000 Views are a central component of both traditional database systems and new applications such as data warehouses. Very often the desired views (e.g. the transitive closure) cannot be defined in the standard language of the underlying database system. Fortunately, it is often possible to incrementally maintain these views using the standard language. For example, transitive closure of acyclic graphs, and of undirected graphs, can be maintained in relational calculus after both single edge insertions and deletions. Many such results have been published in the theoretical database community. The purpose of this survey is to make these useful results known to the wider database research and development community. There are many interesting issues involved in the maintenance of recursive views. A maintenance algorithm may be applicable to just one view, or to a class of views specified by a view definition language such as Datalog. The maintenance algorithm can be specified in a maintenance language of different expressiveness, such as the conjunctive queries, the relational calculus or SQL. Ideally, this maintenance language should be less expensive than the view definition language. The maintenance algorithm may allow updates of different kinds, such as just single tuple insertions, just single tuple deletions, special set-based insertions and/or deletions, or combinations thereof. The view maintenance algorithms may also need to maintain auxiliary relations to help maintain the views of interest. It is of interest to know the minimal arity necessary for these auxiliary relations and whether the auxiliary relations are deterministic. While many results are known about these issues for several settings, many further challenging research problems still remain to be solved. SIGMOD Record SQL Standardization: The Next Steps. Andrew Eisenberg,Jim Melton 2000 SQL Standardization: The Next Steps. SIGMOD Record "Editor's Notes." Michael J. Franklin 2000 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 2000 "Editor's Notes." SIGMOD Record "Editor's Notes." Michael J. Franklin 2000 "Editor's Notes." SIGMOD Record "Research and Practice in Federated Information Systems, Report of the EFIS '2000 International Workshop." Wilhelm Hasselbring,Willem-Jan van den Heuvel,Geert-Jan Houben,Ralf-Detlef Kutsche,Bodo Rieger,Mark Roantree,Kazimierz Subieta 2000 "Research and Practice in Federated Information Systems, Report of the EFIS '2000 International Workshop." SIGMOD Record ACM-SIGMOD Digital Review. H. V. 
Jagadish 2000 ACM-SIGMOD Digital Review. SIGMOD Record Theory of Answering Queries Using Views. Alon Y. Halevy 2000 The problem of answering queries using views is to find efficient methods of answering a query using a set of previously materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems, such as query optimization, the maintenance of physical data independence, data integration and data warehousing. This article surveys the theoretical issues concerning the problem of answering queries using views. SIGMOD Record "Treasurer's Message." Joachim Hammer 2000 "Treasurer's Message." SIGMOD Record Moving up the food chain: Supporting E-Commerce Applications on Databases. Anant Jhingran 2000 Database systems have enjoyed a tremendous market because they have served many applications really well -- transaction processing in the beginning, and then decision support. Today, with over 200% cumulative growth rate in certain segments of E-Commerce, it is clear that this new class of applications will be a strong driver for databases to grow, commercially, as well as from a research perspective. This paper outlines some of the issues that I have learnt in dealing with E-Commerce applications that may well be the focus of some of the research in database systems over the course of the next few years. SIGMOD Record Workshop on Performance and Architecture of Web Servers (PAWS-2000, held in conjunction with SIGMETRICS-2000). Krishna Kant,Prasant Mohapatra 2000 Workshop on Performance and Architecture of Web Servers (PAWS-2000, held in conjunction with SIGMETRICS-2000). SIGMOD Record "SIGMOD Anthology Editor's Message." Michael Ley 2000 "SIGMOD Anthology Editor's Message." SIGMOD Record An Extensible Compressor for XML Data. Hartmut Liefke,Dan Suciu 2000 An Extensible Compressor for XML Data. SIGMOD Record Generating Dynamic Content at Database-Backed Web Servers: cgi-bin vs. mod_perl. Alexandros Labrinidis,Nick Roussopoulos 2000 Web servers are increasingly being used to deliver dynamic content rather than static HTML pages. In order to generate web pages dynamically, servers need to execute a script, which typically connects to a DBMS. Although CGI was the first approach to server-side scripting, it has significant performance shortcomings. Currently, there are many alternative server-side scripting architectures which offer better performance than CGI. In this paper, we report our experiences using mod_perl, an Apache Server module, which can improve the performance of CGI scripts by at least an order of magnitude. Besides presenting results from our experiments, we also briefly describe the implementation of an industrial strength database-backed web site that we recently built and give a quick overview of the various server-side scripting mechanisms. SIGMOD Record Comparative Analysis of Six XML Schema Languages. Dongwon Lee,Wesley W. Chu 2000 As XML [5] is emerging as the data format of the internet era, there is a substantial increase in the amount of data in XML format. To better describe such XML data structures and constraints, several XML schema languages have been proposed. In this paper, we present a comparative analysis of six noteworthy XML schema languages. SIGMOD Record "Report in ISDO '00: The CAiSE*00 Workshop on Infrastructures for Dynamic Business-to-Business Service Outsourcing." Heiko Ludwig,Paul W. P. J.
Grefen 2000 "Report in ISDO '00: The CAiSE*00 Workshop on Infrastructures for Dynamic Business-to-Business Service Outsourcing." SIGMOD Record "Information Director's Message." Alberto O. Mendelzon 2000 "Information Director's Message." SIGMOD Record "SIGMOD'2000 Program Chair's Message." Jeffrey F. Naughton 2000 "SIGMOD'2000 Program Chair's Message." SIGMOD Record "Vice Chair's Message." Z. Meral Özsoyoglu 2000 "Vice Chair's Message." SIGMOD Record Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies. Andreas Paepcke,Hector Garcia-Molina,Gerard Rodríguez-Mulà,Junghoo Cho 2000 "In the face of small, one- or two-word queries, high volumes of diverse documents on the Web are overwhelming search and ranking technologies that are based on document similarity measures. The increase of multimedia data within documents sharply exacerbates the shortcomings of these approaches. Recently, research prototypes and commercial experiments have added techniques that augment similarity-based search and ranking. These techniques rely on judgments about the 'value' of documents. Judgments are obtained directly from users, are derived by conjecture based on observations of user behavior, or are surmised from analyses of documents and collections. All these systems have been pursued independently, and no common understanding of the underlying processes has been presented. We survey existing value-based approaches, develop a reference architecture that helps compare the approaches, and categorize the constituent algorithms. We explain the options for collecting value metadata, and for using that metadata to improve search, ranking of results, and the enhancement of information browsing. Based on our survey and analysis, we then point to several open problems." SIGMOD Record Knowledge Discovery in Data Warehouses. Themistoklis Palpanas 2000 As the size of data warehouses increases to several hundreds of gigabytes or terabytes, the need for methods and tools that will automate the process of knowledge extraction, or guide the user to subsets of the dataset that are of particular interest, is becoming prominent. In this survey paper we explore the problem of identifying and extracting interesting knowledge from large collections of data residing in data warehouses, by using data mining techniques. Such techniques have the ability to identify patterns and build succinct models to describe the data. These models can also be used to achieve summarization and approximation. We review the associated work in the OLAP, data mining, and approximate query answering literature. We discuss the need for the traditional data mining techniques to adapt to, and accommodate, the specific characteristics of OLAP systems. We also examine the notion of interestingness of data, as a tool to guide the analysis process. We describe methods that have been proposed in the literature for determining what is interesting to the user and what is not, and how these approaches can be incorporated in the data mining algorithms. SIGMOD Record New TPC Benchmarks for Decision Support and Web Commerce. Meikel Pöss,Chris Floyd 2000 "For as long as there have been DBMS's and applications that use them, there has been interest in the performance characteristics that these systems exhibit. This month's column describes some of the recent work that has taken place in TPC, the Transaction Processing Performance Council. TPC-A and TPC-B are obsolete benchmarks that you might have heard about in the past.
TPC-C V3.5 is the current benchmark for OLTP systems. Introduced in 1992, it has been run on many hardware platforms and DBMS's. Indeed, the TPC web site currently lists 202 TPC-C benchmark results. Due to its maturity, TPC-C will not be discussed in this article. We've asked two very knowledgeable individuals to write this article. Meikel Poess is the chair of the TPC-H and TPC-R Subcommittees and Chris Floyd is the chair of the TPC-W Subcommittee. We greatly appreciate their efforts. A wealth of information can be found at the TPC web site [ 1 ]. This information includes the benchmark specifications themselves, TPC membership information, and benchmark results." SIGMOD Record Using Quantitative Information for Efficient Association Rule Generation. Bruno Pôssas,Wagner Meira Jr.,Márcio de Carvalho,Rodolfo F. Resende 2000 Using Quantitative Information for Efficient Association Rule Generation. SIGMOD Record Hierarchies and Relative Operators in the OLAP Environment. Elaheh Pourabbas,Maurizio Rafanelli 2000 In the last few years, numerous proposals for modelling and querying Multidimensional Databases (MDDB) have been made. A rigorous classification of the different types of hierarchies is still an open problem. In this paper we propose and discuss some different types of hierarchies within a single dimension of a cube. These hierarchies divide a single dimension into different levels of aggregation. Depending on them, we discuss the characterization of some OLAP operators that refer to hierarchies in order to maintain data cube consistency. Moreover, we propose a set of operators for changing the hierarchy structure. The issues discussed provide modelling flexibility during the scheme design phase and correct data analysis. SIGMOD Record Object Database Evolution Using Separation of Concerns. Awais Rashid,Peter Sawyer 2000 This paper proposes an object database evolution approach based on separation of concerns. The lack of customisability and extensibility in existing evolution frameworks is a consequence of using attributes at the meta-object level to implement links among meta-objects and the injection of instance adaptation code directly into the class versions. The proposed approach uses dynamic relationships to separate the connection code from meta-objects and aspects - abstractions used by Aspect-Oriented Programming to localise cross-cutting concerns - to separate the instance adaptation code from class versions. The result is a customisable and extensible evolution framework with low maintenance overhead. SIGMOD Record Evolution and Change in Data Management - Issues and Directions. John F. Roddick,Lina Al-Jadir,Leopoldo E. Bertossi,Marlon Dumas,Florida Estrella,Heidi Gregersen,Kathleen Hornsby,Jens Lufter,Federica Mandreoli,Tomi Männistö,Enric Mayol,Lex Wedemeijer 2000 Evolution and Change in Data Management - Issues and Directions. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Charu C. Aggarwal,Alfons Kemper,Sunita Sarawagi,S. Sudarshan,Mihalis Yannakakis 2000 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. "Kenneth A. Ross,Christos Faloutsos,Alon Y. Levy,Patrick E. O'Neil,Eric Simon,Divesh Srivastava,Victor Vianu,Gerhard Weikum" 2000 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Yannis E. Ioannidis,Anant Jhingran,Christos H. Papadimitriou 2000 Reminiscences on Influential Papers.
SIGMOD Record Report on Second International Workshop on Advanced Issues of E-Commerce and Web-based Information Systems. Kun-Lung Wu,Philip S. Yu 2000 "The Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS 2000) was held at the Crowne Plaza San Jose/Silicon Valley in Milpitas, California on June 8-9, 2000. The purpose of this workshop was to bring together leading practitioners, developers and researchers to explore the challenging technical issues and find feasible solutions for advancing the current state of the art in e-commerce and web-based information systems. In particular, the workshop was interested in the infrastructure issues to facilitate e-commerce and Web-based information systems. WECWIS 2000 was successful. There were three invited talks, one industrial panel discussion and six technical sessions. The keynote speech, ""The global trading web: A strategic vision for the Internet economy,"" was delivered by Dr. Jay M. Tenenbaum, VP and Chief Scientist, Commerce One, Inc., on June 8 immediately following the opening remarks by the conference chair. The banquet address, ""Business issues in e-commerce,"" was delivered by Mr. Daniel Druker, General Manager, Hyperion e-Business Division. Finally, a lunch address, ""B2C, B2B, N2N, N2M: Why 2 is so instrumental?"" was delivered by Mr. Mustafa A. Syed, VP of Technology, VerticalNet, Inc. The industrial panel was moderated by Dr. L. Mason and Dr. Z. Zhang, both of Blue Martini Software. The panelists included J. Becher, Accrue Software; L. Mellot, Business Objects; A. Srivastava, Blue Martini Software; and C. Zhou, IBM. The panel discussion topic was ""Can e-business intelligence survive?"" Among the many interesting issues discussed were: Will privacy concerns stunt e-business intelligence utility? Will integrated e-commerce solutions be able to collect and analyze click streams, contents, products and sales data simultaneously? To what extent can out-of-the-box combined e-commerce and e-business intelligence solutions be useful? Is data mining useful in B2B e-commerce? Both positive and negative responses were hotly debated. There were a total of 30 papers included in the technical presentations, organized into six sessions. They were selected after rigorous reviews by the program committee members. The presented papers cover a wide range of topics, from framework, architecture and protocol issues of e-commerce to various types of e-services to web-based information systems for facilitating e-commerce. The rest of this report provides a brief summary of the technical presentations given in the workshop. The entire workshop proceedings are available from the IEEE Computer Society." SIGMOD Record "Chair's Message." Richard T. Snodgrass 2000 "Chair's Message." SIGMOD Record "Chair's Message." Richard T. Snodgrass 2000 "Chair's Message." SIGMOD Record "Chair's Message." Richard T. Snodgrass 2000 "Chair's Message." SIGMOD Record SIGMOD Program Review. Richard T. Snodgrass 2000 SIGMOD Program Review. SIGMOD Record Generating Spatiotemporal Datasets on the WWW. Yannis Theodoridis,Mario A. Nascimento 2000 Efficient storage, indexing and retrieval of time-evolving spatial data are some of the tasks that a Spatiotemporal Database Management System (STDBMS) must support.
Aiming at designers of indexing methods and access structures, in this article we review the GSTD algorithm for generating spatiotemporal datasets according to several user-defined parameters, and introduce a WWW-based environment for generating and visualizing such datasets. The GSTD interface is available at two main sites: http://www.cti.gr/RD3/GSTD/ and http://www.cs.ualberta.ca/~mn/GSTD/. SIGMOD Record An Optimisation Scheme for Coalesce/Valid Time Selection Operator Sequences. Costas Vassilakis 2000 Queries in temporal databases often employ the coalesce operator, either to coalesce results of projections, or data which are not coalesced upon storage. Therefore, the performance and the optimisation schemes utilised for this operator are of major importance for the performance of temporal DBMSs. To date, performance studies for various algorithms that implement this operator have been conducted; however, the joint optimisation of the coalesce operator with other algebraic operators that appear in the query execution plan has received only minimal attention. In this paper, we propose a scheme for combining the coalesce operator with selection operators which are applied to the valid time of the tuples produced from a coalescing operation. The proposed scheme aims at reducing the number of tuples that a coalescing operator must process, while at the same time allowing the optimiser to exploit temporal indices on the valid time of the data. SIGMOD Record Metadata Standards for Data Warehousing: Open Information Model vs. Common Warehouse Metamodel. Thomas Vetterli,Anca Vaduva,Martin Staudt 2000 Metadata has been identified as a key success factor in data warehouse projects. It captures all kinds of information necessary to analyse, design, build, use, and interpret the data warehouse contents. In order to spread the use of metadata, enable interoperability between repositories, and support tool integration within data warehousing architectures, a standard for metadata representation and exchange is needed. This paper considers two standards and compares them according to specific areas of interest within data warehousing. Despite their incontestable similarities, there are significant differences between the two standards which would make their unification difficult. SIGMOD Record The Implementation and Performance of Compressed Databases. Till Westmann,Donald Kossmann,Sven Helmer,Guido Moerkotte 2000 "In this paper, we show how compression can be integrated into a relational database system. Specifically, we describe how the storage manager, the query execution engine, and the query optimizer of a database system can be extended to deal with compressed data. Our main result is that compression can significantly improve the response time of queries if very light-weight compression techniques are used. We will present such light-weight compression techniques and give the results of running the TPC-D benchmark on a database compressed in this way and on an uncompressed database, using the AODB database system, an experimental database system that was developed at the Universities of Mannheim and Passau. Our benchmark results demonstrate that compression indeed offers high performance gains (up to 50%) for IO-intensive queries and moderate gains for CPU-intensive queries. Compression can, however, also increase the running time of certain update operations. In all, we recommend extending today's database systems with light-weight compression techniques and making extensive use of this feature."
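The Westmann et al. abstract above turns on light-weight, attribute-level compression whose encoded values the query engine can operate on directly. The sketch below is a minimal, hypothetical illustration of one such technique, dictionary encoding of a string column with an equality predicate evaluated on the codes; it is written in Python, is not the AODB implementation described in the paper, and all names (DictionaryColumn, select_eq) are invented for this example.

# Minimal sketch of light-weight, attribute-level dictionary compression.
# Hypothetical illustration only; not the AODB system referenced above.
class DictionaryColumn:
    """Stores a string column as small integer codes plus a value dictionary."""
    def __init__(self, values):
        self.dictionary = sorted(set(values))                  # distinct values
        self.code_of = {v: i for i, v in enumerate(self.dictionary)}
        self.codes = [self.code_of[v] for v in values]         # compressed column

    def decompress(self, row_id):
        # Decompression is a single array lookup, hence "light-weight".
        return self.dictionary[self.codes[row_id]]

    def select_eq(self, constant):
        # Evaluate an equality predicate without decompressing the column:
        # translate the constant to its code once, then compare integers.
        code = self.code_of.get(constant)
        if code is None:
            return []                                           # constant absent
        return [row_id for row_id, c in enumerate(self.codes) if c == code]

col = DictionaryColumn(["GERMANY", "FRANCE", "GERMANY", "BRAZIL", "FRANCE"])
matches = col.select_eq("FRANCE")                               # -> [1, 4]
result = [col.decompress(r) for r in matches]                   # -> ['FRANCE', 'FRANCE']

The point of the sketch is the behaviour the abstract alludes to: the predicate runs as integer comparisons over the compressed column, and values are decompressed only when the result is materialized, which is one way such techniques can benefit IO-intensive queries without a large CPU penalty.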
SIGMOD Record Cache Invalidation Scheme for Mobile Computing Systems with Real-time Data. Joe Chun-Hung Yuen,Edward Chan,Kam-yiu Lam,Hei-Wing Leung 2000 In this paper, we propose a cache invalidation scheme called Invalidation by Absolute Validity Interval (IAVI) for mobile computing systems. In IAVI, we define an absolute validity interval (AVI) for each data item based on its dynamic properties, such as the update interval. A mobile client can verify the validity of a cached item by comparing the last update time and its AVI. A cached item is invalidated if the current time is greater than the last update time plus its AVI. With this self-invalidation mechanism, the IAVI scheme uses the invalidation report to inform the mobile clients about changes in AVIs rather than the update events of the data items. As a result, the size of the invalidation report can be reduced significantly. Through extensive simulation experiments, we have found that the performance of the IAVI scheme is significantly better than other methods such as bit sequence and timestamp. ICDE Dependable Computing in Virtual Laboratories. Gustavo Alonso,Win Bausch,Cesare Pautasso,Ari Kahn,Michael T. Hallett 2001 Dependable Computing in Virtual Laboratories. ICDE A Split Operator for Now-Relative Bitemporal Databases. Mikkel Agesen,Michael H. Böhlen,Lasse Poulsen,Kristian Torp 2001 A Split Operator for Now-Relative Bitemporal Databases. ICDE Selectivity Estimation for Spatial Joins. Ning An,Zhen-Yu Yang,Anand Sivasubramaniam 2001 Selectivity Estimation for Spatial Joins. ICDE The MD-join: An Operator for Complex OLAP. Damianos Chatziantoniou,Michael O. Akinde,Theodore Johnson,Samuel Kim 2001 The MD-join: An Operator for Complex OLAP. ICDE Measuring and Optimizing a System for Persistent Database Sessions. Roger S. Barga,David B. Lomet 2001 Measuring and Optimizing a System for Persistent Database Sessions. ICDE A Cost Model and Index Architecture for the Similarity Join. Christian Böhm,Hans-Peter Kriegel 2001 A Cost Model and Index Architecture for the Similarity Join. ICDE Quality-Aware and Load-Sensitive Planning of Image Similarity Queries. Klemens Böhm,Michael Mlivoncic,Roger Weber 2001 Quality-Aware and Load-Sensitive Planning of Image Similarity Queries. ICDE The Skyline Operator. Stephan Börzsönyi,Donald Kossmann,Konrad Stocker 2001 The Skyline Operator. ICDE Developing Web Service. Adam Bosworth 2001 Adopting this architectural style is no cure-all. ICDE Processing Queries with Expensive Functions and Large Objects in Distributed Mediator Systems. Luc Bouganim,Françoise Fabret,Fabio Porto,Patrick Valduriez 2001 Processing Queries with Expensive Functions and Large Objects in Distributed Mediator Systems. ICDE MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. Douglas Burdick,Manuel Calimlim,Johannes Gehrke 2001 MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. ICDE E-Business Applications for Supply Chain Automation: Challenges and Solutions. Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2001 E-Business Applications for Supply Chain Automation: Challenges and Solutions. ICDE Inter-Enterprise Collaborative Business Process Management. Qiming Chen,Meichun Hsu 2001 Inter-Enterprise Collaborative Business Process Management. ICDE Counting Twig Matches in a Tree. Zhiyuan Chen,H. V. Jagadish,Flip Korn,Nick Koudas,S. Muthukrishnan,Raymond T. Ng,Divesh Srivastava 2001 Counting Twig Matches in a Tree. ICDE Overcoming Limitations of Sampling for Aggregation Queries.
Surajit Chaudhuri,Gautam Das,Mayur Datar,Rajeev Motwani,Vivek R. Narasayya 2001 Overcoming Limitations of Sampling for Aggregation Queries. ICDE B+ Tree Indexes with Hybrid Row Identifiers in Oracle 8i. Eugene Inseok Chong,Souripriya Das,Chuck Freiwald,Jagannathan Srinivasan,Aravind Yalamanchi,Mahesh Jagannath,Anh-Tuan Tran,Ramkumar Krishnan 2001 B+ Tree Indexes with Hybrid Row Identifiers in Oracle 8i. ICDE Database Performance for Next Generation Telecommunications. Munir Cochinwala 2001 Database Performance for Next Generation Telecommunications. ICDE High-Performance, Space-Efficient, Automated Object Locking. Laurent Daynès,Grzegorz Czajkowski 2001 High-Performance, Space-Efficient, Automated Object Locking. ICDE The Importance of Extensible Database Systems for E-Commerce. Samuel DeFazio,Ramkumar Krishnan,Jagannathan Srinivasan,Saydean Zeldin 2001 The Importance of Extensible Database Systems for E-Commerce. ICDE Bundles in Captivity: An Application of Superimposed Information. Lois M. L. Delcambre,David Maier,Shawn Bowers,Mathew Weaver,Longxing Deng,Paul Gorman,Joan Ash,Mary Lavelle,Jason Lyman 2001 Bundles in Captivity: An Application of Superimposed Information. ICDE The Nimble XML Data Integration System. Denise Draper,Alon Y. Halevy,Daniel S. Weld 2001 The Nimble XML Data Integration System. ICDE Approximate Nearest Neighbor Searching in Multimedia Databases. Hakan Ferhatosmanoglu,Ertem Tuncel,Divyakant Agrawal,Amr El Abbadi 2001 Approximate Nearest Neighbor Searching in Multimedia Databases. ICDE Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. Roberto F. Santos Filho,Agma J. M. Traina,Caetano Traina Jr.,Christos Faloutsos 2001 Similarity Search without Tears: The OMNI Family of All-purpose Access Methods. ICDE Data Management Support of Web Applications. Daniel H. Fishman 2001 Data Management Support of Web Applications. ICDE Efficient Bulk Deletes in Relational Databases. Andreas Gärtner,Alfons Kemper,Donald Kossmann,Bernhard Zeller 2001 Efficient Bulk Deletes in Relational Databases. ICDE Infrastructure for Web-based Application Integration. Dieter Gawlick 2001 Infrastructure for Web-based Application Integration. ICDE Mobile Data Management: Challenges of Wireless and Offline Data Access. Eric Giguere 2001 Applications require access to database servers for many purposes. Mobile users, those who use their computing devices away from a traditional local area network, require access to data even when central database servers are unavailable. iAnywhere Solutions provides a number of solutions that address the challenges of offline and wireless data access. In this talk, we discuss those challenges and our solutions. ICDE High-level Parallelism in a Database Cluster: A Feasibility Study Using Document Services. Torsten Grabs,Klemens Böhm,Hans-Jörg Schek 2001 High-level Parallelism in a Database Cluster: A Feasibility Study Using Document Services. ICDE B-Tree Indexes and CPU Caches. Goetz Graefe,Per-Åke Larson 2001 B-Tree Indexes and CPU Caches. ICDE On Dual Mining: From Patterns to Circumstances, and Back. Gösta Grahne,Laks V. S. Lakshmanan,Xiaohong Wang,Ming Hao Xie 2001 On Dual Mining: From Patterns to Circumstances, and Back. ICDE CORBA Notification Service: Design Challenges and Scalable Solutions. Robert E. Gruber,Balachander Krishnamurthy,Euthimios Panagos 2001 CORBA Notification Service: Design Challenges and Scalable Solutions. ICDE Discovery and Application of Check Constraints in DB2. Jarek Gryz,K.
Bernhard Schiefer,Jian Zheng,Calisto Zuzarte 2001 Discovery and Application of Check Constraints in DB2. ICDE Prefetching Based on Type-Level Access Pattern in Object-Relational DBMSs. Wook-Shin Han,Yang-Sae Moon,Kyu-Young Whang,Il-Yeol Song 2001 Prefetching Based on Type-Level Access Pattern in Object-Relational DBMSs. ICDE Workflow and Process Synchronization with Interaction Expressions and Graphs. Christian Heinlein 2001 Workflow and Process Synchronization with Interaction Expressions and Graphs. ICDE Exactly-once Semantics in a Replicated Messaging System. Yongqiang Huang,Hector Garcia-Molina 2001 Exactly-once Semantics in a Replicated Messaging System. ICDE Variable Length Queries for Time Series Data. Tamer Kahveci,Ambuj K. Singh 2001 Variable Length Queries for Time Series Data. ICDE Distinctiveness-Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. "Norio Katayama,Shin'ichi Satoh" 2001 Distinctiveness-Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE An XML Indexing Structure with Relative Region Coordinate. Dao Dinh Kha,Masatoshi Yoshikawa,Shunsuke Uemura 2001 An XML Indexing Structure with Relative Region Coordinate. ICDE An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. Sang-Wook Kim,Sanghyun Park,Wesley W. Chu 2001 An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases. ICDE An Efficient Approximation Scheme for Data Mining Tasks. George Kollios,Dimitrios Gunopulos,Nick Koudas,Stefan Berchtold 2001 An Efficient Approximation Scheme for Data Mining Tasks. ICDE A Temporal Algebra for an ER-Based Temporal Data Model. Jae Young Lee,Ramez Elmasri 2001 A Temporal Algebra for an ER-Based Temporal Data Model. ICDE SAP Business Information Warehouse - From Data Warehousing to an E-business Platform. Thomas Zurek,Klaus Kreplin 2001 SAP Business Information Warehouse - From Data Warehousing to an E-business Platform. ICDE Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Databases. Juchang Lee,Kihong Kim,Sang Kyun Cha 2001 Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Databases. ICDE Mining Partially Periodic Event Patterns with Unknown Periods. Sheng Ma,Joseph L. Hellerstein 2001 Mining Partially Periodic Event Patterns with Unknown Periods. ICDE fAST Refresh using Mass Query Optimization. Wolfgang Lehner,Roberta Cochrane,Hamid Pirahesh,Markos Zaharioudakis 2001 fAST Refresh using Mass Query Optimization. ICDE IBM DB2 Everyplace: A Small Footprint Relational Database System. Jonas S. Karlsson,Amrish Lal,T. Y. Cliff Leung,Thanh Pham 2001 IBM DB2 Everyplace: A Small Footprint Relational Database System. ICDE Efficient Sequenced Integrity Constraint Checking. Wei Li,Richard T. Snodgrass,Shiyan Deng,Vineel Kumar Gattu,Aravindan Kasthurirangan 2001 Efficient Sequenced Integrity Constraint Checking. ICDE High Dimensional Similarity Search With Space Filling Curves. Swanwa Liao,Mario A. Lopez,Scott T. Leutenegger 2001 High Dimensional Similarity Search With Space Filling Curves. ICDE An Automated Change Detection Algorithm for HTML Documents Based on Semantic Hierarchies. Seung Jin Lim,Yiu-Kai Ng 2001 An Automated Change Detection Algorithm for HTML Documents Based on Semantic Hierarchies. ICDE Model-Based Mediation with Domain Maps. Bertram Ludäscher,Amarnath Gupta,Maryann E. 
Martone 2001 Model-Based Mediation with Domain Maps. ICDE Database Managed External File Update. Neeraj Mittal,Hui-I Hsiao 2001 Database Managed External File Update. ICDE Duality-Based Subsequence Matching in Time-Series Databases. Yang-Sae Moon,Kyu-Young Whang,Woong-Kee Loh 2001 Duality-Based Subsequence Matching in Time-Series Databases. ICDE Tuning an SQL-Based PDM System in a Worldwide Client/Server Environment. Erich Müller,Peter Dadam,Jost Enderle,M. Feltes 2001 Tuning an SQL-Based PDM System in a Worldwide Client/Server Environment. ICDE Bringing the Internet to Your Database: Using SQLServer 2000 and XML to Build Loosely-Coupled Systems. Michael Rys 2001 Bringing the Internet to Your Database: Using SQLServer 2000 and XML to Build Loosely-Coupled Systems. ICDE Integrating Data Mining with SQL Databases: OLE DB for Data Mining. Amir Netz,Surajit Chaudhuri,Usama M. Fayyad,Jeff Bernhardt 2001 Integrating Data Mining with SQL Databases: OLE DB for Data Mining. ICDE Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. Sriram Padmanabhan,Timothy Malkemus,Ramesh C. Agarwal,Anant Jhingran 2001 Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. ICDE A Graph-Based Approach For Extracting Terminological Properties of Elements of XML Documents. Luigi Palopoli,Giorgio Terracina,Domenico Ursino 2001 A Graph-Based Approach For Extracting Terminological Properties of Elements of XML Documents. ICDE Rewriting OLAP Queries Using Materialized Views and Dimension Hierarchies in Data Warehouses. Chang-Sup Park,Myoung-Ho Kim,Yoon-Joon Lee 2001 Rewriting OLAP Queries Using Materialized Views and Dimension Hierarchies in Data Warehouses. ICDE SpinCircuit: A Collaboration Portal Powered by E-speak. Rabindra Pathak 2001 SpinCircuit: A Collaboration Portal Powered by E-speak. ICDE Mining Frequent Item Sets with Convertible Constraints. Jian Pei,Jiawei Han,Laks V. S. Lakshmanan 2001 Mining Frequent Item Sets with Convertible Constraints. ICDE PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth. Jian Pei,Jiawei Han,Behzad Mortazavi-Asl,Helen Pinto,Qiming Chen,Umeshwar Dayal,Meichun Hsu 2001 PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth. ICDE Pseudo Column Level Locking. Nagavamsi Ponnekanti 2001 Pseudo Column Level Locking. ICDE Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering. Jun Rao,Bruce G. Lindsay,Guy M. Lohman,Hamid Pirahesh,David E. Simmen 2001 Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering. ICDE XML Data and Object Databases: A Perfect Couple? Andreas Renner 2001 XML Data and Object Databases: A Perfect Couple? ICDE Cache-Aware Query Routing in a Cluster of Databases. Uwe Röhm,Klemens Böhm,Hans-Jörg Schek 2001 Cache-Aware Query Routing in a Cluster of Databases. ICDE Microsoft Server Technology for Mobile and Wireless Applications. Praveen Seshadri 2001 Microsoft Server Technology for Mobile and Wireless Applications. ICDE Querying XML Documents Made Easy: Nearest Concept Queries. Albrecht Schmidt,Martin L. Kersten,Menzo Windhouwer 2001 Querying XML Documents Made Easy: Nearest Concept Queries. ICDE Tamino - A DBMS designed for XML. Harald Schöning 2001 Tamino - A DBMS designed for XML. ICDE Integrating Semi-Join-Reducers into State of the Art Query Processors. Konrad Stocker,Donald Kossmann,Reinhard Braumandl,Alfons Kemper 2001 Integrating Semi-Join-Reducers into State of the Art Query Processors. 
ICDE Cache-on-Demand: Recycling with Certainty. Kian-Lee Tan,Shen-Tat Goh,Beng Chin Ooi 2001 Cache-on-Demand: Recycling with Certainty. ICDE Spatial Clustering in the Presence of Obstacles. Anthony K. H. Tung,Jean Hou,Jiawei Han 2001 Spatial Clustering in the Presence of Obstacles. ICDE TAR: Temporal Association Rules on Evolving Numerical Attributes. Wei Wang,Jiong Yang,Richard R. Muntz 2001 Data mining has been an area of increasing interest. The association rule discovery problem in particular has been widely studied. However, there are still some unresolved problems. For example, research on mining patterns in the evolution of numerical attributes is still lacking. This is both a challenging problem and one with significant practical applications in business, science, and medicine. In this paper we present a temporal association rule model for evolving numerical attributes. Metrics for qualifying a temporal association rule include the familiar measures of support and strength used in traditional association rule mining and a new metric called density. The density metric not only gives us a way to extract the rules that best represent the data, but also provides an effective mechanism to prune the search space. An efficient algorithm is devised for mining temporal association rules, which utilizes all three thresholds (especially the strength) to prune the search space drastically. Moreover, the resulting rules are represented in a concise manner via rule sets to reduce the output size. Experimental results on real and synthetic data sets demonstrate the efficiency of our algorithm. ICDE An Index Structure for Efficient Reverse Nearest Neighbor Queries. Congjun Yang,King-Ip Lin 2001 An Index Structure for Efficient Reverse Nearest Neighbor Queries. ICDE Incremental Computation and Maintenance of Temporal Aggregates. Jun Yang,Jennifer Widom 2001 We consider the problems of computing aggregation queries in temporal databases and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging since a single data update can cause aggregate results to change over the entire time line. We introduce a new index structure called the SB-tree, which incorporates features from both segment-trees and B-trees. SB-trees support fast lookup of aggregate results based on time and can be maintained efficiently when the data change. We extend the basic SB-tree index to handle cumulative (also called moving-window) aggregates, considering separately cases when the window size is or is not fixed in advance. For materialized aggregate views in a temporal database or warehouse, we propose building and maintaining SB-tree indices instead of the views themselves. SIGMOD Conference SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. Shivnath Babu,Minos N. Garofalakis,Rajeev Rastogi 2001 "While a variety of lossy compression schemes have been developed for certain forms of digital data (e.g., images, audio, video), the area of lossy compression techniques for arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are clearly motivated by the ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries over massive relational data sets. In this paper, we propose SPARTAN, a system that takes advantage of attribute semantics and data-mining models to perform lossy compression of massive data tables.
SPARTAN is based on the novel idea of exploiting predictive data correlations and prescribed error tolerances for individual attributes to construct concise and accurate Classification and Regression Tree (CaRT) models for entire columns of a table. More precisely, SPARTAN selects a certain subset of attributes for which no values are explicitly stored in the compressed table; instead, concise CaRTs that predict these values (within the prescribed error bounds) are maintained. To restrict the huge search space and construction cost of possible CaRT predictors, SPARTAN employs sophisticated learning techniques and novel combinatorial optimization algorithms. Our experimentation with several real-life data sets offers convincing evidence of the effectiveness of SPARTAN's model-based approach — SPARTAN is able to consistently yield substantially better compression ratios than existing semantic or syntactic compression tools (e.g., gzip) while utilizing only small data samples for model inference." SIGMOD Conference Monitoring Business Processes through Event Correlation based on Dependency Model. Asaf Adi,David Botzer,Opher Etzion,Tali Yatzkar-Haham 2001 Events are at the core of reactive and proactive applications, which have become popular in many domains. This demo shows the monitoring of incoming events as a means to detect possible problems in the course of business processes using a dependency model. Contemporary modeling tools lack the capability to express the event semantics and relationships to other entities. This capability is useful when the events are based on a dependency model among business processes, applications and resources. The ability to express an event by employing a general dependency model, and to use it through a designated event correlation monitoring tool, enables the accomplishment of tasks such as impact analysis and business process monitoring, including prediction of violation of constraints (such as service level agreements). This demonstrated tool provides the system designer with the ability to define and describe events and their relationships to other events, objects and tasks. The model employs various conditional dependencies that are specific to the event domain. The demo shows how systems (business processes) are monitored using the dependency / event model, by applying rules using an event correlation engine with strong expressive power. This demo proposal describes the generic application development tool, the middleware architecture and framework, and the demo. SIGMOD Conference Materialized View and Index Selection Tool for Microsoft SQL Server 2000. Sanjay Agrawal,Surajit Chaudhuri,Vivek R. Narasayya 2001 Materialized View and Index Selection Tool for Microsoft SQL Server 2000. SIGMOD Conference Generating Efficient Plans for Queries Using Views. Foto N. Afrati,Chen Li,Jeffrey D. Ullman 2001 We study the problem of generating efficient, equivalent rewritings using views to compute the answer to a query. We take the closed-world assumption, in which views are materialized from base relations, rather than views describing sources in terms of abstract predicates, as is common when the open-world assumption is used. In the closed-world model, there can be an infinite number of different rewritings that compute the same answer, yet have quite different performance. Query optimizers take a logical plan (a rewriting of the query) as an input, and generate efficient physical plans to compute the answer.
Thus our goal is to generate a small subset of the possible logical plans without missing an optimal physical plan. We first consider a cost model that counts the number of subgoals in a physical plan, and show a search space that is guaranteed to include an optimal rewriting, if the query has a rewriting in terms of the views. We also develop an efficient algorithm for finding rewritings with the minimum number of subgoals. We then consider a cost model that counts the sizes of intermediate relations of a physical plan, without dropping any attributes, and give a search space for finding optimal rewritings. Our final cost model allows attributes to be dropped in intermediate relations. We show that, by careful variable renaming, it is possible to do better than the standard “supplementary relation” approach, by dropping attributes that the latter approach would retain. Experiments show that our algorithm for generating optimal rewritings has good efficiency and scalability. SIGMOD Conference Outlier Detection for High Dimensional Data. Charu C. Aggarwal,Philip S. Yu 2001 The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are high dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms use concepts of proximity in order to find outliers based on their relationship to the rest of the data. However, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. In fact, the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high dimensional data, the notion of finding meaningful outliers becomes substantially more complex and non-obvious. In this paper, we discuss new techniques for outlier detection which find the outliers by studying the behavior of projections from the data set. SIGMOD Conference Snowball: A Prototype System for Extracting Relations from Large Text Collections. Eugene Agichtein,Luis Gravano,Jeff Pavel,Viktoriya Sokolova,Aleksandr Voskoboynik 2001 Snowball: A Prototype System for Extracting Relations from Large Text Collections. SIGMOD Conference VQBD: Exploring Semistructured Data. Sudarshan S. Chawathe,Thomas Baby,Jihwang Yeo 2001 VQBD: Exploring Semistructured Data. SIGMOD Conference Minimization of Tree Pattern Queries. Sihem Amer-Yahia,SungRan Cho,Laks V. S. Lakshmanan,Divesh Srivastava 2001 "Tree patterns form a natural basis for querying tree-structured data such as XML and LDAP. Since the efficiency of tree pattern matching against a tree-structured database depends on the size of the pattern, it is essential to identify and eliminate redundant nodes in the pattern and do so as quickly as possible. In this paper, we study tree pattern minimization both in the absence and in the presence of integrity constraints (ICs) on the underlying tree-structured database. When no ICs are considered, we call the process of minimizing a tree pattern constraint-independent minimization. We develop a polynomial time algorithm called CIM for this purpose. CIM's efficiency stems from two key properties: (i) a node cannot be redundant unless its children are, and (ii) the order of elimination of redundant nodes is immaterial. When ICs are considered for minimization, we refer to it as constraint-dependent minimization.
For tree-structured databases, required child/descendant and type co-occurrence ICs are very natural. Under such ICs, we show that the minimal equivalent query is unique. We show the surprising result that the algorithm obtained by first augmenting the tree pattern using ICs, and then applying CIM, always finds the unique minimal equivalent query; we refer to this algorithm as ACIM. While ACIM is also polynomial time, it can be expensive in practice because of its inherent non-locality. We then present a fast algorithm, CDM, that identifies and eliminates local redundancies due to ICs, based on propagating “information labels” up the tree pattern. CDM can be applied prior to ACIM for improving the minimization efficiency. We complement our analytical results with an experimental study that shows the effectiveness of our tree pattern minimization techniques." SIGMOD Conference Securing XML Documents: the Author-X Project Demonstration. Elisa Bertino,Silvana Castano,Elena Ferrari 2001 Securing XML Documents: the Author-X Project Demonstration. SIGMOD Conference Main-Memory Index Structures with Fixed-Size Partial Keys. Philip Bohannon,Peter McIlroy,Rajeev Rastogi 2001 The performance of main-memory index structures is increasingly determined by the number of CPU cache misses incurred when traversing the index. When keys are stored indirectly, as is standard in main-memory databases, the cost of key retrieval in terms of cache misses can dominate the cost of an index traversal. Yet it is inefficient in both time and space to store even moderate sized keys directly in index nodes. In this paper, we investigate the performance of tree structures suitable for OLTP workloads in the face of expensive cache misses and non-trivial key sizes. We propose two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index. We show that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation. Finally, we study the performance and cache behavior of partial-key trees by comparing them with other main-memory tree structures for a wide variety of key sizes and key value distributions. SIGMOD Conference Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data. Christian Böhm,Bernhard Braunmüller,Florian Krebs,Hans-Peter Kriegel 2001 The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase.
In the experimental evaluation, a substantial improvement over competitive techniques is shown. SIGMOD Conference Automatic Segmentation of Text into Structured Records. Vinayak R. Borkar,Kaustubh Deshmukh,Sunita Sarawagi 2001 In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool DATAMOLD that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that corroborates multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded an accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy. SIGMOD Conference Data Bubbles: Quality Preserving Performance Boosting for Hierarchical Clustering. Markus M. Breunig,Hans-Peter Kriegel,Peer Kröger,Jörg Sander 2001 In this paper, we investigate how to scale hierarchical clustering methods (such as OPTICS) to extremely large databases by utilizing data compression methods (such as BIRCH or random sampling). We propose a three step procedure: 1) compress the data into suitable representative objects; 2) apply the hierarchical clustering algorithm only to these objects; 3) recover the clustering structure for the whole data set, based on the result for the compressed data. The key issue in this approach is to design compressed data items such that not only a hierarchical clustering algorithm can be applied, but also that they contain enough information to infer the clustering structure of the original data set in the third step. This is crucial because the results of hierarchical clustering algorithms, when applied naively to a random sample or to the clustering features (CFs) generated by BIRCH, deteriorate rapidly for higher compression rates. This is due to three key problems, which we identify. To solve these problems, we propose an efficient post-processing step and the concept of a Data Bubble as a special kind of compressed data item. Applying OPTICS to these Data Bubbles allows us to recover a very accurate approximation of the clustering structure of a large data set even for very high compression rates. A comprehensive performance and quality evaluation shows that we only trade very little quality of the clustering result for a great increase in performance. SIGMOD Conference STHoles: A Multidimensional Workload-Aware Histogram. Nicolas Bruno,Surajit Chaudhuri,Luis Gravano 2001 Attributes of a relation are not typically independent. Multidimensional histograms can be an effective tool for accurate multiattribute query selectivity estimation. In this paper, we introduce STHoles, a “workload-aware” histogram that allows bucket nesting to capture data regions with reasonably uniform tuple density.
STHoles histograms are built without examining the data sets, but rather by just analyzing query results. Buckets are allocated where needed the most as indicated by the workload, which leads to accurate query selectivity estimations. Our extensive experiments demonstrate that STHoles histograms consistently produce good selectivity estimates across synthetic and real-world data sets and across query workloads, and, in many cases, outperform the best multidimensional histogram techniques that require access to and processing of the full data sets during histogram construction. SIGMOD Conference Semantic B2B Integration. Christoph Bussler 2001 The tutorial “Semantic B2B Integration” will give an introduction to the field of business-to-business (B2B) integration from a technical viewpoint with the focus on semantic integration aspects. The set of B2B integration concepts is introduced as well as their implementation in form of a technical semantic B2B integration architecture. A mix of examples is taken illustrating the problems that need to be solved in semantic B2B integration projects. The tutorial enables the audience to identify semantic B2B integration problems as well as to determine the benefits and deficiencies of various technical integration architecture approaches or B2B integration technologies. SIGMOD Conference OminiSearch: A Method for Searching Dynamic Content on the Web. David Buttler,Ling Liu,Calton Pu,Henrique Paques 2001 OminiSearch: A Method for Searching Dynamic Content on the Web. SIGMOD Conference Enabling Dynamic Content Caching for Database-Driven Web Sites. K. Selçuk Candan,Wen-Syan Li,Qiong Luo,Wang-Pin Hsiung,Divyakant Agrawal 2001 Web performance is a key differentiation among content providers. Snafus and slowdowns at major web sites demonstrate the difficulty that companies face trying to scale to a large amount of web traffic. One solution to this problem is to store web content at server-side and edge-caches for fast delivery to the end users. However, for many e-commerce sites, web pages are created dynamically based on the current state of business processes, represented in application servers and databases. Since application servers, databases, web servers, and caches are independent components, there is no efficient mechanism to make changes in the database content reflected to the cached web pages. As a result, most application servers have to mark dynamically generated web pages as non-cacheable. In this paper, we describe the architectural framework of the CachePortal system for enabling dynamic content caching for database-driven e-commerce sites. We describe techniques for intelligently invalidating dynamically generated web pages in the caches, thereby enabling caching of web pages generated based on database contents. We use some of the most popular components in the industry to illustrate the deployment and applicability of the proposed architecture. SIGMOD Conference StorHouse Metanoia - New Applications for Database, Storage & Data Warehousing. Felipe Cariño,Pekka Kostamaa,Art Kaufmann,John Burgess 2001 This paper describes the StorHouse/Relational Manager (RM) database system that uses and exploits an active storage hierarchy. By active storage hierarchy, we mean that StorHouse/RM executes SQL queries directly against data stored on all hierarchical storage (i.e. disk, optical, and tape) without post processing a file or a DBA having to manage a data set. We describe and analyze StorHouse/RM features and internals. 
We also describe how StorHouse/RM differs from traditional HSM (Hierarchical Storage Management) systems. For commercial applications we describe an evolution to the Data Warehouse concept, called Atomic Data Store, whereby atomic data is stored in the database system. Atomic data is defined as storing all the historic data values and executing queries against them. We also describe a Hub-and-Spoke Data Warehouse architecture, which is used to feed or fuel data into Data Marts. Furthermore, we provide an analysis of how StorHouse/RM can be federated with DB2, Oracle, and Microsoft SQL Server 7 (SS7) and thus provide these databases with an active storage hierarchy (i.e. tape). We then show two federated data modeling techniques, (a) logical horizontal partitioning (LHP) of tuples and (b) logical vertical partitioning (LVP) of columns, to demonstrate our database extension capabilities. We conclude with a TPC-like performance analysis of data stored on tape and disk. SIGMOD Conference Models and Languages for Describing and Discovering E-Services. Fabio Casati,Ming-Chien Shan 2001 Models and Languages for Describing and Discovering E-Services. SIGMOD Conference The Prototype of the DARE System. Tiziana Catarci,Giuseppe Santucci 2001 The Prototype of the DARE System. SIGMOD Conference PBIR - Perception-Based Image Retrieval. Edward Y. Chang,Tim Cheng,Lihyuarn L. Chang 2001 "We demonstrate a system that we have built on our proposed perception-based image retrieval (PBIR) paradigm. This PBIR system achieves accurate similarity measurements by rooting image characterization in human perception and by learning a user's query concept through an intelligent sampling process. We show that our system can usually grasp a user's query concept with a small number of labeled instances." SIGMOD Conference A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Surajit Chaudhuri,Gautam Das,Vivek R. Narasayya 2001 The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support and data mining tools. In contrast to previous sampling-based studies, we treat the problem as an optimization problem whose goal is to minimize the error in answering queries in the given workload. A key novelty of our approach is that we can tailor the choice of samples to be robust even for workloads that are “similar” but not necessarily identical to the given workload. Finally, our techniques recognize the importance of taking into account the variance in the data distribution in a principled manner. We show how our solution can be implemented on a database system, and present results of extensive experiments on Microsoft SQL Server 2000 that demonstrate the superior quality of our method compared to previous work. SIGMOD Conference Query Optimization In Compressed Database Systems. Zhiyuan Chen,Johannes Gehrke,Flip Korn 2001 Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context.
Moreover, none of the previous work suitably addresses the role of the query optimizer: During query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects the most effective compression method for string-valued attributes. We show that eager and lazy decompression strategies produce sub-optimal plans for queries involving compressed string attributes. We then formalize the problem of compression-aware query optimization and propose one provably optimal and two fast heuristic algorithms for selecting a query plan for relational schemas with compressed attributes; our algorithms can easily be integrated into existing cost-based query optimizers. Experiments using TPC-H data demonstrate the impact of our string compression methods and show the importance of compression-aware query optimization. Our approach results in up to an order of magnitude speedup over existing approaches. SIGMOD Conference Improving Index Performance through Prefetching. Shimin Chen,Phillip B. Gibbons,Todd C. Mowry 2001 This paper proposes and evaluates Prefetching B+-Trees (pB+-Trees), which use prefetching to accelerate two important operations on B+-Tree indices: searches and range scans. To accelerate searches, pB+-Trees use prefetching to effectively create wider nodes than the natural data transfer size: e.g., eight vs. one cache lines or disk pages. These wider nodes reduce the height of the B+-Tree, thereby decreasing the number of expensive misses when going from parent to child without significantly increasing the cost of fetching a given node. Our results show that this technique speeds up search and update times by a factor of 1.21-1.5 for main-memory B+-Trees. In addition, it outperforms and is complementary to “Cache-Sensitive B+-Trees.” To accelerate range scans, pB+-Trees provide arrays of pointers to their leaf nodes. These allow the pB+-Tree to prefetch arbitrarily far ahead, even for nonclustered indices, thereby hiding the normally expensive cache misses associated with traversing the leaves within the range. Our results show that this technique yields over a sixfold speedup on range scans of 1000+ keys. Although our experimental evaluation focuses on main memory databases, the techniques that we propose are also applicable to hiding disk latency. SIGMOD Conference DyDa: Data Warehouse Maintenance in Fully Concurrent Environments. Jun Chen,Xin Zhang,Songting Chen,Andreas Koeller,Elke A. Rundensteiner 2001 DyDa: Data Warehouse Maintenance in Fully Concurrent Environments. SIGMOD Conference Gangam - A Solution to Support Multiple Data Models, their Mappings and Maintenance. Kajal T. Claypool,Elke A. Rundensteiner,Xin Zhang,Hong Su,Harumi A. Kuno,Wang-Chien Lee,Gail Mitchell 2001 Gangam - A Solution to Support Multiple Data Models, their Mappings and Maintenance. SIGMOD Conference Dynamic Content Acceleration: A Caching Solution to Enable Scalable Dynamic Web Page Generation. Anindya Datta,Kaushik Dutta,Krithi Ramamritham,Helen M. Thomas,Debra E. VanderMeer 2001 Dynamic Content Acceleration: A Caching Solution to Enable Scalable Dynamic Web Page Generation. SIGMOD Conference Dissemination of Dynamic Data. Pavan Deolasee,Amol Katkar,Ankur Panchbudhe,Krithi Ramamritham,Prashant J.
Shenoy 2001 Dissemination of Dynamic Data. SIGMOD Conference Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. Amol Deshpande,Minos N. Garofalakis,Rajeev Rastogi 2001 Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”. SIGMOD Conference Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. AnHai Doan,Pedro Domingos,Alon Y. Halevy 2001 A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy. SIGMOD Conference The Nimble Integration Engine. Denise Draper,Alon Y. Halevy,Daniel S. 
Weld 2001 The consensus that XML has become the de facto standard for data interchange will spur demand for technology that allows users to integrate data from a variety of applications, repositories, and legacy systems which are located across the corporate intranet or at partner companies on the Internet. In the past two years, Nimble Technology has developed a product for this market. Spawned from over a person-decade of data integration research, the product has been deployed at several Fortune-500 beta-customer sites. This abstract reports on the key challenges we faced in the design of our product and highlights some issues we think require more attention from the research community. SIGMOD Conference Filtering Algorithms and Implementation for Very Fast Publish/Subscribe. Françoise Fabret,Hans-Arno Jacobsen,François Llirbat,João Pereira,Kenneth A. Ross,Dennis Shasha 2001 Filtering Algorithms and Implementation for Very Fast Publish/Subscribe. SIGMOD Conference Efficient Evaluation of XML Middle-ware Queries. Mary F. Fernandez,Atsuyuki Morishima,Dan Suciu 2001 We address the problem of efficiently constructing materialized XML views of relational databases. In our setting, the XML view is specified by a query in the declarative query language of a middle-ware system, called SilkRoute. The middle-ware system evaluates a query by sending one or more SQL queries to the target relational database, integrating the resulting tuple streams, and adding the XML tags. We focus on how to best choose the SQL queries, without having control over the target RDBMS. SIGMOD Conference Orthogonal Optimization of Subqueries and Aggregation. César A. Galindo-Legaria,Milind Joshi 2001 There is considerable overlap between strategies proposed for subquery evaluation, and those for grouping and aggregation. In this paper we show how a number of small, independent primitives generate a rich set of efficient execution strategies —covering standard proposals for subquery evaluation suggested in earlier literature. These small primitives fall into two main, orthogonal areas: Correlation removal, and efficient processing of outerjoins and GroupBy. An optimization approach based on these pieces provides syntax-independence of query processing with respect to subqueries, i. e. equivalent queries written with or without subquery produce the same efficient plan. We describe techniques implemented in Microsoft SQL Server (releases 7.0 and 8.0) for queries containing sub-queries and/or aggregations, based on a number of orthogonal optimizations. We concentrate separately on removing correlated subqueries, also called “query flattening,” and on efficient execution of queries with aggregations. The end result is a modular, flexible implementation, which produces very efficient execution plans. To demonstrate the validity of our approach, we present results for some queries from the TPC-H benchmark. From all published TPC-H results in the 300GB scale, at the time of writing (November 2000), SQL Server has the fastest results on those queries, even on a fraction of the processors used by other systems. SIGMOD Conference Optimizing Queries Using Materialized Views: A practical, scalable solution. Jonathan Goldstein,Per-Åke Larson 2001 Materialized views can provide massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must know how and when to exploit materialized views. 
This paper presents a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and describes how it can be incorporated in transformation-based optimizers. The current version handles views composed of selections, joins and a final group-by. Optimization remains fully cost based, that is, a single “best” rewrite is not selected by heuristic rules but multiple rewrites are generated and the optimizer chooses the best alternative in the normal way. Experimental results based on an implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand. SIGMOD Conference Space-Efficient Online Computation of Quantile Summaries. Michael Greenwald,Sanjeev Khanna 2001 An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN. We present a new online algorithm for computing ε-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of O((1/ε) log(εN)). This improves upon the previous best result of O((1/ε) log²(εN)). Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence. Finally, the actual space bounds obtained on experimental data are significantly better than the worst case guarantees of our algorithm as well as the observed space requirements of earlier algorithms. SIGMOD Conference On Computing Correlated Aggregates Over Continual Data Streams. Johannes Gehrke,Flip Korn,Divesh Srivastava 2001 In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible. We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams. SIGMOD Conference Selectivity Estimation using Probabilistic Models.
Lise Getoor,Benjamin Taskar,Daphne Koller 2001 Estimating the result size of complex queries that involve selection on multiple attributes and the join of several relations is a difficult but fundamental task in database query processing. It arises in cost-based query optimization, query profiling, and approximate query answering. In this paper, we show how probabilistic graphical models can be effectively used for this task as an accurate and compact approximation of the joint frequency distribution of multiple attributes across multiple relations. Probabilistic Relational Models (PRMs) are a recent development that extends graphical statistical models such as Bayesian Networks to relational domains. They represent the statistical dependencies between attributes within a table, and between attributes across foreign-key joins. We provide an efficient algorithm for constructing a PRM from a database, and show how a PRM can be used to compute selectivity estimates for a broad class of queries. One of the major contributions of this work is a unified framework for the estimation of queries involving both select and foreign-key join operations. Furthermore, our approach is not limited to answering a small set of predetermined queries; a single model can be used to effectively estimate the sizes of a wide collection of potential queries across multiple tables. We present results for our approach on several real-world databases. For both single-table multi-attribute queries and a general class of select-join queries, our approach produces more accurate estimates than standard approaches to selectivity estimation, using comparable space and time. SIGMOD Conference Time Series Similarity Measures and Time Series Indexing. Dimitrios Gunopulos,Gautam Das 2001 Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference Efficient and Tunable Similar Set Retrieval. Aristides Gionis,Dimitrios Gunopulos,Nick Koudas 2001 Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle we create structures that make use of well known and widely used data structuring techniques, as a means to ease integration with existing infrastructure. We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing suitably encoded (in a way that preserves similarity) binary vectors in Hamming space, thus reducing the problem to one of similarity query processing in Hamming space. Then, we introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote for indexing.
Finally we present experimental results from a prototype implementation of our techniques using real life datasets exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme. SIGMOD Conference Exploiting Constraint-Like Data Characterizations in Query Optimization. Parke Godfrey,Jarek Gryz,Calisto Zuzarte 2001 Query optimizers nowadays draw upon many sources of information about the database to optimize queries. They employ runtime statistics in cost-based estimation of query plans. They employ integrity constraints in the query rewrite process. Primary and foreign key constraints have long played a role in the optimizer, both for rewrite opportunities and for providing more accurate cost predictions. More recently, other types of integrity constraints are being exploited by optimizers in commercial systems, for which certain semantic query optimization techniques have now been implemented. These new optimization strategies that exploit constraints hold the promise for good improvement. Their weakness, however, is that often the “constraints” that would be useful for optimization for a given database and workload are not explicitly available for the optimizer. Data mining tools can find such “constraints” that are true of the database, but then there is the question of how this information can be kept by the database system, and how to make this information available to, and effectively usable by, the optimizer. We present our work on soft constraints in DB2. A soft constraint is a syntactic statement equivalent to an integrity constraint declaration. A soft constraint is not really a constraint, per se, since future updates may undermine it. While a soft constraint is valid, however, it can be used by the optimizer in the same way integrity constraints are. We present two forms of soft constraint: absolute and statistical. An absolute soft constraint is consistent with respect to the current state of the database, just in the same way an integrity constraint must be. They can be used in rewrite, as well as in cost estimation. A statistical soft constraint differs in that it may have some degree of violation with respect to the state of the database. Thus, statistical soft constraints cannot be used in rewrite, but they can still be used in cost estimation. We are working long-term on absolute soft constraints. We discuss the issues involved in implementing a facility for absolute soft constraints in a database system (and in DB2), and the strategies that we are researching. The current DB2 optimizer is more amenable to adding facilities for statistical soft constraints. In the short-term, we have been implementing pathways in the optimizer for statistical soft constraints. We discuss this implementation. SIGMOD Conference Online Query Processing. Peter J. Haas,Joseph M. Hellerstein 2001 Online Query Processing. SIGMOD Conference Efficient Computation of Iceberg Cubes with Complex Measures. Jiawei Han,Jian Pei,Guozhu Dong,Ke Wang 2001 It is often too expensive to compute and materialize a complete high-dimensional data cube. Computing an iceberg cube, which contains only aggregates above certain thresholds, is an effective way to derive nontrivial multi-dimensional aggregations for OLAP and data mining. 
In this paper, we study efficient methods for computing iceberg cubes with some popularly used complex measures, such as average, and develop a methodology that adopts a weaker but anti-monotonic condition for testing and pruning search space. In particular, for efficient computation of iceberg cubes with the average measure, we propose a top-k average pruning method and extend two previously studied methods, Apriori and BUC, to Top-k Apriori and Top-k BUC. To further improve the performance, an interesting hypertree structure, called H-tree, is designed and a new iceberg cubing method, called Top-k H-Cubing, is developed. Our performance study shows that Top-k BUC and Top-k H-Cubing are two promising candidates for scalable computation, and Top-k H-Cubing has better performance in most cases. SIGMOD Conference DNA-Miner: A System Prototype for Mining DNA Sequences. Jiawei Han,Hasan M. Jamil,Ying Lu,Liangyou Chen,Yaqin Liao,Jian Pei 2001 DNA-Miner: A System Prototype for Mining DNA Sequences. SIGMOD Conference Clio: A Semi-Automatic Tool For Schema Mapping. Mauricio A. Hernández,Renée J. Miller,Laura M. Haas 2001 Clio: A Semi-Automatic Tool For Schema Mapping. SIGMOD Conference PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries. Vagelis Hristidis,Nick Koudas,Yannis Papakonstantinou 2001 "Users often need to optimize the selection of objects by appropriately weighting the importance of multiple object attributes. Such optimization problems appear often in operations' research and applied mathematics as well as everyday life; e.g., a buyer may select a home as a weighted function of a number of attributes like its distance from office, its price, its area, etc. We capture such queries in our definition of preference queries that use a weight function over a relation's attributes to derive a score for each tuple. Database systems cannot efficiently produce the top results of a preference query because they need to evaluate the weight function over all tuples of the relation. PREFER answers preference queries efficiently by using materialized views that have been pre-processed and stored. We first show how the result of a preference query can be produced in a pipelined fashion using a materialized view. Then we show that excellent performance can be delivered given a reasonable number of materialized views and we provide an algorithm that selects a number of views to precompute and materialize given space constraints. We have implemented the algorithms proposed in this paper in a prototype system called PREFER, which operates on top of a commercial database management system. We present the results of a performance comparison, comparing our algorithms with prior approaches using synthetic datasets. Our results indicate that the proposed algorithms are superior in performance compared to other approaches, both in preprocessing (preparation of materialized views) as well as execution time." SIGMOD Conference Probe, Count, and Classify: Categorizing Hidden Web Databases. Panagiotis G. Ipeirotis,Luis Gravano,Mehran Sahami 2001 The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web “crawlers.” Recent studies have estimated the size of this “hidden web” to be 500 billion pages, while the size of the “crawlable” web is only an estimated two billion pages. 
Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases. SIGMOD Conference Global Optimization of Histograms. H. V. Jagadish,Hui Jin,Beng Chin Ooi,Kian-Lee Tan 2001 Histograms are frequently used to represent the distribution of data values in an attribute of a relation. Most previous work has focused on identifying the optimal histogram (given a limited number of buckets) for a single attribute independent of other attributes/histograms. In this paper, we propose the idea of global optimization of histograms, i.e., single-attribute histograms for a set of attributes are optimized collectively so as to minimize the overall error in using the histograms. The idea is to allocate more buckets to histograms whose attributes are more frequently used and/or distributions are highly skewed. While the accuracy of some histograms is penalized (being assigned fewer buckets), we expect the global error to be low compared to the traditional method (of allocating an equal number of buckets to each histogram). We propose two algorithms to determine the histograms to construct for a collection of attributes. The first is based on dynamic programming, and the second is a greedy algorithm. We compare the overall error of these algorithms against the traditional method. Extensive experiments are conducted and the results confirm the benefits of global optimal histograms in reducing the overall error. The extent of improvement depends on the data and query distributions, ranging from no benefit when there are no significant differences in the data distributions to over a factor of 100 reduction in error in some cases we tried. The time to compute the global optimal histogram using dynamic programming is much longer than the time to compute optimal histograms separately for each attribute, and the difference widens at a faster rate as the number of histograms increases. With the greedy algorithm, the time penalty is small, but the error reduction is somewhat less as well. We propose a third algorithm, called greedy algorithm with remedy, that has running time similar to the greedy algorithm, but produces results close to global optimum. In fact, in every experiment that we tried, this algorithm found the exact global optimum. SIGMOD Conference Mining Needle in a Haystack: Classifying Rare Classes via Two-phase Rule Induction. Mahesh V. Joshi,Ramesh C. Agarwal,Vipin Kumar 2001 Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, or deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes.
The key feature of our method is that it separately conquers the objectives of achieving high recall and high precision for the given target class. The first phase of the method aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve the precision by learning rules to remove false positives in the collection of the records covered by the first phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such approach is inadequate for rare classes, because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better or comparable to the best competitor in terms of achieving better balance between recall and precision. SIGMOD Conference Proxy-Server Architectures for OLAP. Panos Kalnis,Dimitris Papadias 2001 Data warehouses have been successfully employed for assisting decision making by offering a global view of the enterprise data and providing mechanisms for On-Line Analytical processing. Traditionally, data warehouses are utilized within the limits of an enterprise or organization. The growth of Internet and WWW however, has created new opportunities for data sharing among ad-hoc, geographically spanned and possibly mobile users. Since it is impractical for each enterprise to set up a worldwide infrastructure, currently such applications are handled by the central warehouse. This often yields poor performance, due to overloading of the central server and low transfer rate of the network. In this paper we propose an architecture for OLAP cache servers (OCS). An OCS is the equivalent of a proxy-server for web documents, but it is designed to accommodate data from warehouses and support OLAP operations. We allow numerous OCSs to be connected via an arbitrary network, and present a centralized, a semi-centralized and an autonomous control policy. We experimentally evaluate these policies and compare the performance gain against the existing systems where caching is performed only at the client side. Our architecture offers increased autonomy at remote clients, substantial network traffic savings, better scalability, lower response time and is complementary both to existing OLAP cache systems and distributed OLAP approaches. SIGMOD Conference Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. Eamonn J. Keogh,Kaushik Chakrabarti,Sharad Mehrotra,Michael J. Pazzani 2001 Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. SIGMOD Conference COSIMA - Your Smart, Speaking E-Salesperson. Werner Kießling,Stefan Holland,Stefan Fischer,Thorsten Ehm 2001 "We present a new cooperative user interface for e-shopping. COSIMA is an intelligent Internet avatar with dynamic voice output that assists customers through their e-shopping tours and advises them like a real salesperson. 
COSIMA uses a meta search engine based on Preference SQL, computing the results that best match the customer's wishes. COSIMA can qualify these results and generate proper voice output. Our presentation shows COSIMA in action for comparison shopping." SIGMOD Conference Optimizing Multidimensional Index Trees for Main Memory Access. Kihong Kim,Sang Kyun Cha,Keunjoo Kwon 2001 "Recent studies have shown that cache-conscious indexes such as the CSB+-tree outperform conventional main memory indexes such as the T-tree. The key idea of these cache-conscious indexes is to eliminate most of the child pointers from a node to increase the fanout of the tree. When the node size is chosen in the order of the cache block size, this pointer elimination effectively reduces the tree height, and thus improves the cache behavior of the index. However, the pointer elimination cannot be directly applied to multidimensional index structures such as the R-tree, where the size of a key, typically, an MBR (minimum bounding rectangle), is much larger than that of a pointer. Simple elimination of four-byte pointers does not help much to pack more entries in a node. This paper proposes a cache-conscious version of the R-tree called the CR-tree. To pack more entries in a node, the CR-tree compresses MBR keys, which occupy almost 80% of index data in the two-dimensional case. It first represents the coordinates of an MBR key relative to the lower left corner of its parent MBR to eliminate the leading 0's from the relative coordinate representation. Then, it quantizes the relative coordinates with a fixed number of bits to further cut off the trailing less significant bits. Consequently, the CR-tree becomes significantly wider and smaller than the ordinary R-tree. Our experimental and analytical study shows that the two-dimensional CR-tree performs search up to 2.5 times faster than the ordinary R-tree while maintaining similar update performance and consuming about 60% less memory space." SIGMOD Conference Spatial Data Management for Computer Aided Design. Hans-Peter Kriegel,Andreas Müller,Marco Pötke,Thomas Seidl 2001 This demonstration presents a spatial database integration for novel CAD applications into off-the-shelf database systems. Spatial queries on even large product databases for digital mockup or haptic rendering are performed at interactive response times. SIGMOD Conference Fast-Start: Quick Fault Recovery in Oracle. Tirthankar Lahiri,Amit Ganesh,Ron Weiss,Ashok Joshi 2001 Availability requirements for database systems are more stringent than ever before with the widespread use of databases as the foundation for ebusiness. This paper highlights Fast-Start™ Fault Recovery, an important availability feature in Oracle, designed to expedite recovery from unplanned outages. Fast-Start allows the administrator to configure a running system to impose predictable bounds on the time required for crash recovery. For instance, fast-start allows fine-grained control over the duration of the roll-forward phase of crash recovery by adaptively varying the rate of checkpointing with minimal impact on online performance. Persistent transaction locking in Oracle allows normal online processing to be resumed while the rollback phase of recovery is still in progress, and fast-start allows quick and transparent rollback of changes made by uncommitted transactions prior to a crash. SIGMOD Conference RETINA: A REal-time TraffIc NAvigation System. Kam-yiu Lam,Edward Chan,Tei-Wei Kuo,S. W.
Ng,Dick Hung 2001 RETINA: A REal-time TraffIc NAvigation System. SIGMOD Conference Modeling High-Dimensional Index Structures using Sampling. Christian A. Lang,Ambuj K. Singh 2001 A large number of index structures for high-dimensional data have been proposed previously. In order to tune and compare such index structures, it is vital to have efficient cost prediction techniques for these structures. Previous techniques either assume uniformity of the data or are not applicable to high-dimensional data. We propose the use of sampling to predict the number of accessed index pages during a query execution. Sampling is independent of the dimensionality and preserves clusters, which is important for representing skewed data. We present a general model for estimating the index page layout using sampling and show how to compensate for errors. We then give an implementation of our model under restricted memory assumptions and show that it performs well even under these constraints. Errors are minimal and the overall prediction time is up to two orders of magnitude below the time for building and probing the full index without sampling. SIGMOD Conference XML Data Management Go Native or Spruce up Relational Systems? (Panel Abstract). Per-Åke Larson 2001 XML Data Management Go Native or Spruce up Relational Systems? (Panel Abstract). SIGMOD Conference Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure. Iosif Lazaridis,Sharad Mehrotra 2001 Answering aggregate queries like SUM, COUNT, MIN, MAX, AVG in an approximate manner is often desirable when the exact answer is not needed or too costly to compute. We present an algorithm for answering such queries in multi-dimensional databases, using selective traversal of a Multi-Resolution Aggregate (MRA) tree structure storing point data. Our approach provides 100% confidence intervals on the value of the aggregate and works iteratively, producing answers of improving quality, until some error requirement is satisfied or a time constraint is reached. Using the same technique we can also answer aggregate queries exactly, and our experiments indicate that even for exact answering the proposed data structure and algorithm are very fast. SIGMOD Conference Dynamic Buffer Allocation in Video-on-Demand Systems. Sang Ho Lee,Kyu-Young Whang,Yang-Sae Moon,Il-Yeol Song 2001 In video-on-demand (VOD) systems, as the size of the buffer allocated to user requests increases, initial latency and memory requirements increase. Hence, the buffer size must be minimized. The existing static buffer allocation scheme, however, determines the buffer size based on the assumption that the system is in the fully loaded state. Thus, when the system is in a partially loaded state, the scheme allocates a buffer larger than necessary to a user request. This paper proposes a dynamic buffer allocation scheme that allocates to user requests buffers of the minimum size in a partially loaded state as well as in the fully loaded state. The inherent difficulty in determining the buffer size in the dynamic buffer allocation scheme is that the size of the buffer currently being allocated is dependent on the number of and the sizes of the buffers to be allocated in the next service period. We solve this problem by the predict-and-enforce strategy, where we predict the number and the sizes of future buffers based on inertia assumptions and enforce these assumptions at runtime.
Any violation of these assumptions is resolved by deferring service to the violating new user request until the assumptions are satisfied. Since the size of the current buffer is dependent on the sizes of the future buffers, the size is represented by a recurrence equation. We provide a solution to this equation, which can be computed at the system initialization time for runtime efficiency. We have performed extensive analysis and simulation. The results show that the dynamic buffer allocation scheme reduces initial latency (averaged over the number of user requests in service from one to the maximum capacity) to 1/29.4 ∼ 1/11.0 of that for the static one and, by reducing the memory requirement, increases the number of concurrent user requests to 2.36 ∼ 3.25 times that of the static one when averaged over the amount of system memory available. These results demonstrate that the dynamic buffer allocation scheme significantly improves the performance and capacity of VOD systems. SIGMOD Conference Catalog Management in Websphere Commerce Suite. Thomas Maguire 2001 Catalog Management in Websphere Commerce Suite. SIGMOD Conference Data Management: Lasting Impact of the Wild, Wild, Web. Reed M. Meseck 2001 Data Management: Lasting Impact of the Wild, Wild, Web. SIGMOD Conference Materialized View Selection and Maintenance Using Multi-Query Optimization. Hoshi Mistry,Prasan Roy,S. Sudarshan,Krithi Ramamritham 2001 Materialized views have been found to be very effective at speeding up queries, and are increasingly being supported by commercial databases and data warehouse systems. However, whereas the amount of data entering a warehouse and the number of materialized views are rapidly increasing, the time window available for maintaining materialized views is shrinking. These trends necessitate efficient techniques for the maintenance of materialized views. In this paper, we show how to find an efficient plan for the maintenance of a set of materialized views, by exploiting common subexpressions between different view maintenance expressions. In particular, we show how to efficiently select (a) expressions and indices that can be effectively shared, by transient materialization; (b) additional expressions and indices for permanent materialization; and (c) the best maintenance plan — incremental or recomputation — for each view. These three decisions are highly interdependent, and the choice of one affects the choice of the others. We develop a framework that cleanly integrates the various choices in a systematic and efficient manner. Our evaluations show that many-fold improvement in view maintenance time can be achieved using our techniques. Our algorithms can also be used to efficiently select materialized views to speed up workloads containing queries and updates. SIGMOD Conference Application Servers: Born-Again TP Monitors for the Web? (Panel Abstract). C. Mohan 2001 Application Servers: Born-Again TP Monitors for the Web? (Panel Abstract). SIGMOD Conference Experiences in Mining Aviation Safety Data. Zohreh Nazeri,Eric Bloedorn,Paul Ostwald 2001 The goal of data analysis in aviation safety is simple: improve safety. However, the path to this goal is hard to identify. What data mining methods are most applicable to this task? What data are available and how should they be analyzed? How do we focus on the most interesting results? Our answers to these questions are based on a recent research project we completed.
The encouraging news is that we found a number of aviation safety offices doing commendable work to collect and analyze safety-related data. But we also found a number of areas where data mining techniques could provide new tools that either perform analyses that were not considered before, or that can now be done more easily. Currently, Aviation Safety offices collect and analyze the incident reports by a combination of manual and automated methods. Data analysis is done by safety officers who are well familiar with the domain, but not with data mining methods. Some Aviation Safety officers have tools to automate the database query and report generation process. However, the actual analysis is done by the officer with only fairly rudimentary tools to help extract the useful information from the data. Our research project looked at the application of data mining techniques to aviation safety data to help Aviation Safety officers with their analysis task. This effort led to the creation of a tool called the “Aviation Safety Data Mining Workbench”. This paper describes the research effort, the workbench, the experience with data mining of Aviation Safety data, and lessons learned. SIGMOD Conference Iceberg-cube computation with PC clusters. Raymond T. Ng,Alan S. Wagner,Yu Yin 2001 In this paper, we investigate the approach of using a low-cost PC cluster to parallelize the computation of iceberg-cube queries. We concentrate on techniques directed towards online querying of large, high-dimensional datasets where it is assumed that the total cube has not been precomputed. The algorithmic space we explore considers trade-offs between parallelism, computation and I/O. Our main contribution is the development and a comprehensive evaluation of various novel, parallel algorithms. Specifically: (1) Algorithm RP is a straightforward parallel version of BUC [BR99]; (2) Algorithm BPP attempts to reduce I/O by outputting results in a more efficient way; (3) Algorithm ASL, which maintains cells in a cuboid in a skiplist, is designed to put the utmost priority on load balancing; and (4) alternatively, Algorithm PT load-balances by using binary partitioning to divide the cube lattice as evenly as possible. We present a thorough performance evaluation on all these algorithms on a variety of parameters, including the dimensionality of the cube, the sparseness of the cube, the selectivity of the constraints, the number of processors, and the size of the dataset. A key finding is that it is not a one-algorithm-fits-all situation. We recommend a “recipe” which uses PT as the default algorithm, but may also deploy ASL under specific circumstances. SIGMOD Conference Monitoring XML Data on the Web. Benjamin Nguyen,Serge Abiteboul,Gregory Cobena,Mihai Preda 2001 We consider the monitoring of a flow of incoming documents. More precisely, we present here the monitoring used in a very large warehouse built from XML documents found on the web. The flow of documents consists of XML pages (that are warehoused) and HTML pages (that are not). Our contributions are the following: a subscription language which specifies the monitoring of pages when fetched, the periodical evaluation of continuous queries, and the production of XML reports; a description of the architecture of the system we implemented, which makes it possible to monitor a flow of millions of pages per day with millions of subscriptions on a single PC and scales up by using more machines; and a new algorithm for processing alerts that can be used in a wider context.
We support monitoring at the page level (e.g., discovery of a new page within a certain semantic domain) as well as at the element level (e.g., insertion of a new electronic product in a catalog). This work is part of the Xyleme system. Xyleme is developed on a cluster of PCs under Linux with Corba communications. The part of the system described in this paper has been implemented. We mention first experiments. SIGMOD Conference Adaptive Precision Setting for Cached Approximate Values. Chris Olston,Boon Thau Loo,Jennifer Widom 2001 Caching approximate values instead of exact values presents an opportunity for performance gains in exchange for decreased precision. To maximize the performance improvement, cached approximations must be of appropriate precision: approximations that are too precise easily become invalid, requiring frequent refreshing, while overly imprecise approximations are likely to be useless to applications, which must then bypass the cache. We present a parameterized algorithm for adjusting the precision of cached approximations adaptively to achieve the best performance as data values, precision requirements, or workload vary. We consider interval approximations to numeric values but our ideas can be extended to other kinds of data and approximations. Our algorithm strictly generalizes previous adaptive caching algorithms for exact copies: we can set parameters to require that all approximations be exact, in which case our algorithm dynamically chooses whether or not to cache each data value. We have implemented our algorithm and tested it on synthetic and real-world data. A number of experimental results are reported, showing the effectiveness of our algorithm at maximizing performance, and also showing that in the special case of exact caching our algorithm performs as well as previous algorithms. In cases where bounded imprecision is acceptable, our algorithm easily outperforms previous algorithms for exact caching. SIGMOD Conference Bit-Sliced Index Arithmetic. "Denis Rinfret,Patrick E. O'Neil,Elizabeth J. O'Neil" 2001 "The bit-sliced index (BSI) was originally defined in [ONQ97]. The current paper introduces the concept of BSI arithmetic. For any two BSI's X and Y on a table T, we show how to efficiently generate new BSI's Z, V, and W, such that Z = X + Y, V = X - Y, and W = MIN(X, Y); this means that if a row r in T has a value x represented in BSI X and a value y in BSI Y, the value for r in BSI Z will be x + y, the value in V will be x - y and the value in W will be MIN(x, y). Since a bitmap representing a set of rows is the simplest bit-sliced index, BSI arithmetic is the most straightforward way to determine multisets of rows (with duplicates) resulting from the SQL clauses UNION ALL (addition), EXCEPT ALL (subtraction), and INTERSECT ALL (min) (see [OO00, DB2SQL] for definitions of these clauses). Another contribution of the current paper is to generalize BSI range restrictions from [ONQ97] to a new non-Boolean form: to determine the top k BSI-valued rows, for any meaningful value k between one and the total number of rows in T. Together with bit-sliced addition, this permits us to solve a common basic problem of text retrieval: given an object-relational table T of rows representing documents, with a collection type column K representing keyword terms, we demonstrate an efficient algorithm to find k documents that share the largest number of terms with some query list Q of terms.
A great deal of published work on such problems exists in the Information Retrieval (IR) field. The algorithm we introduce, which we call Bit-Sliced Term-Matching, or BSTM, uses an approach comparable in performance to the most efficient known IR algorithm, a major improvement on current DBMS text searching algorithms, with the advantage that it uses only indexing we propose for native database operations." SIGMOD Conference Communication Efficient Distributed Mining of Association Rules. Assaf Schuster,Ran Wolff 2001 Mining for associations between items in large transactional databases is a central problem in the field of knowledge discovery. When the database is partitioned among several share-nothing machines, the problem can be addressed using distributed data mining algorithms. One such algorithm, called CD, was proposed by Agrawal and Shafer and was later enhanced by the FDM algorithm of Cheung, Han et al. The main problem with these algorithms is that they do not scale well with the number of partitions. They are thus impractical for use in modern distributed environments such as peer-to-peer systems, in which hundreds or thousands of computers may interact.In this paper we present a set of new algorithms that solve the Distributed Association Rule Mining problem using far less communication. In addition to being very efficient, the new algorithms are also extremely robust. Unlike existing algorithms, they continue to be efficient even when the data is skewed or the partition sizes are imbalanced. We present both experimental and theoretical results concerning the behavior of these algorithms and explain how they can be implemented in different settings. SIGMOD Conference Will Database Researchers Have ANY Role in Data Security? (Panel Abstract). Arnon Rosenthal 2001 Will Database Researchers Have ANY Role in Data Security? (Panel Abstract). SIGMOD Conference Fault-tolerant, Load-balancing Queries in Telegraph. Mehul A. Shah,Sirish Chandrasekaran 2001 Fault-tolerant, Load-balancing Queries in Telegraph. SIGMOD Conference "Kweelt: More than just ""yet another framework to query XML!""" Arnaud Sahuguet 2001 "Kweelt: More than just ""yet another framework to query XML!""" SIGMOD Conference REVIEW: A Real Time Virtual Walkthrough System. Lidan Shou,Jason Chionh,Kian-Lee Tan,Yixin Ruan,Zhiyong Huang 2001 REVIEW: A Real Time Virtual Walkthrough System. SIGMOD Conference Adaptable Query Optimization and Evaluation in Temporal Middleware. Giedrius Slivinskas,Christian S. Jensen,Richard T. Snodgrass 2001 "Time-referenced data are pervasive in most real-world databases. Recent advances in temporal query languages show that such database applications may benefit substantially from built-in temporal support in the DBMS. To achieve this, temporal query optimization and evaluation mechanisms must be provided, either within the DBMS proper or as a source level translation from temporal queries to conventional SQL. This paper proposes a new approach: using a middleware component on top of a conventional DBMS. This component accepts temporal SQL statements and produces a corresponding query plan consisting of algebraic as well as regular SQL parts. The algebraic parts are processed by the middleware, while the SQL parts are processed by the DBMS. The middleware uses performance feedback from the DBMS to adapt its partitioning of subsequent queries into middleware and DBMS parts. 
The paper describes the architecture and implementation of the temporal middleware component, termed TANGO, which is based on the Volcano extensible query optimizer and the XXL query processing library. Experiments with the system demonstrate the utility of the middleware's internal processing capability and its cost-based mechanism for apportioning the processing between the middleware and the underlying DBMS." SIGMOD Conference MPEG-7 Standard for Multimedia Databases. John R. Smith 2001 MPEG-7 Standard for Multimedia Databases. SIGMOD Conference Updating XML. Igor Tatarinov,Zachary G. Ives,Alon Y. Halevy,Daniel S. Weld 2001 As XML has developed over the past few years, its role has expanded beyond its original domain as a semantics-preserving markup language for online documents, and it is now also the de facto format for interchanging data between heterogeneous systems. Data sources export XML “views” over their data, and other systems can directly import or query these views. As a result, there has been great interest in languages and systems for expressing queries over XML data, whether the XML is stored in a repository or generated as a view over some other data storage format. Clearly, in order to fully evolve XML into a universal data representation and sharing format, we must allow users to specify updates to XML documents and must develop techniques to process them efficiently. Update capabilities are important not only for modifying XML documents, but also for propagating changes through XML views and for expressing and transmitting changes to documents. This paper begins by proposing a set of basic update operations for both ordered and unordered XML data. We next describe extensions to the proposed standard XML query language, XQuery, to incorporate the update operations. We then consider alternative methods for implementing update operations when the XML data is mapped into a relational database. Finally, we describe an experimental evaluation of the alternative techniques for implementing our extensions. SIGMOD Conference Content Integration for E-Business. Michael Stonebraker,Joseph M. Hellerstein 2001 We define the problem of content integration for E-Business, and show how it differs in fundamental ways from traditional issues surrounding data integration, application integration, data warehousing and OLTP. Content integration includes catalog integration as a special case, but encompasses a broader set of applications and challenges. We explore the characteristics of content integration and required services for any solution. In addition, we explore architectural alternatives and discuss the use of XML in this arena. SIGMOD Conference "Lots o' Ticks: Real-Time High Performance Time Series Queries on Billions of Trades and Quotes." Arthur T. Whitney,Dennis Shasha 2001 Financial mathematicians think they can predict the future by looking at time series of trades and quotes (called ticks) from the past. The main evidence for this hypothesis is that prices fluctuate only by a small amount in a given day and more or less obey the mathematics of a random walk. The hypothesis allows traders to price options and to speculate on stocks. This demonstration presents a query language and a parallel database (50-way parallelism) to support traders who want to analyze every tick, not just end-of-day ticks, using temporal statistical queries such as time-delayed correlations and tick trends.
This is the first attempt that we know of to store and analyze hundreds of gigabytes of time series data and to query that data using a declarative time series extension to SQL (available at www.kx.com). SIGMOD Conference Using the Golden Rule of Sampling for Query Estimation. Yi-Leh Wu,Divyakant Agrawal,Amr El Abbadi 2001 Using the Golden Rule of Sampling for Query Estimation. SIGMOD Conference The Network is the Database: Data Management for Highly Distributed Systems. Julio C. Navas,Michael J. Wynblatt 2001 This paper describes the methodology and implementation of a data management system for highly distributed systems, which was built to solve the scalability and reliability problems faced in a wide area postal logistics application developed at Siemens. The core of the approach is to borrow from Internet routing protocols, and their proven scalability and robustness, to build a network-embedded dynamic database index, and to augment schema definition to optimize the use of this index. The system was developed with an eye toward future applications in the area of sensor networks. SIGMOD Conference Data-Driven Understanding and Refinement of Schema Mappings. Ling-Ling Yan,Renée J. Miller,Laura M. Haas,Ronald Fagin 2001 At the heart of many data-intensive applications is the problem of quickly and accurately transforming data into a new form. Database researchers have long advocated the use of declarative queries for this process. Yet tools for creating, managing and understanding the complex queries necessary for data transformation are still too primitive to permit widespread adoption of this approach. We present a new framework that uses data examples as the basis for understanding and refining declarative schema mappings. We identify a small set of intuitive operators for manipulating examples. These operators permit a user to follow and refine an example by walking through a data source. We show that our operators are powerful enough both to identify a large class of schema mappings and to distinguish effectively between alternative schema mappings. These operators permit a user to quickly and intuitively build and refine complex data transformation queries that map one data source into another. SIGMOD Conference Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. Clement T. Yu,Weiyi Meng,Wensheng Wu,King-Lup Liu 2001 Linkages among documents have a significant impact on the importance of documents, as it can be argued that important documents are pointed to by many documents or by other important documents. Metasearch engines can be used to help ordinary users retrieve information from multiple local sources (text databases). There is a search engine associated with each database. In a large-scale metasearch engine, the contents of each local database are represented by a representative. Each user query is evaluated against the set of representatives of all databases in order to determine the appropriate databases (search engines) to search (invoke). In previous work, the linkage information between documents has not been utilized in determining the appropriate databases to search. In this paper, such information is employed to determine the degree of relevance of a document with respect to a given query. Specifically, the importance (rank) of each document as determined by the linkages is integrated in each database representative to facilitate the selection of databases for each given query.
We establish a necessary and sufficient condition to rank databases optimally, while incorporating the linkage information. A method is provided to estimate the desired quantities stated in the necessary and sufficient condition. The estimation method runs in time linearly proportional to the number of query terms. Experimental results are provided to demonstrate the high retrieval effectiveness of the method. SIGMOD Conference On Supporting Containment Queries in Relational Database Management Systems. Chun Zhang,Jeffrey F. Naughton,David J. DeWitt,Qiong Luo,Guy M. Lohman 2001 Virtually all proposals for querying XML include a class of query we term “containment queries”. It is also clear that in the foreseeable future, a substantial amount of XML data will be stored in relational database systems. This raises the question of how to support these containment queries. The inverted list technology that underlies much of Information Retrieval is well-suited to these queries, but should we implement this technology (a) in a separate loosely-coupled IR engine, or (b) using the native tables and query execution machinery of the RDBMS? With option (b), more than twenty years of work on RDBMS query optimization, query execution, scalability, and concurrency control and recovery immediately extend to the queries and structures that implement these new operations. But all this will be irrelevant if the performance of option (b) lags that of (a) by too much. In this paper, we explore some performance implications of both options using native implementations in two commercial relational database systems and in a special purpose inverted list engine. Our performance study shows that while RDBMSs are generally poorly suited for such queries, under certain conditions they can outperform an inverted list engine. Our analysis further identifies two significant causes that differentiate the performance of the IR and RDBMS implementations: the join algorithms employed and the hardware cache utilization. Our results suggest that contrary to most expectations, with some modifications, a native implementation in an RDBMS can support this class of queries much more efficiently. SIGMOD Conference Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001 Sharad Mehrotra,Timos K. Sellis 2001 Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001 VLDB The WorlInfo Assistant: Spatio-Temporal Information Integration on the Web. José Luis Ambite,Craig A. Knoblock,Mohammad R. Kolahdouzan,Maria Muslea,Cyrus Shahabi,Snehal Thakkar 2001 The WorlInfo Assistant: Spatio-Temporal Information Integration on the Web. VLDB Analyzing Quantitative Databases: Image is Everything. Amihood Amir,Reuven Kashi,Nathan S. Netanyahu 2001 Analyzing Quantitative Databases: Image is Everything. VLDB Analyzing energy behavior of spatial access methods for memory-resident data. Ning An,Anand Sivasubramaniam,Narayanan Vijaykrishnan,Mahmut T. Kandemir,Mary Jane Irwin,Sudhanva Gurumurthi 2001 Analyzing energy behavior of spatial access methods for memory-resident data. VLDB PicoDBMS: Validation and Experience. Nicolas Anciaux,Christophe Bobineau,Luc Bouganim,Philippe Pucheral,Patrick Valduriez 2001 PicoDBMS: Validation and Experience. VLDB User-Optimizer Communication using Abstract Plans in Sybase ASE. Mihnea Andrei,Patrick Valduriez 2001 User-Optimizer Communication using Abstract Plans in Sybase ASE.
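The containment-query entry above (Zhang et al.) rests on the idea that element and word occurrences can be stored as inverted lists and that containment then reduces to a join over those lists. The sketch below is a rough illustration only, not the schemas or algorithms evaluated in that paper: it assumes the common interval encoding in which an element occurrence is a (docno, begin, end) tuple and a word occurrence is a (docno, pos) tuple, and it computes the document-level containment join with a single merge pass; all names are hypothetical.

    # Hypothetical sketch of a document-level containment join over two
    # inverted lists: element occurrences encoded as (docno, begin, end)
    # intervals and word occurrences encoded as (docno, pos) positions.
    # Both lists are assumed to be sorted.

    def containment_join(elements, words):
        """Return docnos in which some element interval contains some word position."""
        result = set()
        i = j = 0
        while i < len(elements) and j < len(words):
            e_doc, begin, end = elements[i]
            w_doc, pos = words[j]
            if (e_doc, end) < (w_doc, pos):      # element ends before the word: advance elements
                i += 1
            elif (w_doc, pos) < (e_doc, begin):  # word precedes this element: advance words
                j += 1
            else:                                # same docno and begin <= pos <= end: containment
                result.add(e_doc)
                j += 1
        return sorted(result)

    # Tiny worked example: in document 1 an element spanning positions 2..5
    # contains a word at position 4; document 2 has no such containment.
    elements = [(1, 2, 5), (2, 10, 12)]   # sorted by (docno, begin)
    words = [(1, 4), (2, 20)]             # sorted by (docno, pos)
    print(containment_join(elements, words))   # prints [1]

In a relational setting the same result could be expressed declaratively as a join of the two tables on docno with a begin <= pos <= end predicate; the abstract's observations about join algorithms and hardware cache utilization concern exactly how such joins get executed.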
VLDB DB2 Spatial Extender - Spatial data within the RDBMS. David W. Adler 2001 DB2 Spatial Extender - Spatial data within the RDBMS. VLDB Storage and Querying of E-Commerce Data. Rakesh Agrawal,Amit Somani,Yirong Xu 2001 Storage and Querying of E-Commerce Data. VLDB Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. Ashraf Aboulnaga,Alaa R. Alameldeen,Jeffrey F. Naughton 2001 Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB Navigating large-scale semi-structured data in business portals. Mani Abrol,Neil Latarche,Uma Mahadevan,Jianchang Mao,Rajat Mukherjee,Prabhakar Raghavan,Michel Tourn,John Wang,Grace Zhang 2001 Navigating large-scale semi-structured data in business portals. VLDB Weaving Relations for Cache Performance. Anastassia Ailamaki,David J. DeWitt,Mark D. Hill,Marios Skounakis 2001 Weaving Relations for Cache Performance. VLDB Data Staging for On-Demand Broadcast. Demet Aksoy,Michael J. Franklin,Stanley B. Zdonik 2001 Data Staging for On-Demand Broadcast. VLDB Visual Web Information Extraction with Lixto. Robert Baumgartner,Sergio Flesca,Georg Gottlob 2001 Visual Web Information Extraction with Lixto. VLDB Supervised Wrapper Generation with Lixto. Robert Baumgartner,Sergio Flesca,Georg Gottlob 2001 Supervised Wrapper Generation with Lixto. VLDB An Evaluation of Generic Bulk Loading Techniques. Jochen Van den Bercken,Bernhard Seeger 2001 An Evaluation of Generic Bulk Loading Techniques. VLDB Flexible and scalable digital library search. Henk Ernst Blok,Menzo Windhouwer,Roelof van Zwol,Milan Petkovic,Peter M. G. Apers,Martin L. Kersten,Willem Jonker 2001 Flexible and scalable digital library search. VLDB Indexing and Querying XML Data for Regular Path Expressions. Quanzhong Li,Bongki Moon 2001 Indexing and Querying XML Data for Regular Path Expressions. VLDB Fast Evaluation Techniques for Complex Similarity Queries. Klemens Böhm,Michael Mlivoncic,Hans-Jörg Schek,Roger Weber 2001 Fast Evaluation Techniques for Complex Similarity Queries. VLDB Warehousing Workflow Data: Challenges and Opportunities. Angela Bonifati,Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2001 Warehousing Workflow Data: Challenges and Opportunities. VLDB Ontology-based Support for Digital Government. Athman Bouguettaya,Ahmed K. Elmagarmid,Brahim Medjahed,Mourad Ouzzani 2001 Ontology-based Support for Digital Government. VLDB Aggregate Maintenance for Data Warehousing in Informix Red Brick Vista. Craig J. Bunker,Latha S. Colby,Richard L. Cole,William J. McKenna,Gopal Mulagund,David Wilhite 2001 Aggregate Maintenance for Data Warehousing in Informix Red Brick Vista. VLDB LoPiX: A System for XML Data Integration and Manipulation. Wolfgang May 2001 LoPiX: A System for XML Data Integration and Manipulation. VLDB Online Scaling in a Highly Available Database. Svein Erik Bratsberg,Rune Humborstad 2001 Online Scaling in a Highly Available Database. VLDB The Propel Distributed Services Platform. Michael J. Carey,Steve Kirsch,Mary Roth,Bert Van der Linden,Nicolas Adiba,Michael Blow,Daniela Florescu,David Li,Ivan Oprencak,Rajendra Panwar,Runping Qi,David Rieber,John C. Shafer,Brian Sterling,Tolga Urhan,Brian Vickery,Dan Wineman,Kuan Yee 2001 The Propel Distributed Services Platform. VLDB Improving Business Process Quality through Exception Understanding, Prediction, and Prevention. Daniela Grigori,Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2001 Improving Business Process Quality through Exception Understanding, Prediction, and Prevention. 
VLDB Operating System Extensions for the Teradata Parallel VLDB. John Catozzi,Sorana Rabinovici 2001 Operating System Extensions for the Teradata Parallel VLDB. VLDB A Prototype Content-Based Retrieval System that Uses Virtual Images to Save Space. Leonard Brown,Le Gruenwald 2001 A Prototype Content-Based Retrieval System that Uses Virtual Images to Save Space. VLDB Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. Sang Kyun Cha,Sangyong Hwang,Kihong Kim,Keunjoo Kwon 2001 Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. VLDB Efficient Management of Multiversion Documents by Object Referencing. Shu-Yao Chien,Vassilis J. Tsotras,Carlo Zaniolo 2001 Efficient Management of Multiversion Documents by Object Referencing. VLDB Ambient Intelligence with the Ubiquitous Network, the Embedded Computer Devices and the Hidden Databases (abstract). Egbert-Jan Sol 2001 Ambient Intelligence with the Ubiquitous Network, the Embedded Computer Devices and the Hidden Databases (abstract). VLDB A Formal Perspective on the View Selection Problem. Rada Chirkova,Alon Y. Halevy,Dan Suciu 2001 The view selection problem is to choose a set of views to materialize over a database schema, such that the cost of evaluating a set of workload queries is minimized and such that the views fit into a prespecified storage constraint. The two main applications of the view selection problem are materializing views in a database to speed up query processing, and selecting views to materialize in a data warehouse to answer decision support queries. In addition, view selection is a core problem for intelligent data placement over a wide-area network for data integration applications and data management for ubiquitous computing. We describe several fundamental results concerning the view selection problem. We consider the problem for views and workloads that consist of equality-selection, project and join queries, and show that the complexity of the problem depends crucially on the quality of the estimates that a query optimizer has on the size of the views it is considering to materialize. When a query optimizer has good estimates of the sizes of the views, we show a somewhat surprising result, namely, that an optimal choice of views may involve a number of views that is exponential in the size of the database schema. On the other hand, when an optimizer uses standard estimation heuristics, we show that the number of necessary views and the expression size of each view are polynomially bounded. VLDB Data Management for Pervasive Computing. Mitch Cherniack,Michael J. Franklin,Stanley B. Zdonik 2001 Data Management for Pervasive Computing. VLDB Storage and Retrieval of XML Data Using Relational Databases. Surajit Chaudhuri,Kyuseok Shim 2001 Storage and Retrieval of XML Data Using Relational Databases. VLDB Dynamic Update Cube for Range-sum Queries. Seok-Ju Chun,Chin-Wan Chung,Ju-Hong Lee,Seok-Lyong Lee 2001 Dynamic Update Cube for Range-sum Queries. VLDB A Fast Index for Semistructured Data. Brian F. Cooper,Neal Sample,Michael J. Franklin,Gísli R. Hjaltason,Moshe Shadmon 2001 A Fast Index for Semistructured Data. VLDB Lineage Tracing for General Data Warehouse Transformations. Yingwei Cui,Jennifer Widom 2001 Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. 
During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex “data cleansing” procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing. VLDB Self-similarity in the Web. Stephen Dill,Ravi Kumar,Kevin S. McCurley,Sridhar Rajagopalan,D. Sivakumar,Andrew Tomkins 2001 "Algorithmic tools for searching and mining the Web are becoming increasingly sophisticated and vital. In this context, algorithms that use and exploit structural information about the Web perform better than generic methods in both efficiency and reliability. We present an extensive characterization of the graph structure of the Web, with a view to enabling high-performance applications that make use of this structure. In particular, we show that the Web emerges as the outcome of a number of essentially independent stochastic processes that evolve at various scales. A striking consequence of this scale invariance is that the structure of the Web is “fractal”---cohesive subregions display the same characteristics as the Web at large. An understanding of this underlying fractal nature is therefore applicable to designing data services across multiple domains and scales. We describe potential applications of this line of research to optimized algorithm design for Web-scale data analysis." VLDB Mining Multi-Dimensional Constrained Gradients in Data Cubes. Guozhu Dong,Jiawei Han,Joyce M. W. Lam,Jian Pei,Ke Wang 2001 Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB A Comparative Study of Alternative Middle Tier Caching Solutions to Support Dynamic Web Content Acceleration. Anindya Datta,Kaushik Dutta,Helen M. Thomas,Debra E. VanderMeer,Krithi Ramamritham,Dan Fishman 2001 A Comparative Study of Alternative Middle Tier Caching Solutions to Support Dynamic Web Content Acceleration. VLDB Discovering Web Services: An Overview. Vadim Draluk 2001 Discovering Web Services: An Overview. VLDB The Long-Term Preservation of Authentic Electronic Records. Luciana Duranti 2001 The Long-Term Preservation of Authentic Electronic Records. VLDB Business Process Coordination: State of the Art, Trends, and Open Issues. Umeshwar Dayal,Meichun Hsu,Rivka Ladin 2001 Business Process Coordination: State of the Art, Trends, and Open Issues. VLDB Query Engines for Web-Accessible XML Data. Leonidas Fegaras,Ramez Elmasri 2001 Query Engines for Web-Accessible XML Data. VLDB Italian Electronic Identity Card - principle and architecture. Mario Gentili 2001 Italian Electronic Identity Card - principle and architecture. VLDB Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. Phillip B. Gibbons 2001 Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. VLDB Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Anna C. 
Gilbert,Yannis Kotidis,S. Muthukrishnan,Martin Strauss 2001 Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB Declarative Data Cleaning: Language, Model, and Algorithms. Helena Galhardas,Daniela Florescu,Dennis Shasha,Eric Simon,Cristian-Augustin Saita 2001 Declarative Data Cleaning: Language, Model, and Algorithms. VLDB Approximate Query Processing: Taming the TeraBytes. Minos N. Garofalakis,Phillip B. Gibbons 2001 Approximate Query Processing: Taming the TeraBytes. VLDB Approximate String Joins in a Database (Almost) for Free. Luis Gravano,Panagiotis G. Ipeirotis,H. V. Jagadish,Nick Koudas,S. Muthukrishnan,Divesh Srivastava 2001 Approximate String Joins in a Database (Almost) for Free. VLDB A Database Index to Large Biological Sequences. Ela Hunt,Malcolm P. Atkinson,Robert W. Irving 2001 A Database Index to Large Biological Sequences. VLDB Scientific OLAP for the Biotech Domain. Nam Huyn 2001 Scientific OLAP for the Biotech Domain. VLDB Transaction Timestamping in (Temporal) Databases. Christian S. Jensen,David B. Lomet 2001 Transaction Timestamping in (Temporal) Databases. VLDB Efficient Index Structures for String Databases. Tamer Kahveci,Ambuj K. Singh 2001 Efficient Index Structures for String Databases. VLDB SMOOTH - A Distributed Multimedia Database System. Harald Kosch,László Böszörményi,Alexander Bachlechner,Christian Hanin,Christian Hofbauer,Margit Lang,Carmen Riedler,Roland Tusch 2001 SMOOTH - A Distributed Multimedia Database System. VLDB A Data Warehousing Architecture for Enabling Service Provisioning Process. Yannis Kotidis 2001 A Data Warehousing Architecture for Enabling Service Provisioning Process. VLDB Update Propagation Strategies for Improving the Quality of Data on the Web. Alexandros Labrinidis,Nick Roussopoulos 2001 Update Propagation Strategies for Improving the Quality of Data on the Web. VLDB Cache Fusion: Extending Shared-Disk Clusters with Shared Caches. Tirthankar Lahiri,Vinay Srihari,Wilson Chan,N. MacNaughton,Sashikanth Chandrasekaran 2001 Cache Fusion: Extending Shared-Disk Clusters with Shared Caches. VLDB Managing Business Processes via Workflow Technology. Frank Leymann 2001 Managing Business Processes via Workflow Technology. VLDB Cache Portal: Technology for Accelerating Database-driven e-commerce Web Sites. Wen-Syan Li,K. Selçuk Candan,Wang-Pin Hsiung,Oliver Po,Divyakant Agrawal,Qiong Luo,Wei-Kuang Waine Huang,Yusuf Akca,Cemal Yilmaz 2001 Cache Portal: Technology for Accelerating Database-driven e-commerce Web Sites. VLDB Form-Based Proxy Caching for Database-Backed Web Sites. Qiong Luo,Jeffrey F. Naughton 2001 Form-Based Proxy Caching for Database-Backed Web Sites. VLDB Generic Schema Matching with Cupid. Jayant Madhavan,Philip A. Bernstein,Erhard Rahm 2001 Generic Schema Matching with Cupid. VLDB ACTIVIEW: Adaptive data presentation using SuperSQL. Yoko Maeda,Motomichi Toyama 2001 ACTIVIEW: Adaptive data presentation using SuperSQL. VLDB Answering XML Queries on Heterogeneous Data Sources. Ioana Manolescu,Daniela Florescu,Donald Kossmann 2001 Answering XML Queries on Heterogeneous Data Sources. VLDB C2P: Clustering based on Closest Pairs. Alexandros Nanopoulos,Yannis Theodoridis,Yannis Manolopoulos 2001 C2P: Clustering based on Closest Pairs. VLDB NetCube: A Scalable Tool for Fast Data Mining and Compression. Dimitris Margaritis,Christos Faloutsos,Sebastian Thrun 2001 NetCube: A Scalable Tool for Fast Data Mining and Compression. VLDB On Processing XML in LDAP. 
Pedro José Marrón,Georg Lausen 2001 On Processing XML in LDAP. VLDB RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi,Giansalvatore Mecca,Paolo Merialdo 2001 RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB Change-Centric Management of Versions in an XML Warehouse. Amélie Marian,Serge Abiteboul,Gregory Cobena,Laurent Mignet 2001 Change-Centric Management of Versions in an XML Warehouse. VLDB Caching Technologies for Web Applications. C. Mohan 2001 Caching Technologies for Web Applications. VLDB Enabling End-users to Construct Data-intensive Web-sites from XML Repositories: An Example-based Approach. Atsuyuki Morishima,Seiichi Koizumi,Hiroyuki Kitagawa,Satoshi Takano 2001 Enabling End-users to Construct Data-intensive Web-sites from XML Repositories: An Example-based Approach. VLDB Tavant System Architecture for Sell-side Channel Management. Srinivasa Narayanan,Subbu N. Subramanian 2001 Tavant System Architecture for Sell-side Channel Management. VLDB Supporting Incremental Join Queries on Ranked Inputs. Apostol Natsev,Yuan-Chi Chang,John R. Smith,Chung-Sheng Li,Jeffrey Scott Vitter 2001 Supporting Incremental Join Queries on Ranked Inputs. VLDB Functional Properties of Information Filtering. Rie Sawai,Masahiko Tsukamoto,Yin-Huei Loh,Tsutomu Terada,Shojiro Nishio 2001 Functional Properties of Information Filtering. VLDB French government activity in the conservation of data and electronic documents. Serge Novaretti 2001 French government activity in the conservation of data and electronic documents. VLDB Collaborative Analytical Processing - Dream or Reality? (Panel abstract). "William O'Connell,Andrew Witkowski,Goetz Graefe" 2001 Collaborative Analytical Processing - Dream or Reality? (Panel abstract). VLDB Indexing the Distance: An Efficient Method to KNN Processing. Cui Yu,Beng Chin Ooi,Kian-Lee Tan,H. V. Jagadish 2001 Indexing the Distance: An Efficient Method to KNN Processing. VLDB Storage Service Providers: a Solution for Storage Management? (Panel). Banu Özden,Eran Gabber,Bruce Hillyer,Wee Teck Ng,Elizabeth A. M. Shriver,David J. DeWitt,Bruce Gordon,Jim Gray,John Wilkes 2001 Storage Service Providers: a Solution for Storage Management? (Panel). VLDB An Extendible Hash for Multi-Precision Similarity Querying of Image Databases. Shu Lin,M. Tamer Özsu,Vincent Oria,Raymond T. Ng 2001 An Extendible Hash for Multi-Precision Similarity Querying of Image Databases. VLDB MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries. Yufei Tao,Dimitris Papadias 2001 MV3R-Tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries. VLDB Information Management for Genome Level Bioinformatics. Norman W. Paton,Carole A. Goble 2001 Information Management for Genome Level Bioinformatics. VLDB WebFilter: A High-throughput XML-based Publish and Subscribe System. João Pereira,Françoise Fabret,Hans-Arno Jacobsen,François Llirbat,Dennis Shasha 2001 WebFilter: A High-throughput XML-based Publish and Subscribe System. VLDB Crawling the Hidden Web. Sriram Raghavan,Hector Garcia-Molina 2001 Crawling the Hidden Web. VLDB "Potter's Wheel: An Interactive Data Cleaning System." Vijayshankar Raman,Joseph M. Hellerstein 2001 "Potter's Wheel: An Interactive Data Cleaning System." VLDB A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services. Reza Sadri,Carlo Zaniolo,Amir M. Zarkesh,Jafar Adibi 2001 A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services. 
VLDB Similarity Search for Adaptive Ellipsoid Queries Using Spatial Transformation. Yasushi Sakurai,Masatoshi Yoshikawa,Ryoji Kataoka,Shunsuke Uemura 2001 Similarity Search for Adaptive Ellipsoid Queries Using Spatial Transformation. VLDB Intelligent Rollups in Multidimensional OLAP Data. Gayatri Sathe,Sunita Sarawagi 2001 Intelligent Rollups in Multidimensional OLAP Data. VLDB XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries. Jochen Van den Bercken,Björn Blohsfeld,Jens-Peter Dittrich,Jürgen Krämer,Tobias Schäfer,Martin Schneider,Bernhard Seeger 2001 XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries. VLDB Querying XML Views of Relational Data. Jayavel Shanmugasundaram,Jerry Kiernan,Eugene J. Shekita,Catalina Fan,John E. Funderburk 2001 Querying XML Views of Relational Data. VLDB Architectures for Internal Web Services Deployment. Oded Shmueli 2001 Architectures for Internal Web Services Deployment. VLDB SIT-IN: a Real-Life Spatio-Temporal Information System. Giuseppe Sindoni,Leonardo Tininini,Amedea Ambrosetti,Cristina Bedeschi,Stefano De Francisci,Orietta Gargano,Rossella Molinaro,Mario Paolucci,Paola Patteri,Pina Ticca 2001 SIT-IN: a Real-Life Spatio-Temporal Information System. VLDB The Semantic Web Paving the Way to the Knowledge Society. Pierre-Paul Sondag 2001 The Semantic Web Paving the Way to the Knowledge Society. VLDB Discovery of Influence Sets in Frequently Updated Databases. Ioana Stanoi,Mirek Riedewald,Divyakant Agrawal,Amr El Abbadi 2001 Discovery of Influence Sets in Frequently Updated Databases. VLDB "LEO - DB2's LEarning Optimizer." Michael Stillger,Guy M. Lohman,Volker Markl,Mokhtar Kandil 2001 "LEO - DB2's LEarning Optimizer." VLDB WARLOCK: A Data Allocation Tool for Parallel Warehouses. Thomas Stöhr,Erhard Rahm 2001 WARLOCK: A Data Allocation Tool for Parallel Warehouses. VLDB Developing an Indexing Scheme for XML Document Collection using the Oracle8i Extensibility Framework. Seema Sundara,Ying Hu,Timothy Chorma,Nipun Agarwal,Jagannathan Srinivasan 2001 Developing an Indexing Scheme for XML Document Collection using the Oracle8i Extensibility Framework. VLDB Efficient Progressive Skyline Computation. Kian-Lee Tan,Pin-Kwang Eng,Beng Chin Ooi 2001 Efficient Progressive Skyline Computation. VLDB Walking Through a Very Large Virtual Environment in Real-time. Lidan Shou,Jason Chionh,Zhiyong Huang,Yixin Ruan,Kian-Lee Tan 2001 Walking Through a Very Large Virtual Environment in Real-time. VLDB Are Web Services the Next Revolution in e-Commerce? (Panel). Shalom Tsur,Serge Abiteboul,Rakesh Agrawal,Umeshwar Dayal,Johannes Klein,Gerhard Weikum 2001 Are Web Services the Next Revolution in e-Commerce? (Panel). VLDB Dynamic Pipeline Scheduling for Improving Interactive Query Performance. Tolga Urhan,Michael J. Franklin 2001 Dynamic Pipeline Scheduling for Improving Interactive Query Performance. VLDB Views in a Large Scale XML Repository. Sophie Cluet,Pierangelo Veltri,Dan Vodislav 2001 Views in a Large Scale XML Repository. VLDB FeedbackBypass: A New Approach to Interactive Similarity Query Processing. Ilaria Bartolini,Paolo Ciaccia,Florian Waas 2001 FeedbackBypass: A New Approach to Interactive Similarity Query Processing. VLDB Et tu, XML? The downfall of the relational empire (abstract). Philip Wadler 2001 Et tu, XML? The downfall of the relational empire (abstract). VLDB Hyperqueries: Dynamic Distributed Query Processing on the Internet. 
Alfons Kemper,Christian Wiesner 2001 Hyperqueries: Dynamic Distributed Query Processing on the Internet. VLDB Comparing Hybrid Peer-to-Peer Systems. Beverly Yang,Hector Garcia-Molina 2001 Comparing Hybrid Peer-to-Peer Systems. VLDB VXMLR: A Visual XML-Relational Database System. Aoying Zhou,Hongjun Lu,Shihui Zheng,Yuqi Liang,Long Zhang,Wenyun Ji,Zengping Tian 2001 VXMLR: A Visual XML-Relational Database System. VLDB AgFlow: Agent-based Cross-Enterprise Workflow Management System. Liangzhao Zeng,Boualem Benatallah,Phuong Nguyen,Anne H. H. Ngu 2001 AgFlow: Agent-based Cross-Enterprise Workflow Management System. SIGMOD Record "Advanced XML Data Processing - Guest Editor's Introduction." Karl Aberer 2001 "Advanced XML Data Processing - Guest Editor's Introduction." SIGMOD Record Book Review Column. Karl Aberer 2001 This is the last issue of the book review column that will appear under my responsibility. I would like to thank here all authors of book reviews for their very interesting contributions over the last four years. I also hope the readers of SIGMOD RECORD found the articles in this column of interest and that in some cases they motivated readers to have a more detailed look at one of the reviewed books. For me it was surely an interesting experience seeing which new books have arrived over the last four years and also how they are appreciated by the community. SIGMOD Record Report on the 2001 SIGMOD and PODS Awards. Philip A. Bernstein 2001 Report on the 2001 SIGMOD and PODS Awards. SIGMOD Record Re-designing Distance Functions and Distance-Based Applications for High Dimensional Data. Charu C. Aggarwal 2001 In recent years, the detrimental effects of the curse of high dimensionality have been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from the performance perspective. Recent research results show that in high dimensional space, the concept of proximity may not even be qualitatively meaningful [6]. In this paper, we try to outline the effects of generalizing low dimensional techniques to high dimensional applications and the natural effects of sparsity on distance based applications. We outline the guidelines required in order to re-design either the distance functions or the distance-based applications in a meaningful way for high dimensional domains. We provide novel perspectives and insights on some new lines of work for broadening application definitions in order to effectively deal with the dimensionality curse. SIGMOD Record Quality of Service in Multimedia Digital Libraries. Elisa Bertino,Ahmed K. Elmagarmid,Mohand-Said Hacid 2001 There is currently considerable interest in developing multimedia digital libraries. However, it has become clear that existing architectures for management systems do not support the particular requirements of continuous media types. This is particularly the case in the important area of quality of service support. In this correspondence, we discuss quality of service issues within digital libraries and present a reference architecture able to support some quality aspects. SIGMOD Record EIHA?!?: Deploying Web and WAP services using XML Technology. Chiara Biancheri,Jean-Christophe R. Pazzaglia,Gavino Paddeu 2001 The exponential growth of resources on the web, and the wide deployment of devices for multimodal access to the Internet, lead to new problems in information management. 
In this context, and as part of the European project Vision, we have built an interactive telematic handbook of the culture and the territory of Sardinia. A team of cultural experts browsed the web to get a large collection of Internet resources. The system built for the management of this data uses emerging Internet technologies such as the XML language suite and its applications. The result obtained is a multimodal service, called Eiha?!?, available through PCs and mobile phones. SIGMOD Record Continuous Queries over Data Streams. Shivnath Babu,Jennifer Widom 2001 In many recent applications, data may take the form of continuous data streams, rather than finite stored data sets. Several aspects of data management need to be reconsidered in the presence of data streams, offering a new research direction for the database community. In this paper we focus primarily on the problem of query processing, specifically on how to define and evaluate continuous queries over data streams. We address semantic issues as well as efficiency concerns. Our main contributions are threefold. First, we specify a general and flexible architecture for query processing in the presence of data streams. Second, we use our basic architecture as a tool to clarify alternative semantics and processing techniques for continuous queries. The architecture also captures most previous work on continuous queries and data streams, as well as related concepts such as triggers and materialized views. Finally, we map out research topics in the area of query processing over data streams, showing where previous work is relevant and describing problems yet to be addressed. SIGMOD Record "Special Issue on Data Mining for Intrusion Detection and Threat Analysis - Guest Editor's Introduction." Daniel Barbará 2001 "Special Issue on Data Mining for Intrusion Detection and Threat Analysis - Guest Editor's Introduction." SIGMOD Record ADAM: A Testbed for Exploring the Use of Data Mining in Intrusion Detection. Daniel Barbará,Julia Couto,Sushil Jajodia,Ningning Wu 2001 Intrusion detection systems have traditionally been based on the characterization of an attack and the tracking of the activity on the system to see if it matches that characterization. Recently, new intrusion detection systems based on data mining are making their appearance in the field. This paper describes the design and experiences with the ADAM (Audit Data Analysis and Mining) system, which we use as a testbed to study how useful data mining techniques can be in intrusion detection. SIGMOD Record Report on The fourth International Conference on Flexible Query Answering systems. Patrick Bosc,Amihai Motro,Gabriella Pasi 2001 Report on The fourth International Conference on Flexible Query Answering systems. SIGMOD Record Constraints for Semi-structured Data and XML. Peter Buneman,Wenfei Fan,Jérôme Siméon,Scott Weinstein 2001 Constraints for Semi-structured Data and XML. SIGMOD Record XML Document Versioning. Shu-Yao Chien,Vassilis J. Tsotras,Carlo Zaniolo 2001 Managing multiple versions of XML documents represents an important problem, because of many applications ranging from traditional ones, such as software configuration control, to new ones, such as link permanence of web documents. Research on managing multiversion XML documents seeks to provide efficient and robust techniques for (i) storing and retrieving, (ii) exchanging, and (iii) querying such documents. 
In this paper, we first show that traditional version control methods, such as RCS and SCCS, fall short of satisfying these three requirements, and discuss alternative solutions. First, we enhance RCS with a temporal page clustering policy to achieve objective (i). Then, we discuss a reference-based versioning scheme that achieves both objectives (i) and (ii) and is also effective at supporting simple queries. The topic of supporting complex queries, including temporal ones, meshes with the burgeoning interest of database researchers in XML as a database description language, and in XML query languages. In this context, the XML versioning problems are akin to those of transaction time management for databases of objects and semistructured information. Nevertheless, the need to preserve the natural ordering of XML documents frequently requires different techniques. SIGMOD Record Detection and Classification of Intrusions and Faults using Sequences of System Calls. João B. D. Cabrera,Lundy M. Lewis,Raman K. Mehra 2001 "This paper investigates the use of sequences of system calls for classifying intrusions and faults induced by privileged processes in Unix. Classification is an essential capability for responding to an anomaly (attack or fault), since it gives the ability to associate appropriate responses to each anomaly type. Previous work using the well known dataset from the University of New Mexico (UNM) has demonstrated the usefulness of monitoring sequences of system calls for detecting anomalies induced by processes corresponding to several Unix Programs, such as sendmail, lpr, ftp, etc. Specifically, previous work has shown that the Anomaly Count of a running process, i.e., the number of sequences spawned by the process which are not found in the corresponding dictionary of normal activity for the Program, is a valuable feature for anomaly detection. To achieve Classification, in this paper we introduce the concept of Anomaly Dictionaries, which are the sets of anomalous sequences for each type of anomaly. It is verified that Anomaly Dictionaries for the UNM's sendmail Program have very little overlap, and can be effectively used for Anomaly Classification. The sequences in the Anomalous Dictionary enable a description of Self for the Anomalies, analogous to the definition of Self for Privileged Programs given by the Normal Dictionaries. The dependence of Classification Accuracy with sequence length is also discussed. As a side result, it is also shown that a hybrid scheme, combining the proposed classification strategy with the original Anomaly Counts can lead to a substantial improvement in the overall detection rates for the sendmail dataset. The methodology proposed is rather general, and can be applied to any situation where sequences of symbols provide an effective characterization of a phenomenon." SIGMOD Record Describing Semistructured Data. Luca Cardelli 2001 We introduce a rich language of descriptions for semistructured tree-like data, and we explain how such descriptions relate to the data they describe. Various query languages and data schemas can be based on such descriptions. SIGMOD Record XML and Information Retrieval: a SIGIR 2000 Workshop. David Carmel,Yoëlle S. Maarek,Aya Soffer 2001 XML and Information Retrieval: a SIGIR 2000 Workshop. SIGMOD Record On the Academic Interview Circuit: An End-to-End Discussion. Ugur Çetintemel 2001 On the Academic Interview Circuit: An End-to-End Discussion. 
SIGMOD Record Semantic Web Workshop: Models, Architectures and Management. Panos Constantopoulos,Vassilis Christophides,Dimitris Plexousakis 2001 Semantic Web Workshop: Models, Architectures and Management. SIGMOD Record Report on XEWA-00: The XML Enabled Wide-Area Searches for Bioinformatics Workshop. Terence Critchlow 2001 The XEWA-00 workshop, held in December 2000 and sponsored by the IEEE Computer Society, was organized to bring together members of the bioinformatics community to determine if XML could simplify accessing large, heterogeneous, distributed collections of web-based data sources. The starting point for a series of breakout and group discussions was a proposed strawman of a grammar that described how to query a data source through its web interface. As a result of these discussions, the approach was validated, the strawman was refined, and several reference implementations are being generated as part of an ongoing effort. This article contains an overview of the workshop, including the proposed approach and a description of the strawman. SIGMOD Record SQL/XML and the SQLX Informal Group of Companies. Andrew Eisenberg,Jim Melton 2001 SQL/XML and the SQLX Informal Group of Companies. SIGMOD Record XQuery Formal Semantics: State and Challenges. Peter Fankhauser 2001 The XQuery formalization is an ongoing effort of the W3C XML Query working group to define a precise formal semantics for XQuery. This paper briefly introduces the current state of the formalization and discusses some of the more demanding remaining challenges in formally describing an expressive query language for XML. SIGMOD Record Towards Knowledge-Based Digital Libraries. Ling Feng,Manfred A. Jeusfeld,Jeroen Hoppenbrouwers 2001 "From the standpoint of satisfying human's information needs, the current digital library (DL) systems suffer from the following two shortcomings: (i) inadequate high-level cognition support; (ii) inadequate knowledge sharing facilities. In this article, we introduce a two-layered digital library architecture to support different levels of human cognitive acts. The model moves beyond simple information searching and browsing across multiple repositories, to inquiry of knowledge about the contents of digital libraries. To address users' high-order cognitive requests, we propose an information space consisting of a knowledge subspace and a document subspace. We extend the traditional indexing and searching schema of digital libraries from keyword-based to knowledge-based by adding knowledge to the documents into the DL information space. The distinguished features of such enhanced DL systems in comparison with the traditional knowledge-based systems are also discussed." SIGMOD Record Data Analysis and Mining in the Life Sciences. Nam Huyn 2001 Biotech companies routinely generate vast amounts of biological measurement data that must be analyzed rapidly and mined for diagnostic, prognostic, or drug evaluation purposes. While these data analysis tasks are critical to their success, they have not benefited from recent advances that emerged from database and KDD research. In this paper, we focus on two such tasks: on-line analysis of clinical study data, and mining broad datasets for biomarkers. We examine the new requirements that are not met by current data analysis technologies and we identify new database and KDD research to address these needs. 
We describe our experience implementing a Scientific OLAP system and a data mining platform for the support of biomarker discovery at SurroMed, and we outline some key technical challenges that must be overcome before data analysis and data mining technologies can be widely adopted in the biotech industry. SIGMOD Record DIMACS Summer School Tutorial on New Frontiers in Data Mining. Dimitrios Gunopulos,Nick Koudas 2001 DIMACS Summer School Tutorial on New Frontiers in Data Mining. SIGMOD Record Information Warfare and Security - Book Review. H. V. Jagadish 2001 Information Warfare and Security - Book Review. SIGMOD Record "Treasurer's Message." Joachim Hammer 2001 "Treasurer's Message." SIGMOD Record Wrapping Web Data into XML. Wei Han,David Buttler,Calton Pu 2001 The vast majority of online information is part of the World Wide Web. In order to use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique to convert HTML documents into semantically meaningful XML files. However, developing wrappers is slow and labor-intensive. Further, frequent changes on the HTML documents typically require frequent changes in the wrappers. This paper describes XWRAP Elite, a tool to automatically generate robust wrappers. XWRAP breaks down the conversion process into three steps. First, discover where the data is located in an HTML page and separate the data into individual objects. Second, decompose objects into data elements. Third, mark objects and elements in an output format. XWRAP Elite automates the first two steps and minimizes human involvement in marking output data. Our experience shows that XWRAP is able to create useful wrapper software for a wide variety of real world HTML documents. SIGMOD Record "Editor's Notes." Ling Liu 2001 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2001 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2001 "Editor's Notes." SIGMOD Record Report on the 8th International Workshop on Knowledge Representation Meets Databases (KRDB). Maurizio Lenzerini,Daniele Nardi,Werner Nutt,Dan Suciu 2001 Report on the 8th International Workshop on Knowledge Representation Meets Databases (KRDB). SIGMOD Record Introduction to the Career Forum Column. Alexandros Labrinidis 2001 Introduction to the Career Forum Column. SIGMOD Record Career-Enhancing Services at SIGMOD Online. Alexandros Labrinidis,Alberto O. Mendelzon 2001 This article serves three purposes. First of all, to introduce dbjobs, the database of database jobs, and also describe its functionality and architecture. Secondly, to present statistics for the dbgrads system, after 18 months of continuous operation. Finally, to describe exciting future projects for SIGMOD Online. SIGMOD Record SIGMOD 2001 Industry Sessions. Guy M. Lohman,César A. Galindo-Legaria,Michael J. Franklin,Leonard J. Seligman 2001 SIGMOD 2001 Industry Sessions. SIGMOD Record The Evolution of Effective B-tree: Page Organization and Techniques: A Personal Account. David B. Lomet 2001 An under-appreciated facet of index search structures is the importance of high performance search within B-tree internal nodes. Much attention has been focused on improving node fanout, and hence minimizing the tree height [BU77, LL86]. [GG97, Lo98] have discussed the importance of B-tree page size. 
A recent article [GL2001] discusses internal node architecture, but the subject is buried in a single section of the paper. In this short note, I want to describe the long evolution of good internal node architecture and techniques, including an understanding of what problem was being solved during each of the incremental steps that have led to much improved node organizations. SIGMOD Record Querying Multi-dimensional Data Indexed Using the Hilbert Space-filling Curve. Jonathan K. Lawder,Peter J. H. King 2001 Mapping to one-dimensional values and then using a one-dimensional indexing method has been proposed as a way of indexing multi-dimensional data. Most previous related work uses the Z-Order Curve but more recently the Hilbert Curve has been considered since it has superior clustering properties. Any approach, however, can only be of practical value if there are effective methods for executing range and partial match queries. This paper describes such a method for the Hilbert Curve. SIGMOD Record Mining System Audit Data: Opportunities and Challenges. Wenke Lee,Wei Fan 2001 "Intrusion detection is an essential component of computer security mechanisms. It requires accurate and efficient analysis of a large amount of system and network audit data. It can thus be an application area of data mining. There are several characteristics of audit data: abundant raw data, rich system and network semantics, and ever “streaming”. Accordingly, when developing data mining approaches, we need to focus on: feature extraction and construction, customization of (general) algorithms according to semantic information, and optimization of execution efficiency of the output models. In this paper, we describe a data mining framework for mining audit data for intrusion detection models. We discuss its advantages and limitations, and outline the open research problems." SIGMOD Record Preservation of Digital Data with Self-Validating, Self-Instantiating Knowledge-Based Archives. Bertram Ludäscher,Richard Marciano,Reagan Moore 2001 "Digital archives are dedicated to the long-term preservation of electronic information and have the mandate to enable sustained access despite rapid technology changes. Persistent archives are confronted with heterogeneous data formats, helper applications, and platforms being used over the lifetime of the archive. This is not unlike the interoperability challenges, for which mediators are devised. To prevent technological obsolescence over time and across platforms, a migration approach for persistent archives is proposed based on an XML infrastructure. We extend current archival approaches that build upon standardized data formats and simple metadata mechanisms for collection management, by involving high-level conceptual models and knowledge representations as an integral part of the archive and the ingestion/migration processes. Infrastructure independence is maximized by archiving generic, executable specifications of (i) archival constraints (i.e., “model validators”), and (ii) archival transformations that are part of the ingestion process. The proposed architecture facilitates construction of self-validating and self-instantiating knowledge-based archives. We illustrate our overall approach and report on first experiences using a sample collection from a collaboration with the National Archives and Records Administration (NARA)." SIGMOD Record SQL Multimedia and Application Packages (SQL/MM). 
Jim Melton,Andrew Eisenberg 2001 Regular readers of this column will have become familiar with database language SQL -- indeed, most readers are already familiar with it. We have also discussed the fact that the SQL standard is being published in multiple parts and have even discussed one of those parts in some detail [1]. Another standard, based on SQL and its structured user-defined types [2], has been developed and published by the International Organization for Standardization (ISO). This standard, like SQL, is divided into multiple parts (more independent than the parts of SQL, in fact). Some parts of this other standard, known as SQL/MM, have already been published and are currently in revision, while others are still in preparation for initial publication. In this issue, we introduce SQL/MM and review each of its parts, necessarily at a high level. SIGMOD Record SQL and Management of External Data. Jim Melton,Jan-Eike Michels,Vanja Josifovski,Krishna G. Kulkarni,Peter M. Schwarz,Kathy Zeidenstein 2001 "In late 2000, work was completed on yet another part of the SQL standard [1], to which we introduced our readers in an earlier edition of this column [2]. Although SQL database systems manage an enormous amount of data, they certainly have no monopoly on that task. Tremendous amounts of data remain in ordinary operating system files, in network and hierarchical databases, and in other repositories. The need to query and manipulate that data alongside SQL data continues to grow. Database system vendors have developed many approaches to providing such integrated access. In this (partly guested) article, SQL's new part, Management of External Data (SQL/MED), is explored to give readers a better notion of just how applications can use standard SQL to concurrently access their SQL data and their non-SQL data." SIGMOD Record The Clio Project: Managing Heterogeneity. Renée J. Miller,Mauricio A. Hernández,Laura M. Haas,Ling-Ling Yan,C. T. Howard Ho,Ronald Fagin,Lucian Popa 2001 Clio is a system for managing and facilitating the complex tasks of heterogeneous data transformation and integration. In Clio, we have collected together a powerful set of data management techniques that have proven invaluable in tackling these difficult problems. In this paper, we present the underlying themes of our approach and present a brief case study. SIGMOD Record A General Technique for Querying XML Documents using a Relational Database System. Jayavel Shanmugasundaram,Eugene J. Shekita,Jerry Kiernan,Rajasekar Krishnamurthy,Stratis Viglas,Jeffrey F. Naughton,Igor Tatarinov 2001 There has been recent interest in using relational database systems to store and query XML documents. Each of the techniques proposed in this context works by (a) creating tables for the purpose of storing XML documents (also called relational schema generation), (b) storing XML documents by shredding them into rows in the created tables, and (c) converting queries over XML documents into SQL queries over the created tables. Since relational schema generation is a physical database design issue -- dependent on factors such as the nature of the data, the query workload and availability of schemas -- there have been many techniques proposed for this purpose. Currently, each relational schema generation technique requires its own query processor to efficiently convert queries over XML documents into SQL queries over the created tables. 
In this paper, we present an efficient technique whereby the same query processor can be used for all such relational schema generation techniques. This greatly simplifies the task of relational schema generation by eliminating the need to write a special-purpose query processor for each new solution to the problem. In addition, our proposed technique enables users to query seamlessly across relational data and XML documents. This provides users with unified access to both relational and XML data without them having to deal with separate databases. SIGMOD Record "Chair's Message." M. Tamer Özsu 2001 "Chair's Message." SIGMOD Record "Chair's Message." M. Tamer Özsu 2001 "Chair's Message." SIGMOD Record Infosphere Project: System Support for Information Flow Applications. Calton Pu,Karsten Schwan,Jonathan Walpole 2001 We describe the Infosphere project, which is building the systems software support for information-driven applications such as digital libraries and electronic commerce. The main technical contribution is the Infopipe abstraction to support information flow with quality of service. Using building blocks such as program specialization, software feedback, domain-specific languages, and personalized information filtering, the Infopipe software generates code and manages resources to provide the specified quality of service with support for composition and restructuring. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Stefano Ceri,Luis Gravano,Per-Åke Larson,Leonid Libkin,Tova Milo 2001 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Jiawei Han,Nick Koudas,Rajeev Rastogi,Daniel J. Rosenkrantz,Peter Scheuermann,Dan Suciu 2001 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,James R. Hamilton,Laks V. S. Lakshmanan,Limsoon Wong 2001 Reminiscences on Influential Papers. SIGMOD Record Using Unknowns to Prevent Discovery of Association Rules. Yücel Saygin,Vassilios S. Verykios,Chris Clifton 2001 Data mining technology has given us new capabilities to identify correlations in large data sets. This introduces risks when the data is to be made public, but the correlations are private. We introduce a method for selectively removing individual values from a database to prevent the discovery of a set of rules, while preserving the data for other applications. The efficacy and complexity of this method are discussed. We also present an experiment showing an example of this methodology. SIGMOD Record Why And How To Benchmark XML Databases. Albrecht Schmidt,Florian Waas,Martin L. Kersten,Daniela Florescu,Michael J. Carey,Ioana Manolescu,Ralph Busse 2001 Benchmarks belong to the very standard repertory of tools deployed in database development. Assessing the capabilities of a system, analyzing actual and potential bottlenecks, and, naturally, comparing the pros and cons of different system architectures have become indispensable tasks as database management systems grow in complexity and capacity. In the course of the development of XML databases the need for a benchmark framework has become more and more evident: a great many different ways to store XML data have been suggested in the past, each with its genuine advantages, disadvantages and consequences that propagate through the layers of a complex database system and need to be carefully considered. The different storage schemes render the query characteristics of the data variably different. 
However, no conclusive methodology for assessing these differences is available to date. In this paper, we outline desiderata for a benchmark for XML databases drawing from our own experience of developing an XML repository, involvement in the definition of the standard query language, and experience with standard benchmarks for relational databases. SIGMOD Record Transactional Information Systems - Book Review. Marc H. Scholl 2001 Transactional Information Systems - Book Review. SIGMOD Record Java Support for Data-Intensive Systems: Experiences Building the Telegraph Dataflow System. Mehul A. Shah,Samuel Madden,Michael J. Franklin,Joseph M. Hellerstein 2001 Java Support for Data-Intensive Systems: Experiences Building the Telegraph Dataflow System. SIGMOD Record Research in Multi-Organizational Processes and Semantic Information Brokering at the LSDIS Lab. Amit P. Sheth,John A. Miller,Krys Kochut,Ismailcem Budak Arpinar 2001 Research in Multi-Organizational Processes and Semantic Information Brokering at the LSDIS Lab. SIGMOD Record Data Mining-based Intrusion Detectors: An Overview of the Columbia IDS Project. Salvatore J. Stolfo,Wenke Lee,Philip K. Chan,Wei Fan,Eleazar Eskin 2001 Data Mining-based Intrusion Detectors: An Overview of the Columbia IDS Project. SIGMOD Record "Chair's Message." Richard T. Snodgrass 2001 "Chair's Message." SIGMOD Record Accessibility of the Database Literature. Richard T. Snodgrass 2001 Accessibility of the Database Literature. SIGMOD Record On Database Theory and XML. Dan Suciu 2001 Over the years, the connection between database theory and database practice has weakened. We argue here that the new challenges posed by XML and its applications are strengthening this connection today. We illustrate three examples of theoretical problems arising from XML applications, based on our own research. SIGMOD Record The Ecobase Project: Database and Web Technologies for Environmental Information Systems. Asterio K. Tanaka,Patrick Valduriez 2001 The Ecobase Project: Database and Web Technologies for Environmental Information Systems. SIGMOD Record Mining Email Content for Author Identification Forensics. Olivier Y. de Vel,Alison Anderson,Malcolm Corney,George M. Mohay 2001 Mining Email Content for Author Identification Forensics. SIGMOD Record Jeffrey D. Ullman Speaks Out on the Future of Higher Education, Startups, Database Theory, and More. Marianne Winslett 2001 Jeffrey D. Ullman Speaks Out on the Future of Higher Education, Startups, Database Theory, and More. SIGMOD Record Gio Wiederhold Speaks Out on Moving into Academia in Mid-Career, How to Be an Effective Consultant, Why You Should Be a Program Manager at a Funding Agency, the Need for Ontology Algebra and Simulations, and More. Marianne Winslett 2001 Gio Wiederhold Speaks Out on Moving into Academia in Mid-Career, How to Be an Effective Consultant, Why You Should Be a Program Manager at a Funding Agency, the Need for Ontology Algebra and Simulations, and More. ICDE P2P Information Systems. Karl Aberer,Manfred Hauswirth 2002 P2P Information Systems. ICDE An Intuitive Framework for Understanding Changes in Evolving Data Streams. Charu C. Aggarwal 2002 An Intuitive Framework for Understanding Changes in Evolving Data Streams. ICDE Towards Meaningful High-Dimensional Nearest Neighbor Search by Human-Computer Interaction. Charu C. Aggarwal 2002 Towards Meaningful High-Dimensional Nearest Neighbor Search by Human-Computer Interaction. ICDE DBXplorer: A System for Keyword-Based Search over Relational Databases. 
Sanjay Agrawal,Surajit Chaudhuri,Gautam Das 2002 DBXplorer: A System for Keyword-Based Search over Relational Databases. ICDE A Distributed Database Server for Continuous Media. Walid G. Aref,Ann Christine Catlin,Ahmed K. Elmagarmid,Jianping Fan,J. Guo,Moustafa A. Hammad,Ihab F. Ilyas,Mirette S. Marzouk,Sunil Prabhakar,Abdelmounaam Rezgui,S. Teoh,Evimaria Terzi,Yi-Cheng Tu,Athena Vakali,Xingquan Zhu 2002 A Distributed Database Server for Continuous Media. ICDE Efficient OLAP Query Processing in Distributed Data Warehouse. Michael O. Akinde,Michael H. Böhlen,Theodore Johnson,Laks V. S. Lakshmanan,Divesh Srivastava 2002 Efficient OLAP Query Processing in Distributed Data Warehouse. ICDE Structural Joins: A Primitive for Efficient XML Query Pattern Matching. Shurug Al-Khalifa,H. V. Jagadish,Jignesh M. Patel,Yuqing Wu,Nick Koudas,Divesh Srivastava 2002 Structural Joins: A Primitive for Efficient XML Query Pattern Matching. ICDE Recovery Guarantees for General Multi-Tier Applications. Roger S. Barga,David B. Lomet,Gerhard Weikum 2002 Recovery Guarantees for General Multi-Tier Applications. ICDE From XML Schema to Relations: A Cost-Based Approach to XML Storage. Philip Bohannon,Juliana Freire,Prasan Roy,Jérôme Siméon 2002 From XML Schema to Relations: A Cost-Based Approach to XML Storage. ICDE Active XQuery. Angela Bonifati,Daniele Braga,Alessandro Campi,Stefano Ceri 2002 Active XQuery. ICDE Declarative Composition and Peer-to-Peer Provisioning of Dynamic Web Services. Boualem Benatallah,Quan Z. Sheng,Anne H. H. Ngu,Marlon Dumas 2002 Declarative Composition and Peer-to-Peer Provisioning of Dynamic Web Services. ICDE Keyword Searching and Browsing in Databases using BANKS. Gaurav Bhalotia,Arvind Hulgeri,Charuta Nakhe,Soumen Chakrabarti,S. Sudarshan 2002 Keyword Searching and Browsing in Databases using BANKS. ICDE Improving Range Query Estimation on Histograms. Francesco Buccafurri,Domenico Rosaci,Luigi Pontieri,Domenico Saccà 2002 Improving Range Query Estimation on Histograms. ICDE Evaluating Top-k Queries over Web-Accessible Databases. Nicolas Bruno,Luis Gravano,Amélie Marian 2002 Evaluating Top-k Queries over Web-Accessible Databases. ICDE Efficient Evaluation of Queries with Mining Predicates. Surajit Chaudhuri,Vivek R. Narasayya,Sunita Sarawagi 2002 Efficient Evaluation of Queries with Mining Predicates. ICDE Demonstration: Active Asynchronous Transaction Management in High-Autonomy Federated Environment Using Data Agents: Global Change Master Directory v8.0. Omran A. Bukhres,Srinivasan Sikkupparbathyam,Kishan Nagendra,Zina Ben-Miled,Marcelo Areal,Lola Olsen,Chris Gokey,David Kendig,Rosy Cordova,Gene Major,Janine Savage 2002 Demonstration: Active Asynchronous Transaction Management in High-Autonomy Federated Environment Using Data Agents: Global Change Master Directory v8.0. ICDE Exploring Aggregate Effect with Weighted Transcoding Graphs for Efficient Cache Replacement in Transcoding Proxies. Cheng-Yue Chang,Ming-Syan Chen 2002 Exploring Aggregate Effect with Weighted Transcoding Graphs for Efficient Cache Replacement in Transcoding Proxies. ICDE Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries. Jianjun Chen,David J. DeWitt,Jeffrey F. Naughton 2002 Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries. ICDE FAST: A New Sampling-Based Algorithm for Discovering Association Rules. Bin Chen,Peter J. 
Haas,Peter Scheuermann 2002 FAST: A New Sampling-Based Algorithm for Discovering Association Rules. ICDE A Sampling-Based Estimator for Top-k Query. Chung-Min Chen,Yibei Ling 2002 A Sampling-Based Estimator for Top-k Query. ICDE Efficient Filtering of XML Documents with XPath Expressions. Chee Yong Chan,Pascal Felber,Minos N. Garofalakis,Rajeev Rastogi 2002 The publish/subscribe paradigm is a popular model for allowing publishers (i.e., data generators) to selectively disseminate data to a large number of widely dispersed subscribers (i.e., data consumers) who have registered their interest in specific information items. Early publish/subscribe systems have typically relied on simple subscription mechanisms, such as keyword or ”bag of words” matching, or simple comparison predicates on attribute values. The emergence of XML as a standard for information exchange on the Internet has led to an increased interest in using more expressive subscription mechanisms (e.g., based on XPath expressions) that exploit both the structure and the content of published XML documents. Given the increased complexity of these new data-filtering mechanisms, the problem of effectively identifying the subscription profiles that match an incoming XML document poses a difficult and important research challenge. In this paper, we propose a novel index structure, termed XTrie, that supports the efficient filtering of XML documents based on XPath expressions. Our XTrie index structure offers several novel features that, we believe, make it especially attractive for large-scale publish/subscribe systems. First, XTrie is designed to support effective filtering based on complex XPath expressions (as opposed to simple, single-path specifications). Second, our XTrie structure and algorithms are designed to support both ordered and unordered matching of XML data. Third, by indexing on sequences of elements organized in a trie structure and using a sophisticated matching algorithm, XTrie is able to both reduce the number of unnecessary index probes as well as avoid redundant matchings, thereby providing extremely efficient filtering. Our experimental results over a wide range of XML document and XPath expression workloads demonstrate that our XTrie index structure outperforms earlier approaches by wide margins. ICDE A Fast Regular Expression Indexing Engine. Junghoo Cho,Sridhar Rajagopalan 2002 A Fast Regular Expression Indexing Engine. ICDE NAPA: Nearest Available Parking Lot Application. Hae Don Chon,Divyakant Agrawal,Amr El Abbadi 2002 NAPA: Nearest Available Parking Lot Application. ICDE Reverse Engineering for Web Data: From Visual to Semantic Structure. Christina Yip Chung,Michael Gertz,Neel Sundaresan 2002 Reverse Engineering for Web Data: From Visual to Semantic Structure. ICDE Detecting Changes in XML Documents. Gregory Cobena,Serge Abiteboul,Amélie Marian 2002 Detecting Changes in XML Documents. ICDE Fast Mining of Massive Tabular Data via Approximate Distance Computations. Graham Cormode,Piotr Indyk,Nick Koudas,S. Muthukrishnan 2002 Fast Mining of Massive Tabular Data via Approximate Distance Computations. ICDE HP-Inventing the Future of Storage. Nora Denzel 2002 HP-Inventing the Future of Storage. ICDE Decoupled Query Optimization for Federated Database Systems. Amol Deshpande,Joseph M. Hellerstein 2002 Decoupled Query Optimization for Federated Database Systems. ICDE YFilter: Efficient and Scalable Filtering of XML Documents. Yanlei Diao,Peter M. Fischer,Michael J. 
Franklin,Raymond To 2002 YFilter: Efficient and Scalable Filtering of XML Documents. ICDE Efficiently Ordering Query Plans for Data Integration. AnHai Doan,Alon Y. Halevy 2002 Efficiently Ordering Query Plans for Data Integration. ICDE Sequenced Subset Operators: Definition and Implementation. Joseph Dunn,Sean Davey,Anne Descour,Richard T. Snodgrass 2002 Sequenced Subset Operators: Definition and Implementation. ICDE TAILOR: A Record Linkage Tool Box. Mohamed G. Elfeky,Ahmed K. Elmagarmid,Vassilios S. Verykios 2002 TAILOR: A Record Linkage Tool Box. ICDE GADT: A Probability Space ADT for Representing and Querying the Physical World. Anton Faradjian,Johannes Gehrke,Philippe Bonnet 2002 GADT: A Probability Space ADT for Representing and Querying the Physical World. ICDE Techniques for Storing XML. Mary F. Fernández,Sihem Amer-Yahia 2002 Techniques for Storing XML. ICDE A Graphical XML Query Language. Sergio Flesca,Filippo Furfaro,Sergio Greco 2002 A Graphical XML Query Language. ICDE Geometric-Similarity Retrieval in Large Image Bases. Ioannis Fudos,Leonidas Palios,Evaggelia Pitoura 2002 Geometric-Similarity Retrieval in Large Image Bases. ICDE An Authorization System for Temporal Data. Avigdor Gal,Vijayalakshmi Atluri,Gang Xu 2002 An Authorization System for Temporal Data. ICDE Peer-to-Peer Data Management. Hector Garcia-Molina 2002 Peer-to-Peer Data Management. ICDE SCADDAR: An Efficient Randomized Technique to Reorganize Continuous Media Blocks. Ashish Goel,Cyrus Shahabi,Shu-Yuen Didi Yao,Roger Zimmermann 2002 SCADDAR: An Efficient Randomized Technique to Reorganize Continuous Media Blocks. ICDE Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. Sudipto Guha,Nick Koudas 2002 Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE Providing Database as a Service. Hakan Hacigümüs,Sharad Mehrotra,Balakrishna R. Iyer 2002 Providing Database as a Service. ICDE Mapping XML and Relational Schemas with Clio. Mauricio A. Hernández,Lucian Popa,Yannis Velegrakis,Renée J. Miller,Felix Naumann,Ching-Tien Ho 2002 Mapping XML and Relational Schemas with Clio. ICDE Efficient Algorithm for Projected Clustering. Eric Ka Ka Ng,Ada Wai-Chee Fu 2002 Efficient Algorithm for Projected Clustering. ICDE Lossy Reduction for Very High Dimensional Data. Chris Jermaine,Edward Omiecinski 2002 Lossy Reduction for Very High Dimensional Data. ICDE Out From Under the Trees. Chris Jermaine,Edward Omiecinski,Wai Gen Yee 2002 Out From Under the Trees. ICDE Exploiting Local Similarity for Indexing Paths in Graph-Structured Data. Raghav Kaushik,Pradeep Shenoy,Philip Bohannon,Ehud Gudes 2002 Exploiting Local Similarity for Indexing Paths in Graph-Structured Data. ICDE A Publish & Subscribe Architecture for Distributed Metadata Management. Markus Keidl,Alexander Kreutz,Alfons Kemper,Donald Kossmann 2002 A Publish & Subscribe Architecture for Distributed Metadata Management. ICDE XParent: An Efficient RDBMS-Based XML Database System. Haifeng Jiang,Hongjun Lu,Wei Wang,Jeffrey Xu Yu 2002 XParent: An Efficient RDBMS-Based XML Database System. ICDE OntoWebber: A Novel Approach for Managing Data on the Web. Yuhui Jin,Sichun Xu,Stefan Decker,Gio Wiederhold 2002 OntoWebber: A Novel Approach for Managing Data on the Web. ICDE Efficient Indexing Structures for Mining Frequent Patterns. Bin Lan,Beng Chin Ooi,Kian-Lee Tan 2002 Efficient Indexing Structures for Mining Frequent Patterns. ICDE Data Reduction by Partial Preaggregation. 
Per-Åke Larson 2002 Data Reduction by Partial Preaggregation. ICDE Using Unity to Semi-Automatically Integrate Relational Schema. Ramon Lawrence,Ken Barker 2002 Using Unity to Semi-Automatically Integrate Relational Schema. ICDE NeT & CoT: Inferring XML Schemas from Relational World. Dongwon Lee,Murali Mani,Frank Chiu,Wesley W. Chu 2002 NeT & CoT: Inferring XML Schemas from Relational World. ICDE Processing Reporting Function Views in a Data Warehouse Environment. Wolfgang Lehner,Wolfgang Hümmer,Lutz Schlesinger 2002 Processing Reporting Function Views in a Data Warehouse Environment. ICDE OSSM: A Segmentation Approach to Optimize Frequency Counting. Carson Kai-Sang Leung,Raymond T. Ng,Heikki Mannila 2002 OSSM: A Segmentation Approach to Optimize Frequency Counting. ICDE Multivariate Time Series Prediction via Temporal Classification. Bing Liu,Jing Liu 2002 Multivariate Time Series Prediction via Temporal Classification. ICDE Data Cleaning and XML: The DBLP Experience. Wai Lup Low,Wee Hyong Tok,Mong-Li Lee,Tok Wang Ling 2002 Data Cleaning and XML: The DBLP Experience. ICDE A Non-Blocking Parallel Spatial Join Algorithm. Gang Luo,Jeffrey F. Naughton,Curt J. Ellmann 2002 A Non-Blocking Parallel Spatial Join Algorithm. ICDE Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. Samuel Madden,Michael J. Franklin 2002 Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. ICDE Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. Sergey Melnik,Hector Garcia-Molina,Erhard Rahm 2002 Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. ICDE SG-WRAP: A Schema-Guided Wrapper Generator. Xiaofeng Meng,Hongjun Lu,Haiyan Wang,Mingzhe Gu 2002 SG-WRAP: A Schema-Guided Wrapper Generator. ICDE Streaming-Data Algorithms for High-Quality Clustering. "Liadan O'Callaghan,Adam Meyerson,Rajeev Motwani,Nina Mishra,Sudipto Guha" 2002 Streaming-Data Algorithms for High-Quality Clustering. ICDE Multiple Query Optimization by Cache-Aware Middleware Using Query Teamwork. "Kevin O'Gorman,Divyakant Agrawal,Amr El Abbadi" 2002 Multiple Query Optimization by Cache-Aware Middleware Using Query Teamwork. ICDE Mixing Querying and Navigation in MIX. Pratik Mukhopadhyay,Yannis Papakonstantinou 2002 Mixing Querying and Navigation in MIX. ICDE Runtime Data Declustering over SAN-Connected PC Cluster System. Masato Oguchi,Masaru Kitsuregawa 2002 Runtime Data Declustering over SAN-Connected PC Cluster System. ICDE Using Smodels (Declarative Logic Programming) to Verify Correctness of Certain Active Rules. Mutsumi Nakamura,Ramez Elmasri 2002 Using Smodels (Declarative Logic Programming) to Verify Correctness of Certain Active Rules. ICDE "Advanced Process-Based Component Integration in Telcordia's Cable OSS." Anne H. H. Ngu,Dimitrios Georgakopoulos,Donald Baker,Andrzej Cichocki,Joseph Desmarais,Peter Bates 2002 "Advanced Process-Based Component Integration in Telcordia's Cable OSS." ICDE Attribute Classification Using Feature Analysis. Felix Naumann,Ching-Tien Ho,Xuqing Tian,Laura M. Haas,Nimrod Megiddo 2002 Attribute Classification Using Feature Analysis. ICDE BestPeer: A Self-Configurable Peer-to-Peer System. Wee Siong Ng,Beng Chin Ooi,Kian-Lee Tan 2002 BestPeer: A Self-Configurable Peer-to-Peer System. ICDE Indexing Spatio-Temporal Data Warehouses. Dimitris Papadias,Yufei Tao,Panos Kalnis,Jun Zhang 2002 Indexing Spatio-Temporal Data Warehouses. 
ICDE Content-Based Video Indexing for the Support of Digital Library Search. Milan Petkovic,Roelof van Zwol,Henk Ernst Blok,Willem Jonker,Peter M. G. Apers,Menzo Windhouwer,Martin L. Kersten 2002 Content-Based Video Indexing for the Support of Digital Library Search. ICDE Similarity Search Over Time-Series Data Using Wavelets. Ivan Popivanov,Renée J. Miller 2002 Similarity Search Over Time-Series Data Using Wavelets. ICDE How Good Are Association-Rule Mining Algorithms? Vikram Pudi,Jayant R. Haritsa 2002 How Good Are Association-Rule Mining Algorithms? ICDE Indexing of Moving Objects for Location-Based Services. Simonas Saltenis,Christian S. Jensen 2002 Indexing of Moving Objects for Location-Based Services. ICDE Managing Complex and Varied Data with the IndexFabric(tm). Neal Sample,Brian F. Cooper,Michael J. Franklin,Gísli R. Hjaltason,Moshe Shadmon,Levy Cohen 2002 Managing Complex and Varied Data with the IndexFabric(tm). ICDE Integrating Workflow Management Systems with Business-to-Business Interaction Standards. Mehmet Sayal,Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2002 Integrating Workflow Management Systems with Business-to-Business Interaction Standards. ICDE Extensible and Similarity-Based Grouping for Data Integration. Eike Schallehn,Kai-Uwe Sattler,Gunter Saake 2002 Extensible and Similarity-Based Grouping for Data Integration. ICDE Design and Implementation of a High-Performance Distributed Web Crawler. Vladislav Shkapenyuk,Torsten Suel 2002 Design and Implementation of a High-Performance Distributed Web Crawler. ICDE An Efficient Index Structure for Shift and Scale Invariant Search of Multi-Attribute Time Sequences. Tamer Kahveci,Ambuj K. Singh,Aliekber Gürel 2002 An Efficient Index Structure for Shift and Scale Invariant Search of Multi-Attribute Time Sequences. ICDE The BINGO! Focused Crawler: From Bookmarks to Archetypes. Sergej Sizov,Stefan Siersdorfer,Martin Theobald,Gerhard Weikum 2002 The BINGO! Focused Crawler: From Bookmarks to Archetypes. ICDE The Evolution of eBusiness Integration-from Data to Process. Dale Skeen 2002 The Evolution of eBusiness Integration-from Data to Process. ICDE Specification-Based Data Reduction in Dimensional Data Warehouses. Janne Skyt,Christian S. Jensen,Torben Bach Pedersen 2002 "Many data warehouses contain massive amounts of data, accumulated over long periods of time. In some cases, it is necessary or desirable to either delete ''old'' data or to maintain the data at an aggregate level. This may be due to privacy concerns, in which case the data are aggregated to levels that ensure anonymity. Another reason is the desire to maintain a balance between the uses of data that change as the data age and the size of the data, thus avoiding overly large data warehouses. This paper presents effective techniques for data reduction that enable the gradual aggregation of detailed data as the data ages. With these techniques, data may be aggregated to higher levels as they age, enabling the maintenance of more compact, consolidated data and the compliance with privacy requirements. Special care is taken to avoid semantic problems in the aggregation process. The paper also describes the querying of the resulting data warehouses and an implementation strategy based on current database technology." ICDE StreamCorder: Fast Trial-and-Error Analysis in Scientific Databases. Etzard Stolte,Gustavo Alonso 2002 StreamCorder: Fast Trial-and-Error Analysis in Scientific Databases. ICDE Exploring Spatial Datasets with Histograms. 
Chengyu Sun,Divyakant Agrawal,Amr El Abbadi 2002 Exploring Spatial Datasets with Histograms. ICDE Cost Models for Overlapping and Multi-Version B-Trees. Yufei Tao,Dimitris Papadias,Jun Zhang 2002 Cost Models for Overlapping and Multi-Version B-Trees. ICDE Predator-Miner: Ad hoc Mining of Association Rules within a Database Management System. Wee Hyong Tok,Twee-Hee Ong,Wai Lup Low,Indriyati Atmosukarto,Stéphane Bressan 2002 Predator-Miner: Ad hoc Mining of Association Rules within a Database Management System. ICDE XGRIND: A Query-Friendly XML Compressor. Pankaj M. Tolani,Jayant R. Haritsa 2002 XGRIND: A Query-Friendly XML Compressor. ICDE Exploiting Punctuation Semantics in Data Streams. Peter A. Tucker,David Maier 2002 Exploiting Punctuation Semantics in Data Streams. ICDE Discovering Similar Multidimensional Trajectories. Michail Vlachos,Dimitrios Gunopulos,George Kollios 2002 We investigate techniques for analysis and retrieval of object trajectories in a two or three dimensional space. Examples include features extracted from video clips, animal mobility experiments, sign language recognition, mobile phone usage and so on. Such data usually contain a great amount of noise that degrades the performance of previously used metrics. Therefore, here we formalize non-metric similarity functions based on the Longest Common Subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translating of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach. ICDE Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic. Mengzhi Wang,Ngai Hang Chan,Spiros Papadimitriou,Christos Faloutsos,Tara M. Madhyastha 2002 Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic. ICDE Database Replication for the Mobile Era. Antoni Wolski 2002 Database Replication for the Mobile Era. ICDE Condensed Cube: An Efficient Approach to Reducing Data Cube Size. Wei Wang,Hongjun Lu,Jianlin Feng,Jeffrey Xu Yu 2002 Condensed Cube: An Efficient Approach to Reducing Data Cube Size. ICDE Query Estimation by Adaptive Sampling. Yi-Leh Wu,Divyakant Agrawal,Amr El Abbadi 2002 Query Estimation by Adaptive Sampling. ICDE A Framework Towards Efficient and Effective Sequence Clustering. Wei Wang,Jiong Yang 2002 A Framework Towards Efficient and Effective Sequence Clustering. ICDE The ATLaS System and Its Powerful Database Language Based on Simple Extensions of SQL. Haixun Wang,Carlo Zaniolo 2002 The ATLaS System and Its Powerful Database Language Based on Simple Extensions of SQL. ICDE Efficient Temporal Join Processing Using Indices. Donghui Zhang,Vassilis J. Tsotras,Bernhard Seeger 2002 Efficient Temporal Join Processing Using Indices. ICDE delta-Clusters: Capturing Subspace Correlation in a Large Data Set. Jiong Yang,Wei Wang,Haixun Wang,Philip S. 
Yu 2002 delta-Clusters: Capturing Subspace Correlation in a Large Data Set. SIGMOD Conference Web caching for database applications with Oracle Web Cache. Jesse Anton,Lawrence Jacobs,Xiang Liu,Jordan Parker,Zheng Zeng,Tie Zhong 2002 We discuss several important issues specific to Web caching for content dynamically generated from database applications. We present the techniques employed by Oracle Web Cache to address these issues. They include: content disambiguation based on information in addition to the URL, transparent session management, partial-page caching for personalization, and broad-scope invalidation with performance assurance heuristics. SIGMOD Conference Hierarchical subspace sampling: a unified framework for high dimensional data reduction, selectivity estimation and nearest neighbor search. Charu C. Aggarwal 2002 With the increased abilities for automated data collection made possible by modern technology, the typical sizes of data collections have continued to grow in recent years. In such cases, it may be desirable to store the data in a reduced format in order to improve the storage, transfer time, and processing requirements on the data. One of the challenges of designing effective data compression techniques is to be able to preserve the ability to use the reduced format directly for a wide range of database and data mining applications. In this paper, we propose the novel idea of hierarchical subspace sampling in order to create a reduced representation of the data. The method is naturally able to estimate the local implicit dimensionalities of each point very effectively, and thereby create a variable dimensionality reduced representation of the data. Such a technique has the advantage that it is very adaptive about adjusting its representation depending upon the behavior of the immediate locality of a data point. An interesting property of the subspace sampling technique is that unlike all other data reduction techniques, the overall efficiency of compression improves with increasing database size. This is a highly desirable property for any data reduction system since the problem itself is motivated by the large size of data sets. Because of its sampling approach, the procedure is extremely fast and scales linearly both with data set size and dimensionality. Furthermore, the subspace sampling technique is able to reveal important local subspace characteristics of high dimensional data which can be harnessed for effective solutions to problems such as selectivity estimation and approximate nearest neighbor search. SIGMOD Conference Visual COKO: a debugger for query optimizer development. Daniel J. Abadi,Mitch Cherniack 2002 "Query optimization generates plans to retrieve data requested by queries. Query rewriting, which is the first step of this process, rewrites a query expression into an equivalent form to prepare it for plan generation. COKO-KOLA introduced a new approach to query rewriting that enables query rewrites to be formally verified using an automated theorem prover [1]. KOLA is a language for expressing term rewriting rules that can be ""fired"" on query expressions. COKO is a language for expressing query rewriting transformations that are too complex to express with simple KOLA rules [2].COKO is a programming language designed for query optimizer development. Programming languages require debuggers, and in this demonstration, we illustrate our COKO debugger: Visual COKO. 
Visual COKO enables a query optimization developer to visually trace the execution of a COKO transformation. At every step of the transformation, the developer can view a tree-display that illustrates how the original query expression has evolved." SIGMOD Conference DBXplorer: enabling keyword search over relational databases. Sanjay Agrawal,Surajit Chaudhuri,Gautam Das 2002 DBXplorer: enabling keyword search over relational databases. SIGMOD Conference DBCache: database caching for web application servers. Mehmet Altinel,Qiong Luo,Sailesh Krishnamurthy,C. Mohan,Hamid Pirahesh,Bruce G. Lindsay,Honguk Woo,Larry Brown 2002 "Many e-Business applications today are being developed and deployed on multi-tier environments involving browser-based clients, web application servers and backend databases. The dynamic nature of these applications necessitates generating web pages on-demand, making middle-tier database caching an effective approach to achieve high scalability and performance [3]. In the DBCache project, we are incorporating a database cache feature in DB2 UDB by modifying the engine code and leveraging existing federated database functionality. This allows us to take advantage of DB2's sophisticated distributed query processing power for database caching. As a result, the user queries can be executed at either the local database cache or the remote backend server, or more importantly, the query can be partitioned and then distributed to both databases for cost optimum execution.DBCache also includes a cache initialization component that takes a backend database schema and SQL queries in the workload, and generates a middle-tier database schema for the cache. We have implemented an initial prototype of the system that supports table level caching. As DB2's functionality is extended, we will be able to support subtable level caching, XML data caching and caching of execution results of web services." SIGMOD Conference Coordinating backup/recovery and data consistency between database and file systems. Suparna Bhattacharya,C. Mohan,Karen Brannon,Inderpal Narang,Hui-I Hsiao,Mahadevan Subramanian 2002 "Managing a combined store consisting of database data and file data in a robust and consistent manner is a challenge for database systems and content management systems. In such a hybrid system, images, videos, engineering drawings, etc. are stored as files on a file server while meta-data referencing/indexing such files is created and stored in a relational database to take advantage of efficient search. In this paper we describe solutions for two potentially problematic aspects of such a data management system: backup/recovery and data consistency. We present algorithms for performing backup and recovery of the DBMS data in a coordinated fashion with the files on the file servers. Our algorithms for coordinated backup and recovery have been implemented in the IBM DB2/DataLinks product [1]. We also propose an efficient solution to the problem of maintaining consistency between the content of a file and the associated meta-data stored in the DBMS from a reader's point of view without holding long duration locks on meta-data tables. In the model, an object is directly accessed and edited in-place through normal file system APIs using a reference obtained via an SQL Query on the database. To relate file modifications to meta-data updates, the user issues an update through the DBMS, and commits both file and meta-data updates together." 
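As a side illustration of the DBCache entry above (middle-tier database caching), the following minimal sketch shows the kind of routing decision it describes: run a query on the local cache only if every table it references is cached, otherwise send it to the backend. The table names, the regex-based table extraction, and the two-way decision are assumptions made for this toy example; DBCache itself can also split a single query across both the cache and the backend, which this sketch ignores.

```python
# Hypothetical middle-tier routing sketch in the spirit of the DBCache entry above.
import re
from typing import Iterable, Set

CACHED_TABLES: Set[str] = {"customer", "product"}  # assumed cache contents

def referenced_tables(sql: str) -> Set[str]:
    """Crude extraction of table names following FROM/JOIN (toy parser only)."""
    return {t.lower() for t in re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", sql, re.I)}

def route(sql: str, cached: Iterable[str] = CACHED_TABLES) -> str:
    """Return 'cache' if the whole query can run on the local cache, else 'backend'."""
    tables = referenced_tables(sql)
    return "cache" if tables and tables <= set(cached) else "backend"

if __name__ == "__main__":
    print(route("SELECT * FROM customer c JOIN product p ON c.id = p.cid"))  # -> cache
    print(route("SELECT * FROM orders WHERE total > 100"))                   # -> backend
```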
SIGMOD Conference ToXgene: a template-based data generator for XML. Denilson Barbosa,Alberto O. Mendelzon,John Keenleyside,Kelly A. Lyons 2002 ToXgene: a template-based data generator for XML. SIGMOD Conference Going public: open-source databases and database research. Philippe Bonnet 2002 "There are a number of database systems available free of charge for the research community, with complete access to the source code. Some of these systems result from completed research projects, others have been developed outside the research community. How can the database community best take advantage of these publically available systems? The most widely used open-source database is MySQL. Their objective is to become the 'best and most used database in the world'. Can they do it without the database research community?" SIGMOD Conference Exploiting statistics on query expressions for optimization. Nicolas Bruno,Surajit Chaudhuri 2002 Statistics play an important role in influencing the plans produced by a query optimizer. Traditionally, optimizers use statistics built over base tables and assume independence between attributes while propagating statistical information through the query plan. This approach can introduce large estimation errors, which may result in the optimizer choosing inefficient execution plans. In this paper, we show how to extend a generic optimizer so that it also exploits statistics built on expressions corresponding to intermediate nodes of query plans. We show that in some cases, the quality of the resulting plans is significantly better than when only base-table statistics are available. Unfortunately, even moderately-sized schemas may have too many relevant candidate statistics. We introduce a workload-driven technique to identify a small subset of statistics that can provide significant benefits over just maintaining base-table statistics. Finally, we present experimental results on an implementation of our approach in Microsoft SQL Server 2000. SIGMOD Conference Holistic twig joins: optimal XML pattern matching. Nicolas Bruno,Nick Koudas,Divesh Srivastava 2002 XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable.In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. 
We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns. SIGMOD Conference A compact B-tree. Peter Bumbulis,Ivan T. Bowman 2002 In this paper we describe a Patricia tree-based B-tree variant suitable for OLTP. In this variant, each page of the B-tree contains a local Patricia tree instead of the usual sorted array of keys. It has been implemented in iAnywhere ASA Version 8.0. Preliminary experience has shown that these indexes can provide significant space and performance benefits over existing ASA indexes. SIGMOD Conference Archiving scientific data. Peter Buneman,Sanjeev Khanna,Keishi Tajima,Wang Chiew Tan 2002 Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps. All versions of the data are merged into one hierarchy where an element appearing in multiple versions is stored only once along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions, and the archive allows us to easily answer certain temporal queries such as retrieval of any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead when contrasted with diff approaches. Another useful property of our approach is that we use XML format to represent hierarchical data and the resulting archive is also in XML. Hence, XML tools can be directly applied on our archive. In particular, we apply an XML compressor on our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency. We also show how we can extend our archiving tool to an external memory archiver for higher scalability and describe various index structures that can further improve the efficiency of some temporal queries on our archive. SIGMOD Conference Software as a service: ASP and ASP aggregation. Christoph Bussler 2002 "The tutorial ""Software as a Service: ASP and ASP aggregation"" will give an introduction and overview of the concept of ""renting"" access to software to customers (subscribers). Application service providers (ASPs) are enterprises hosting one or more applications and provide access to subscribers over the Internet by means of browser technology. Furthermore, the underlying technologies are discussed to enable application hosting. 
The concept of ASP aggregation is introduced to provide a single access point and a single sign-on capability to subscribers subscribing to more than one hosted application in more than one ASP." SIGMOD Conference Minimal probing: supporting expensive predicates for top-k queries. Kevin Chen-Chuan Chang,Seung-won Hwang 2002 "This paper addresses the problem of evaluating ranked top-k queries with expensive predicates. As major DBMSs now all support expensive user-defined predicates for Boolean queries, we believe such support for ranked queries will be even more important: First, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. Second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. Third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. These predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. The current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. To minimize expensive probes, we thus develop the formal principle of ""necessary probes,"" which determines if a probe is absolutely required. We then propose Algorithm MPro which, by implementing the principle, is provably optimal with minimal probe cost. Further, we show that MPro can scale well and can be easily parallelized. Our experiments using both a real-estate benchmark database and synthetic datasets show that MPro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing." SIGMOD Conference Compressing SQL workloads. Surajit Chaudhuri,Ashish Kumar Gupta,Vivek R. Narasayya 2002 Recently several important relational database tasks such as index selection, histogram tuning, approximate query processing, and statistics selection have recognized the importance of leveraging workloads. Often these tasks are presented with large workloads, i.e., a set of SQL DML statements, as input. A key factor affecting the scalability of such tasks is the size of the workload. In this paper, we present the novel problem of workload compression which helps improve the scalability of such tasks. We present a principled solution to this challenging problem. Our solution is broadly applicable to a variety of workload-driven tasks, while allowing for incorporation of task specific knowledge. We have implemented this solution and our experiments illustrate its effectiveness in the context of two workload-driven tasks: index selection and approximate query processing. SIGMOD Conference Fractal prefetching B+-Trees: optimizing both cache and disk performance. Shimin Chen,Phillip B. Gibbons,Todd C. Mowry,Gary Valentin 2002 Fractal prefetching B+-Trees: optimizing both cache and disk performance. SIGMOD Conference XCache: a semantic caching system for XML queries. Li Chen,Elke A. Rundensteiner,Song Wang 2002 "A wide range of Web applications retrieve desired information from remote XML data sources across the Internet, which is usually costly due to transmission delays for large volumes of data. 
Therefore we propose to apply the ideas of semantic caching to XML query processing systems [2], in particular the XQuery engine. Semantic caching [3] implies view-based query answering and cache management. While it is well studied in the traditional database context, query containment for XQuery is left unexplored due to its complexity coming with the powerful expressiveness of hierarchy, recursion and result construction. We hence have developed the first solution for XQuery processing using cached views. We exploit the connections between XML and tree automata, and use subtype relations between two regular expression types to tackle the XQuery containment mapping problem. Inspired by XDuce [1], which explores the use of tree-automata-based regular expression types for XML processing, we have designed a containment mapping process to incorporate type inference and subtyping mechanisms provided by XDuce to establish containment mappings between regular-expression-type-based pattern variables of two queries. We have implemented a semantic caching system called XCache (see Figure 1), to realize the proposed containment and rewriting techniques for XQuery. The main modules of XCache include: (1) Query Decomposer. An input query is decomposed into source-specific subqueries explicitly represented by matching patterns and return structures. (2) Query Pattern Register. By registering a few queries into semantic regions, we warm up XCache at its initialization phase. (3) Query Containment Mapper. The XDuce subtyper is incorporated into the containment mapper for establishing query containment mappings between variables of a new query and each cached query. (4) Query Rewriter. We implement the classical bucket algorithm and further apply heuristics to decide on an ""optimal"" rewriting plan if several valid ones exist. (5) Replacement Manager. We free space for new regions by both complete and partial replacement. (6) Region Coalescer. We apply a coalescing strategy to control the region granularity over time." SIGMOD Conference Selectivity estimation for spatio-temporal queries to moving objects. Yong-Jin Choi,Chin-Wan Chung 2002 A query optimizer requires selectivity estimation of a query to choose the most efficient access plan. An effective method of selectivity estimation for the future locations of moving objects has not yet been proposed. Existing methods for spatial selectivity estimation do not accurately estimate the selectivity of a query to moving objects, because they do not consider the future locations of moving objects, which change continuously as time passes. In this paper, we propose an effective method for spatio-temporal selectivity estimation to solve this problem. We present analytical formulas which accurately calculate the selectivity of a spatio-temporal query as a function of spatio-temporal information. Extensive experimental results show that our proposed method accurately estimates the selectivity over various queries to spatio-temporal data combining real-life spatial data and synthetic temporal data. When Tiger/lines is used as real-life spatial data, the application of an existing method for spatial selectivity estimation to the estimation of the selectivity of a query to moving objects has the average error ratio from 14% to 85%, whereas our method for spatio-temporal selectivity estimation has the average error ratio from 9% to 23%. SIGMOD Conference APEX: an adaptive path index for XML data. 
Chin-Wan Chung,Jun-Ki Min,Kyuseok Shim 2002 "The emergence of the Web has increased interests in XML data. XML query languages such as XQuery and XPath use label paths to traverse the irregularly structured data. Without a structural summary and efficient indexes, query processing can be quite inefficient due to an exhaustive traversal on XML data. To overcome the inefficiency, several path indexes have been proposed in the research community. Traditional indexes generally record all label paths from the root element in XML data. Such path indexes may result in performance degradation due to large sizes and exhaustive navigations for partial matching path queries start with the self-or-descendent axis(""//"").In this paper, we propose APEX, an adaptive path index for XML data. APEX does not keep all paths starting from the root and utilizes frequently used paths to improve the query performance. APEX also has a nice property that it can be updated incrementally according to the changes of query workloads. Experimental results with synthetic and real-life data sets clearly confirm that APEX improves query processing cost typically 2 to 54 times better than the existing indexes, with the performance gap increasing with the irregularity of XML data." SIGMOD Conference Implementing XQuery. Paul Cotton 2002 Implementing XQuery. SIGMOD Conference Gigascope: high performance network monitoring with an SQL interface. Charles D. Cranor,Yuan Gao,Theodore Johnson,Vladislav Shkapenyuk,Oliver Spatscheck 2002 Operators of large networks and providers of network services need to monitor and analyze the network traffic flowing through their systems. Monitoring requirements range from the long term (e.g., monitoring link utilizations, computing traffic matrices) to the ad-hoc (e.g. detecting network intrusions, debugging performance problems). Many of the applications are complex (e.g., reconstruct TCP/IP sessions), query layer-7 data (find streaming media connections), operate over huge volumes of data (Gigabit and higher speed links), and have real-time reporting requirements (e.g., to raise performance or intrusion alerts).We have found that existing network monitoring technologies have severe limitations. One option is to use TCPdump to monitor a network port and a user-level application program to process the data. While this approach is very flexible, it is not fast enough to handle gigabit speeds on inexpensive equipment. Another approach is to use network monitoring devices. While these devices are capable of high speed monitoring, they are inflexible as the set of monitoring tasks is pre-defined. Adding new functionality is expensive and has long lead times. A similar approach is to use monitoring tools built into routers, such as SNMP, RMON, or NetFlow. These tools have similar characteristics --- fast but inflexible.A further problem with all of these tools is their lack of a query interface. The data from the monitors are dumped to a file or piped through a file stream without an association to the semantics of the data. The burden of managing and interpreting the data is left to the analyst. Due to the volume and complexity of the data, the burden can be severe. These problems make developing new applications needlessly slow and difficult. Also, many mistakes are made leading to incorrect analyses. SIGMOD Conference RoadRunner: automatic data extraction from data-intensive web sites. 
Valter Crescenzi,Giansalvatore Mecca,Paolo Merialdo 2002 RoadRunner: automatic data extraction from data-intensive web sites. SIGMOD Conference Mining database structure; or, how to build a data quality browser. Tamraparni Dasu,Theodore Johnson,S. Muthukrishnan,Vladislav Shkapenyuk 2002 Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the ability of the analyst to make data-driven discoveries, most of the time spent in performing an analysis is spent in data identification, gathering, cleaning and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but assume that the structure of the data sources is well understood. However the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used to e.g., prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity. SIGMOD Conference Proxy-based acceleration of dynamically generated content on the world wide web: an approach and implementation. Anindya Datta,Kaushik Dutta,Helen M. Thomas,Debra E. VanderMeer,Suresha,Krithi Ramamritham 2002 As Internet traffic continues to grow and web sites become increasingly complex, performance and scalability are major issues for web sites. Web sites are increasingly relying on dynamic content generation applications to provide web site visitors with dynamic, interactive, and personalized experiences. However, dynamic content generation comes at a cost --- each request requires computation as well as communication across multiple components.To address these issues, various dynamic content caching approaches have been proposed. Proxy-based caching approaches store content at various locations outside the site infrastructure and can improve Web site performance by reducing content generation delays, firewall processing delays, and bandwidth requirements. However, existing proxy-based caching approaches either (a) cache at the page level, which does not guarantee that correct pages are served and provides very limited reusability, or (b) cache at the fragment level, which requires the use of pre-defined page layouts. To address these issues, several back end caching approaches have been proposed, including query result caching and fragment level caching. While back end approaches guarantee the correctness of results and offer the advantages of fine-grained caching, they neither address firewall delays nor reduce bandwidth requirements.In this paper, we present an approach and an implementation of a dynamic proxy caching technique which combines the benefits of both proxy-based and back end caching approaches, yet does not suffer from their above-mentioned limitations. 
Our dynamic proxy caching technique allows granular, proxy-based caching where both the content and layout can be dynamic. Our analysis of the performance of our approach indicates that it is capable of providing significant reductions in bandwidth. We have also deployed our proposed dynamic proxy caching technique at a major financial institution. The results of this implementation indicate that our technique is capable of providing order-of-magnitude reductions in bandwidth and response times in real-world dynamic Web applications. SIGMOD Conference Processing complex aggregate queries over data streams. Alin Dobra,Minos N. Garofalakis,Johannes Gehrke,Rajeev Rastogi 2002 "Recent years have witnessed an increasing interest in designing algorithms for querying and analyzing streaming data (i.e., data that is seen only once in a fixed order) with only limited memory. Providing (perhaps approximate) answers to queries over such continuous data streams is a crucial requirement for many application environments; examples include large telecom and IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we consider the problem of approximately answering general aggregate SQL queries over continuous data streams with limited memory. Our method relies on randomizing techniques that compute small ""sketch"" summaries of the streams that can then be used to provide approximate answers to aggregate queries with provable guarantees on the approximation error. We also demonstrate how existing statistical information on the base data (e.g., histograms) can be used in the proposed framework to improve the quality of the approximation provided by our algorithms. The key idea is to intelligently partition the domain of the underlying attribute(s) and, thus, decompose the sketching problem in a way that provably tightens our guarantees. Results of our experimental study with real-life as well as synthetic data streams indicate that sketches provide significantly more accurate answers compared to histograms for aggregate queries. This is especially true when our domain partitioning methods are employed to further boost the accuracy of the final estimates." SIGMOD Conference An ebXML infrastructure implementation through UDDI registries and RosettaNet PIPs. Asuman Dogac,Yusuf Tambag,Pinar Pembecioglu,Sait Pektas,Gokce Laleci,Gokhan Kurt,Serkan Toprak,Yildiray Kabak 2002 "Today's Internet based businesses need a level of interoperability which will allow trading partners to seamlessly and dynamically come together and do business without ad hoc and proprietary integrations. Such a level of interoperability involves being able to find potential business partners, discovering their services and business processes, and conducting business ""on the fly"". This process of dynamic interoperation is only possible through standard B2B frameworks. Indeed a number of B2B electronic commerce standard frameworks have emerged recently. Although most of these standards are overlapping and competing, each with its own strengths and weaknesses, a closer investigation reveals that they can be used in a manner to complement one another. In this paper we describe such an implementation where an ebXML infrastructure is developed by exploiting the Universal Description, Discovery and Integration (UDDI) registries and RosettaNet Partner Interface Processes (PIPs). 
ebXML is an ambitious effort and produced detailed specifications of an infrastructure both for B2B and B2C e-commerce. However a public ebXML compliant registry/repository mechanism is not available yet. On the other hand, UDDI's approach to developing a registry has been a lot simpler and public registries are available. In ebXML, trading parties collaborate by agreeing on the same business process with complementary roles. Therefore there is a need for standardized business processes. In this respect, exploiting the already developed expertise through RosettaNet PIPs becomes indispensable. We show how to create and use ebXML ""Binary Collaborations"" based on RosettaNet PIPs and provide a GUI tool to allow users to graphically build their ebXML business processes by combining RosettaNet PIPs. In ebXML, trading parties reveal essential information about themselves through Collaboration Protocol Profiles (CPPs). To conduct business, an agreement between parties is necessary and this is expressed" SIGMOD Conference XL: a platform for web services. Daniela Florescu,Andreas Grünhagen,Donald Kossmann,Steffen Rost 2002 We present a platform for Web services. Web services are implemented in a special XML programming language called XL [1, 2]. A Web service receives an XML message as input and returns an XML message as output. The platform supports a number of features that are particularly useful to implement Web services; e.g., logging, timetables, conversations, workflow management, automatic transactions, security. Our platform is going to be compliant with all W3C standards and emerging proposals. The programming language is very abstract and can be optimized automatically (like SQL). Furthermore, the platform allows to integrate Web services that are written in XL and other programming languages. SIGMOD Conference StatiX: making XML count. Juliana Freire,Jayant R. Haritsa,Maya Ramanath,Prasan Roy,Jérôme Siméon 2002 The availability of summary data for XML documents has many applications, from providing users with quick feedback about their queries, to cost-based storage design and query optimization. StatiX is a novel XML Schema-aware statistics framework that exploits the structure derived by regular expressions (which define elements in an XML Schema) to pinpoint places in the schema that are likely sources of structural skew. As we discuss below, this information can be used to build concise, yet accurate, statistical summaries for XML data. StatiX leverages standard XML technology for gathering statistics, notably XML Schema validators, and it uses histograms to summarize both the structure and values in an XML document. In this paper we describe the StatiX system. We develop algorithms that decompose schemas to obtain statistics at different granularities and discuss how statistics can be gathered as documents are validated. We also present an experimental evaluation which demonstrates the accuracy and scalability of our approach and show an application of these statistics to cost-based XML storage design. SIGMOD Conference COUGAR: the network is the database. Wai Fu Fung,David Sun,Johannes Gehrke 2002 "The widespread distribution and availability of small-scale sensors, actuators, and embedded processors is transforming the physical world into a computing platform. One such example is a sensor network consisting of a large number of sensor nodes that combine physical sensing capabilities such as temperature, light, or seismic sensors with networking and computation capabilities [1]. 
Applications range from environmental control, warehouse inventory, and health care to military environments. Existing sensor networks assume that the sensors are preprogrammed and send data to a central frontend where the data is aggregated and stored for offline querying and analysis. This approach has two major drawbacks. First, the user cannot change the behavior of the system on the fly. Second, communication in today's networks is orders of magnitude more expensive than local computation, thus in-network processing can vastly reduce resource usage and thereby extend the lifetime of a sensor network. This demo demonstrates a database approach to unite the seemingly conflicting requirements of scalability and flexibility in monitoring the physical world. We demonstrate the COUGAR System, a new distributed data management infrastructure that scales with the growth of sensor interconnectivity and computational power on the sensors over the next decades. Our system resides directly on the sensor nodes and creates the abstraction of a single processing node without centralizing data or computation." SIGMOD Conference Continually evaluating similarity-based pattern queries on a streaming time series. Like Gao,Xiaoyang Sean Wang 2002 In many applications, local or remote sensors send in streams of data, and the system needs to monitor the streams to discover relevant events/patterns and deliver instant reaction correspondingly. An important scenario is that the incoming stream is a continually appended time series, and the patterns are time series in a database. At each time when a new value arrives (called a time position), the system needs to find, from the database, the nearest or near neighbors of the incoming time series up to the time position. This paper attacks the problem by using Fast Fourier Transform (FFT) to efficiently find the cross correlations of time series, which yields, in a batch mode, the nearest and near neighbors of the incoming time series at many time positions. To take advantage of this batch processing in achieving fast response time, this paper uses prediction methods to predict future values. FFT is used to compute the cross correlations of the predicted series (with the values that have already arrived) and the database patterns, and to obtain predicted distances between the incoming time series at many future time positions and the database patterns. When the actual data value arrives, the prediction error together with the predicted distances is used to filter out patterns that cannot be the nearest or near neighbors, which provides fast responses. Experiments show that with reasonable prediction errors, the performance gain is significant. SIGMOD Conference Wavelet synopses with error guarantees. Minos N. Garofalakis,Phillip B. Gibbons 2002 "Recent work has demonstrated the effectiveness of the wavelet decomposition in reducing large amounts of data to compact sets of wavelet coefficients (termed ""wavelet synopses"") that can be used to provide fast and reasonably accurate approximate answers to queries. A major criticism of such techniques is that unlike, for example, random sampling, conventional wavelet synopses do not provide informative error guarantees on the accuracy of individual approximate answers. In fact, as this paper demonstrates, errors can vary widely (without bound) and unpredictably, even for identical queries on nearly-identical values in distinct parts of the data. 
This lack of error guarantees severely limits the practicality of traditional wavelets as an approximate query-processing tool, because users have no idea of the quality of any particular approximate answer. In this paper, we introduce Probabilistic Wavelet Synopses, the first wavelet-based data reduction technique with guarantees on the accuracy of individual approximate answers. Whereas earlier approaches rely on deterministic thresholding for selecting a set of ""good"" wavelet coefficients, our technique is based on a novel, probabilistic thresholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values, and then flips coins to select the synopsis. We show how our scheme avoids the above pitfalls of deterministic thresholding, providing highly-accurate answers for individual data values in a data vector. We propose several novel optimization algorithms for tuning our probabilistic thresholding scheme to minimize desired error metrics. Experimental results on real-world and synthetic data sets evaluate these algorithms, and demonstrate the effectiveness of our probabilistic wavelet synopses in providing fast, highly-accurate answers with error guarantees." SIGMOD Conference Querying and mining data streams: you only get one look a tutorial. Minos N. Garofalakis,Johannes Gehrke,Rajeev Rastogi 2002 Querying and mining data streams: you only get one look a tutorial. SIGMOD Conference Accelerating XPath location steps. Torsten Grust 2002 This work is a proposal for a database index structure that has been specifically designed to support the evaluation of XPath queries. As such, the index is capable of supporting all XPath axes (including ancestor, following, preceding-sibling, descendant-or-self, etc.). This feature lets the index stand out among related work on XML indexing structures which had a focus on regular path expressions (which correspond to the XPath axes children and descendant-or-self plus name tests). Its ability to start traversals from arbitrary context nodes in an XML document additionally enables the index to support the evaluation of path traversals embedded in XQuery expressions. Despite its flexibility, the new index can be implemented and queried using purely relational techniques, but it performs especially well if the underlying database host provides support for R-trees. A performance assessment which shows quite promising results completes this proposal. SIGMOD Conference Approximate XML joins. Sudipto Guha,H. V. Jagadish,Nick Koudas,Divesh Srivastava,Ting Yu 2002 XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, approximate match in structure, in addition to content, has to be folded into the join operation. We quantify approximate match in structure and content using well defined notions of distance. For structure, we propose computationally inexpensive lower and upper bounds for the tree edit distance metric between two trees. 
We then show how the tree edit distance, and other metrics that quantify distance between trees, can be incorporated in a join framework. We introduce the notion of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set and we propose sampling based algorithms to identify them. This gives rise to a variety of algorithmic approaches for the problem, which we formulate and analyze. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets. SIGMOD Conference Workflow management with service quality guarantees. Michael Gillmann,Gerhard Weikum,Wolfgang Wonner 2002 "Workflow management systems (WFMS) that are geared for the orchestration of business processes across multiple organizations are complex distributed systems: they consist of multiple workflow engines, application servers, and communication middleware servers such as ORBs, where each of these server types can be replicated on multiple computers for scalability and availability.Finding an appropriate system configuration with guaranteed application-specific quality of service in terms of throughput, response time, and tolerable downtime is a major challenge for human system administrators. This paper presents a tool that largely automates the task of configuring a distributed WFMS. Based on a suite of mathematical models, the tool derives the necessary degrees of replication for the various server types in order to meet specified goals for performance and availability as well as ""performability"" when service is degraded due to outages of individual servers. The paper describes the configuration tool, with emphasis on how to capture the load behavior of workflows in a realistic manner. We also present extensive experiments that evaluate the accuracy of the tool's underlying models and demonstrate the practical feasibility of automating the task of configuring a distributed WFMS. The experiments use a detailed simulation which in turn has been validated through measurements with the Mentor-lite prototype system." SIGMOD Conference Executing SQL over encrypted data in the database-service-provider model. Hakan Hacigümüs,Balakrishna R. Iyer,Chen Li,Sharad Mehrotra 2002 "Rapid advances in networking and Internet technologies have fueled the emergence of the ""software as a service"" model for enterprise computing. Successful examples of commercially viable software services include rent-a-spreadsheet, electronic mail services, general storage services, disaster protection services. ""Database as a Service"" model provides users power to create, store, modify, and retrieve data from anywhere in the world, as long as they have access to the Internet. It introduces several challenges, an important issue being data privacy. It is in this context that we specifically address the issue of data privacy.There are two main privacy issues. First, the owner of the data needs to be assured that the data stored on the service-provider site is protected against data thefts from outsiders. Second, data needs to be protected even from the service providers, if the providers themselves cannot be trusted. In this paper, we focus on the second challenge. Specifically, we explore techniques to execute SQL queries over encrypted data. Our strategy is to process as much of the query as possible at the service providers' site, without having to decrypt the data. 
Decryption and the remainder of the query processing are performed at the client site. The paper explores an algebraic framework to split the query to minimize the computation at the client site. Results of experiments validating our approach are also presented." SIGMOD Conference CubeExplorer: online exploration of data cubes. Jiawei Han,Jianyong Wang,Guozhu Dong,Jian Pei,Ke Wang 2002 Data cube enables fast online analysis of large data repositories which is attractive in many applications. Although there are several kinds of available cube-based OLAP products, users may still encounter challenges on effectiveness and efficiency in the exploration of large data cubes due to the huge computation space as well as the huge observation space in a data cube. CubeExplorer is an integrated environment for online exploration of data cubes. It integrates our newly developed techniques on iceberg cube computation [2], cube-based feature extraction, and gradient analysis [1], and makes cube exploration effective and efficient. In this demo, we will show the features of CubeExplorer, especially its power and flexibility at exploring and mining of large databases. SIGMOD Conference Data streams: fresh current or stagnant backwater? (panel). Joseph M. Hellerstein,Jennifer Widom 2002 Data streams: fresh current or stagnant backwater? (panel). SIGMOD Conference HD-Eye: visual clustering of high dimensional data. Alexander Hinneburg,Daniel A. Keim,Markus Wawryniuk 2002 Clustering of large data bases is an important research area with a large variety of applications in the data base context. Missing in most of the research efforts are means for guiding the clustering process and understanding the results, which is especially important for high dimensional data. Visualization technology may help to solve this problem since it provides effective support of different clustering paradigms and allows a visual inspection of the results. The HD-Eye (high-dim. eye) system shows that a tight integration of advanced clustering algorithms and state-of-the-art visualization techniques is powerful for a better understanding and effective guidance of the clustering process, and therefore can help to significantly improve the clustering results. SIGMOD Conference Garlic: a new flavor of federated query processing for DB2. Vanja Josifovski,Peter M. Schwarz,Laura M. Haas,Eileen Tien Lin 2002 "In a large modern enterprise, information is almost inevitably distributed among several database management systems. Despite considerable attention from the research community, relatively few commercial systems have attempted to address this issue. This paper describes new technology that enables clients of IBM's DB2 Universal Database to access the data and specialized computational capabilities of a wide range of non-relational data sources. This technology, based on the Garlic prototype developed at the Almaden Research Center, complements and extends DB2's existing ability to federate relational data sources.The paper focuses on three topics. Firstly, we show how the DB2 catalogs are used as an extensible repository for the metadata needed to access remotely-stored information. Secondly, we describe how the Garlic approach to query planning, in which source-specific modules and the federated server cooperate to develop an optimized execution plan, has been realized in DB2. Lastly, we describe how DB2's query execution engine has been extended to support queries and functions that are evaluated remotely." 
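The "Executing SQL over encrypted data" entry above describes pushing a coarse, privacy-preserving filter to the service provider and finishing the query at the client after decryption. The toy sketch below illustrates that split with a simple bucketization scheme; the bucket boundaries, the base64 "encryption" placeholder, and all names are assumptions made for illustration, not the paper's actual algebraic framework or a real cipher.

```python
# Toy coarse-filter-then-refine sketch for querying encrypted data.
# All details here (buckets, base64 stand-in for encryption, schema) are assumptions.
import json
from base64 import b64encode, b64decode

# Value ranges of the salary attribute mapped to opaque bucket ids.
BUCKETS = [(0, 50_000, "B1"), (50_000, 100_000, "B2"), (100_000, 10**9, "B3")]

def bucket_of(salary: int) -> str:
    return next(b for lo, hi, b in BUCKETS if lo <= salary < hi)

def encrypt(row: dict) -> str:      # placeholder for a real cipher, NOT secure
    return b64encode(json.dumps(row).encode()).decode()

def decrypt(blob: str) -> dict:
    return json.loads(b64decode(blob))

# The "server" stores only (ciphertext, bucket id); it never sees plaintext values.
server_rows = [(encrypt(r), bucket_of(r["salary"]))
               for r in ({"name": "a", "salary": 30_000},
                         {"name": "b", "salary": 70_000},
                         {"name": "c", "salary": 120_000})]

def salary_greater_than(threshold: int):
    # Server side: coarse filter on bucket ids that could contain qualifying rows.
    candidate_buckets = {b for lo, hi, b in BUCKETS if hi > threshold}
    candidates = [blob for blob, b in server_rows if b in candidate_buckets]
    # Client side: decrypt and apply the exact predicate to remove false positives.
    return [row for row in map(decrypt, candidates) if row["salary"] > threshold]

print(salary_greater_than(60_000))  # rows for "b" and "c"
```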
SIGMOD Conference An adaptive peer-to-peer network for distributed caching of OLAP results. Panos Kalnis,Wee Siong Ng,Beng Chin Ooi,Dimitris Papadias,Kian-Lee Tan 2002 Peer-to-Peer (P2P) systems are becoming increasingly popular as they enable users to exchange digital information by participating in complex networks. Such systems are inexpensive, easy to use, highly scalable and do not require central administration. Despite their advantages, however, limited work has been done on employing database systems on top of P2P networks. Here we propose the PeerOLAP architecture for supporting On-Line Analytical Processing queries. A large number of low-end clients, each containing a cache with the most useful results, are connected through an arbitrary P2P network. If a query cannot be answered locally (i.e., by using the cache contents of the computer where it is issued), it is propagated through the network until a peer that has cached the answer is found. An answer may also be constructed from partial results from many peers. Thus PeerOLAP acts as a large distributed cache, which amplifies the benefits of traditional client-side caching. The system is fully distributed and can reconfigure itself on-the-fly in order to decrease the query cost for the observed workload. This paper describes the core components of PeerOLAP and presents our results both from simulation and a prototype installation running on geographically remote peers. SIGMOD Conference Quadtree and R-tree indexes in oracle spatial: a comparison using GIS data. Kothuri Venkata Ravi Kanth,Siva Ravada,Daniel Abugov 2002 Spatial indexing has been one of the active focus areas in recent database research. Several variants of Quadtree and R-tree indexes have been proposed in the database literature. In this paper, we first briefly describe our implementation of Quadtree and R-tree index structures and related optimizations in Oracle Spatial. We then examine the relative merits of the two structures as implemented in Oracle Spatial and compare their performance for different types of queries and other operations. Finally, we summarize experiences with these different structures in indexing large GIS datasets in Oracle Spatial. SIGMOD Conference ACDN: a content delivery network for applications. Pradnya Karbhari,Michael Rabinovich,Zhen Xiao,Fred Douglis 2002 ACDN: a content delivery network for applications. SIGMOD Conference Covering indexes for branching path queries. Raghav Kaushik,Philip Bohannon,Jeffrey F. Naughton,Henry F. Korth 2002 In this paper, we ask if the traditional relational query acceleration techniques of summary tables and covering indexes have analogs for branching path expression queries over tree- or graph-structured XML data. Our answer is yes --- the forward-and-backward index already proposed in the literature can be viewed as a structure analogous to a summary table or covering index. We also show that it is the smallest such index that covers all branching path expression queries. While this index is very general, our experiments show that it can be so large in practice as to offer little performance improvement over evaluating queries directly on the data. Likening the forward-and-backward index to a covering index on all the attributes of several tables, we devise an index definition scheme to restrict the class of branching path expressions being indexed. The resulting index structures are dramatically smaller and perform better than the full forward-and-backward index for these classes of branching path expressions.
This is roughly analogous to the situation in multidimensional or OLAP workloads, in which more highly aggregated summary tables can service a smaller subset of queries but can do so at increased performance. We evaluate the performance of our indexes on both relational decompositions of XML and a native storage technique. As expected, the performance benefit of an index is maximized when the query matches the index definition. SIGMOD Conference Skew handling techniques in sort-merge join. Wei Li,Dengfeng Gao,Richard T. Snodgrass 2002 Joins are among the most frequently executed operations. Several fast join algorithms have been developed and extensively studied; these can be categorized as sort-merge, hash-based, and index-based algorithms. While all three types of algorithms exhibit excellent performance over most data, ameliorating the performance degradation in the presence of skew has been investigated only for hash-based algorithms. However, for sort-merge join, even a small amount of skew present in realistic data can result in a significant performance hit on a commercial DBMS. This paper examines the negative ramifications of skew in sort-merge join and proposes several refinements that deal effectively with data skew. Experiments show that some of these algorithms also impose virtually no penalty in the absence of data skew and are thus suitable for replacing existing sort-merge implementations. We also show how sort-merge band join performance is significantly enhanced with these refinements. SIGMOD Conference XBase: making your gigabyte disk queriable. Hongjun Lu,Guoren Wang,Ge Yu,Yubin Bao,Jianhua Lv,Yaxin Yu 2002 With the rapid development of the Internet and the World Wide Web (WWW), a very large amount of information is available and ready for downloading, most of it free of charge. At the same time, hard disks with large capacity are available at affordable prices. Many of us nowadays dump large numbers of documents of various types into our computers without much thought. On the other hand, file systems have not changed much during the past decades. Most of them organize files in directories that form a tree structure, and a file is identified by its name and pathname in the directory tree. Remembering the names of files created some time ago and digging them out of a disk holding dozens of gigabytes of data in hundreds of thousands of files is never an easy task. Tools available for helping such a search are still far from satisfactory. XBase (XML-based document BASE) is a prototype system aimed at addressing this problem. By XML-based, we mean that XML is used to define the metadata. The current version of XBase stores text-based files, including semi-structured data such as XML, HTML, plain text documents (e.g., tex files, computer programs) and files that can be converted into text (e.g., postscript files, PDF files). In XBase, the file name is optional. Users can simply load a file into XBase without giving it a name or the directory where it should be stored. XBase will automatically associate it with attributes such as the time when the file was saved, its source, its size, and its type. To retrieve those files, XBase provides three access methods: explorative browsing, querying with query languages, and keyword-based search. SIGMOD Conference A scalable hash ripple join algorithm. Gang Luo,Curt J. Ellmann,Peter J. Haas,Jeffrey F.
Naughton 2002 Recently, Haas and Hellerstein proposed the hash ripple join algorithm in the context of online aggregation. Although the algorithm rapidly gives a good estimate for many join-aggregate problem instances, the convergence can be slow if the number of tuples that satisfy the join predicate is small or if there are many groups in the output. Furthermore, if memory overflows (for example, because the user allows the algorithm to run to completion for an exact answer), the algorithm degenerates to block ripple join and performance suffers. In this paper, we build on the work of Haas and Hellerstein and propose a new algorithm that (a) combines parallelism with sampling to speed convergence, and (b) maintains good performance in the presence of memory overflow. Results from a prototype implementation in a parallel DBMS show that its rate of convergence scales with the number of processors, and that when allowed to run to completion, even in the presence of memory overflow, it is competitive with the traditional parallel hybrid hash join algorithm. SIGMOD Conference Middle-tier database caching for e-business. Qiong Luo,Sailesh Krishnamurthy,C. Mohan,Hamid Pirahesh,Honguk Woo,Bruce G. Lindsay,Jeffrey F. Naughton 2002 "While scaling up to the enormous and growing Internet population with unpredictable usage patterns, E-commerce applications face severe challenges in cost and manageability, especially for database servers that are deployed as those applications' backends in a multi-tier configuration. Middle-tier database caching is one solution to this problem. In this paper, we present a simple extension to the existing federated features in DB2 UDB, which enables a regular DB2 instance to become a DBCache without any application modification. On deployment of a DBCache at an application server, arbitrary SQL statements generated from the unchanged application that are intended for a backend database server, can be answered: at the cache, at the backend database server, or at both locations in a distributed manner. The factors that determine the distribution of workload include the SQL statement type, the cache content, the application requirement on data freshness, and cost-based optimization at the cache. We have developed a research prototype of DBCache, and conducted an extensive set of experiments with an E-Commerce benchmark to show the benefits of this approach and illustrate tradeoffs in caching considerations." SIGMOD Conference Distributing queries over low-power wireless sensor networks. Samuel Madden,Joseph M. Hellerstein 2002 Distributing queries over low-power wireless sensor networks. SIGMOD Conference Continuously adaptive continuous queries over streams. Samuel Madden,Mehul A. Shah,Joseph M. Hellerstein,Vijayshankar Raman 2002 We present a continuously adaptive, continuous query (CACQ) implementation based on the eddy query processing framework. We show that our design provides significant performance benefits over existing approaches to evaluating continuous queries, not only because of its adaptivity, but also because of the aggressive cross-query sharing of work and space that it enables. By breaking the abstraction of shared relational algebra expressions, our Telegraph CACQ implementation is able to share physical operators --- both selections and join state --- at a very fine grain. We augment these features with a grouped-filter index to simultaneously evaluate multiple selection predicates. 
We include measurements of the performance of our core system, along with a comparison to existing continuous query approaches. SIGMOD Conference Learning table access cardinalities with LEO. Volker Markl,Guy M. Lohman 2002 LEO is a comprehensive way to repair incorrect statistics and cardinality estimates of a query execution plan. LEO introduces a feedback loop to query optimization that enhances the available information on the database where the most queries have occurred, allowing the optimizer to actually learn from its past mistakes. We demonstrate how LEO learns outdated table access statistics on a TPC-H like database schema and show that LEO improves the estimates for table cardinalities as well as filter factors for single predicates. Thus LEO enables the query optimizer to choose a better query execution plan, resulting in more efficient query processing. We not only demonstrate learning by repetitive execution of a single query, but also illustrate how similar, but not identical queries benefit from learned knowledge. In addition, we show the effect of both learning cardinalities and adjusting related statistics. SIGMOD Conference Tutorial: application servers and associated technologies. C. Mohan 2002 Application Servers (ASs), which have become very popular in the last few years, provide the platforms for the execution of transactional, server-side applications in the online world. ASs are the modern cousins of traditional transaction processing monitors (TPMs) like CICS. In this tutorial, I will provide an introduction to different ASs and their technologies. ASs play a central role in enabling electronic commerce in the web context. They are built on the basis of more standardized protocols and APIs than were the traditional TPMs. The emergence of Java, XML and OMG standards has played a significant role in this regard. Consequently, I will also briefly introduce the related XML, Java and OMG technologies like SOAP, J2EE and CORBA. One of the most important features of ASs is their ability to integrate the modern application environments with legacy data sources like IMS, CICS, VSAM, etc. They provide a number of connectors for this purpose, typically using asynchronous transactional messaging technologies like MQSeries and JMS. Traditional TPM-style requirements for industrial strength features like scalability, availability, reliability and high performance are equally important for ASs also. Security and authentication issues are additional important requirements in the web context. ASs support DBMSs not only as storage engines for user data but also as repositories for tracking their own state. Recently, the ECPerf benchmark has been developed via the Java Community Process to evaluate in a standardized way the cost performance of J2EE-compliant ASs. Several caching technologies have been developed to improve performance of ASs.Soon after this conference is over, the slides of this tutorial will be available on the web at the following URL: http://www.almaden.ibm.com/u/mohan/AppServersTutorial_SIGMOD2002_Slides.pdf SIGMOD Conference General match: a subsequence matching method in time-series databases based on generalized windows. Yang-Sae Moon,Kyu-Young Whang,Wook-Shin Han 2002 We generalize the method of constructing windows in subsequence matching. By this generalization, we can explain earlier subsequence matching methods as special cases of a common framework. Based on the generalization, we propose a new subsequence matching method, General Match. 
The earlier work by Faloutsos et al. (called FRM for convenience) causes a lot of false alarms due to the lack of a point-filtering effect. Dual Match, recently proposed as a dual approach of FRM, improves performance significantly over FRM by exploiting the point-filtering effect. However, it has the problem of having a smaller allowable window size---half that of FRM---given the minimum query length. A smaller window increases false alarms due to the window-size effect. General Match offers the advantages of both methods: it can reduce the window-size effect by using large windows like FRM and, at the same time, can exploit the point-filtering effect like Dual Match. General Match divides data sequences into generalized sliding windows (J-sliding windows) and the query sequence into generalized disjoint windows (J-disjoint windows). We formally prove that General Match is correct, i.e., it incurs no false dismissal. We then propose a method of estimating the optimal value of the sliding factor J that minimizes the number of page accesses. Experimental results for real stock data show that, for low selectivities (10^-6 to 10^-4), General Match improves average performance by 117% over Dual Match and by 998% over FRM; for high selectivities (10^-3 to 10^-1), by 45% over Dual Match and by 64% over FRM. The proposed generalization provides an excellent theoretical basis for understanding the underlying mechanisms of subsequence matching. SIGMOD Conference Best-effort cache synchronization with source cooperation. Chris Olston,Jennifer Widom 2002 In environments where exact synchronization between source data objects and cached copies is not achievable due to bandwidth or other resource constraints, stale (out-of-date) copies are permitted. It is desirable to minimize the overall divergence between source objects and cached copies by selectively refreshing modified objects. We call the online process of selecting which objects to refresh in order to minimize divergence best-effort synchronization. In most approaches to best-effort synchronization, the cache coordinates the process and selects objects to refresh. In this paper, we propose a best-effort synchronization scheduling policy that exploits cooperation between data sources and the cache. We also propose an implementation of our policy that incurs low communication overhead even in environments with very large numbers of sources. Our algorithm is adaptive to wide fluctuations in available resources and data update rates. Through experimental simulation over synthetic and real-world data, we demonstrate the effectiveness of our algorithm, and we quantify the significant decrease in divergence achievable with source cooperation. SIGMOD Conference QURSED: querying and reporting semistructured data. Yannis Papakonstantinou,Michalis Petropoulos,Vasilis Vassalos 2002 QURSED enables the development of web-based query forms and reports (QFRs) that query and report semistructured XML data, i.e., data that are characterized by nesting, irregularities and structural variance. The query aspects of a QFR are captured by its query set specification, which formally encodes multiple parameterized condition fragments and can describe large numbers of queries. The run-time component of QURSED produces XQuery-compliant queries by synthesizing fragments from the query set specification that have been activated during the interaction of the end-user with the QFR.
The design-time component of QURSED, called QURSED Editor, semi-automates the development of the query set specification and its association with the visual components of the QFR by translating visual actions into appropriate query set specifications. We describe QURSED and illustrate how it accommodates the intricacies that the semistructured nature of the underlying database introduces. We specifically focus on the formal model of the query set specification, its generation via the QURSED Editor and its coupling with the visual aspects of the web-based form and report. SIGMOD Conference TPC-DS, taking decision support benchmarking to the next level. Meikel Pöss,Bryan Smith,Lubor Kollár,Per-Åke Larson 2002 TPC-DS is a new decision support benchmark currently under development by the Transaction Processing Performance Council (TPC). This paper provides a brief overview of the new benchmark. The benchmark models the decision support functions of a retail product supplier, including data loading, multiple types of queries and data maintenance. The database consists of multiple snowflake schemas with shared dimension tables; data is skewed; and the query set is large. Overall, the benchmark is considerably more realistic than previous decision support benchmarks. SIGMOD Conference A Monte Carlo algorithm for fast projective clustering. Cecilia Magdalena Procopiuc,Michael Jones,Pankaj K. Agarwal,T. M. Murali 2002 We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images. SIGMOD Conference Partial results for online query processing. Vijayshankar Raman,Joseph M. Hellerstein 2002 Traditional query processors generate full, accurate query results, either in batch or in pipelined fashion. We argue that this strict model is too rigid for exploratory queries over diverse and distributed data sources, such as sources on the Internet. Instead, we propose a looser model of querying in which a user submits a broad initial query outline, and the system continually generates partial result tuples that may contain values for only some of the output fields. The user can watch these partial results accumulate at the user interface, and accordingly refine the query by specifying their interest in different kinds of partial results.After describing our querying model and user interface, we present a query processing architecture for this model which is implemented in the Telegraph dataflow system. Our architecture is designed to generate partial results quickly, and to adapt query execution to changing user interests. The crux of this architecture is a dataflow operator that supports two kinds of reorderings: reordering of intermediate tuples within a dataflow, and reordering of query plan operators through which tuples flow. We study reordering policies that optimize for the quality of partial results delivered over time, and experimentally demonstrate the benefits of our architecture in this context. 
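As a rough illustration of the Monte Carlo style of projective clustering sketched in the Procopiuc et al. abstract above, the short Python program below repeatedly samples a pivot and a small seed set, keeps the dimensions on which the seeds congregate within a width w of the pivot, and scores the candidate cluster induced by those dimensions. The width w, the seed size, and the scoring rule are assumptions made for this sketch, not the parameters or quality measure used in the paper.

```python
# Hypothetical sketch of Monte Carlo projective clustering: sample a pivot and
# a seed set, keep the dimensions on which the seeds stay within width w of the
# pivot, then score the cluster of points close to the pivot on those dimensions.
import random

def find_projective_cluster(points, w=1.0, seed_size=2, trials=200, rng=random.Random(0)):
    best = (0.0, None, None)                       # (score, member indices, dimensions)
    dims = range(len(points[0]))
    for _ in range(trials):
        pivot = rng.choice(points)
        seeds = rng.sample(points, seed_size)
        # Dimensions on which every seed point lies within width w of the pivot.
        subspace = [d for d in dims if all(abs(p[d] - pivot[d]) <= w for p in seeds)]
        if not subspace:
            continue
        # Members: points close to the pivot on every chosen dimension.
        members = [i for i, p in enumerate(points)
                   if all(abs(p[d] - pivot[d]) <= w for d in subspace)]
        # Favor clusters that are both populous and span many dimensions.
        score = len(members) * (2 ** len(subspace))
        if score > best[0]:
            best = (score, members, subspace)
    return best

pts = [(0.1, 5.0, 9.0), (0.2, 5.1, 1.0), (0.15, 4.9, 7.0), (3.0, 8.0, 2.0)]
score, members, subspace = find_projective_cluster(pts)
print(members, subspace)   # the first three points cluster on dimensions 0 and 1
```

Weighting cluster size against the number of congregating dimensions is one simple way to trade off populous, low-dimensional clusters against sparse, high-dimensional ones; more trials increase the chance of hitting a good pivot-and-seed combination.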
SIGMOD Conference Efficient algorithms for minimizing tree pattern queries. Prakash Ramanan 2002 We consider the problem of minimizing tree pattern queries (TPQ) that arise in XML and in LDAP-style network directories. In [Minimization of Tree Pattern Queries, Proc. ACM SIGMOD Intl. Conf. Management of Data, 2001, pp. 497-508], Amer-Yahia, Cho, Lakshmanan and Srivastava presented an O(n4) algorithm for minimizing TPQs in the absence of integrity constraints (Case 1); n is the number of nodes in the query. Then they considered the problem of minimizing TPQs in the presence of three kinds of integrity constraints: required-child, required-descendant and subtype (Case 2). They presented an O(n6) algorithm for minimizing TPQs in the presence of only required-child and required-descendant constraints (i.e., no subtypes allowed; Case 3). We present O(n2), O(n4) and O(n2) algorithms for minimizing TPQs in these three cases, respectively, based on the concept of graph simulation. We believe that our O(n2) algorithms for Cases 1 and 3 are runtime optimal. SIGMOD Conference GEA: a toolkit for gene expression analysis. Jessica M. Phan,Raymond T. Ng 2002 Currently gene expression data are being produced at a phenomenal rate. The general objective is to try to gain a better understanding of the functions of cellular tissues. In particular, one specific goal is to relate gene expression to cancer diagnosis, prognosis and treatment. However, a key obstacle is that the availability of analysis tools or lack thereof, impedes the use of the data, making it difficult for cancer researchers to perform analysis efficiently and effectively. SIGMOD Conference Automating physical database design in a parallel database. Jun Rao,Chun Zhang,Nimrod Megiddo,Guy M. Lohman 2002 "Physical database design is important for query performance in a shared-nothing parallel database system, in which data is horizontally partitioned among multiple independent nodes. We seek to automate the process of data partitioning. Given a workload of SQL statements, we seek to determine automatically how to partition the base data across multiple nodes to achieve overall optimal (or close to optimal) performance for that workload. Previous attempts use heuristic rules to make those decisions. These approaches fail to consider all of the interdependent aspects of query performance typically modeled by today's sophisticated query optimizers.We present a comprehensive solution to the problem that has been tightly integrated with the optimizer of a commercial shared-nothing parallel database system. Our approach uses the query optimizer itself both to recommend candidate partitions for each table that will benefit each query in the workload, and to evaluate various combinations of these candidates. We compare a rank-based enumeration method with a random-based one. Our experimental results show that the former is more effective." SIGMOD Conference Statistical synopses for graph-structured XML databases. Neoklis Polyzotis,Minos N. Garofalakis 2002 Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. 
Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, we introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSKETCH) exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is NP-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Briefly, our algorithm constructs an XSKETCH synopsis by successive refinements of the label-split graph, the coarsest summary of the XML data graph. Our refinement operations act locally and attempt to capture important statistical correlations between data paths. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions. SIGMOD Conference Efficient integration and aggregation of historical information. Mirek Riedewald,Divyakant Agrawal,Amr El Abbadi 2002 Data warehouses support the analysis of historical data. This often involves aggregation over a period of time. Furthermore, data is typically incorporated in the warehouse in increasing order of a time attribute, e.g., the date of a sale or the time of a temperature measurement. In this paper we propose a framework to take advantage of this append-only nature of updates due to a time attribute. The framework allows us to integrate large amounts of new data into the warehouse and generate historical summaries efficiently. Query and update costs are virtually independent of the extent of the data set in the time dimension, making our framework an attractive aggregation approach for append-only data streams. A specific instantiation of the general approach is developed for MOLAP data cubes, involving a new data structure for append-only arrays with pre-aggregated values. Our framework is applicable to point data and data with extent, e.g., hyper-rectangles. SIGMOD Conference XmdvTool: visual interactive data exploration and trend discovery of high-dimensional data sets. Elke A. Rundensteiner,Matthew O. Ward,Jing Yang,Punit R. Doshi 2002 XmdvTool: visual interactive data exploration and trend discovery of high-dimensional data sets. SIGMOD Conference Database tuning: principles, experiments, and troubleshooting techniques (part II). Dennis Shasha,Philippe Bonnet 2002 Database tuning: principles, experiments, and troubleshooting techniques (part II). SIGMOD Conference Database tuning: principles, experiments, and troubleshooting techniques (part I). Dennis Shasha,Philippe Bonnet 2002 Database tuning: principles, experiments, and troubleshooting techniques (part I). SIGMOD Conference The SDSS skyserver: public access to the sloan digital sky server data. Alexander S.
Szalay,Jim Gray,Ani Thakar,Peter Z. Kunszt,Tanu Malik,Jordan Raddick,Christopher Stoughton,Jan vandenBerg 2002 The SkyServer provides Internet access to the public Sloan Digital Sky Survey (SDSS) data for both astronomers and science education. This paper describes the SkyServer goals and architecture. It also describes our experience operating the SkyServer on the Internet. The SDSS data is public and well-documented, so it makes a good test platform for research on database algorithms and performance. SIGMOD Conference Dwarf: shrinking the PetaCube. Yannis Sismanis,Antonios Deligiannakis,Nick Roussopoulos,Yannis Kotidis 2002 Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes, but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high-dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy yields an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A petabyte 25-dimensional cube was shrunk this way to a 2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions. This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include comparisons on detailed measurements with real and synthetic datasets against previously published techniques. The comparisons show that Dwarfs by far outperform these techniques on all counts: storage space, creation time, query response time, and updates of cubes. SIGMOD Conference Time-parameterized queries in spatio-temporal databases. Yufei Tao,Dimitris Papadias 2002 Time-parameterized queries (TP queries for short) retrieve (i) the actual result at the time that the query is issued, (ii) the validity period of the result given the current motion of the query and the database objects, and (iii) the change that causes the expiration of the result. Due to the highly dynamic nature of several spatio-temporal applications, TP queries are important both as standalone methods and as building blocks of more complex operations. However, little work has been done towards their efficient processing. In this paper, we propose a general framework that covers time-parameterized variations of the most common spatial queries, namely window queries, k-nearest neighbors and spatial joins. In particular, each of these TP queries is reduced to nearest neighbor search where the distance functions are defined according to the query type. This reduction allows the application and extension of well-known branch-and-bound techniques to the current problem. The proposed methods can be applied with mobile queries, mobile objects or both, given a suitable indexing method.
Our experimental evaluation is based on R-trees and their extensions for dynamic objects. SIGMOD Conference Storing and querying ordered XML using a relational database system. Igor Tatarinov,Stratis Viglas,Kevin S. Beyer,Jayavel Shanmugasundaram,Eugene J. Shekita,Chun Zhang 2002 "XML is quickly becoming the de facto standard for data exchange over the Internet. This is creating a new set of data management requirements involving XML, such as the need to store and query XML documents. Researchers have proposed using relational database systems to satisfy these requirements by devising ways to ""shred"" XML documents into relations, and translate XML queries into SQL queries over these relations. However, a key issue with such an approach, which has largely been ignored in the research literature, is how (and whether) the ordered XML data model can be efficiently supported by the unordered relational data model. This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system. This is accomplished by encoding order as a data value. We propose three order encoding methods that can be used to represent XML order in the relational data model, and also propose algorithms for translating ordered XPath expressions into SQL using these encoding methods. Finally, we report the results of an experimental study that investigates the performance of the proposed order encoding methods on a workload of ordered XML queries and updates." SIGMOD Conference Mid-tier caching: the TimesTen approach. Times-Ten Team 2002 TimesTen is an in-memory, application-tier data manager that delivers low response time and high throughput. Applications may create tables and manage them exclusively in TimesTen, and they may optionally cache frequently used subsets of a disk-based relational database in TimesTen. Cached tables and tables managed exclusively by TimesTen may coexist in the same database. Queries and updates to the cache are performed by the application through SQL. Applications running on different mid-tier servers may cache different or overlapping subsets of the same back-end database. TimesTen keeps the caches synchronized with each other and with the back-end database. SIGMOD Conference Dynamic multidimensional histograms. Nitin Thaper,Sudipto Guha,Piotr Indyk,Nick Koudas 2002 Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization, and approximate query answering. They are also a basic tool for data visualization and analysis.In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation of the data distribution of the underlying continuous stream. On demand, an accurate histogram is derived from this dynamic data structure. We propose algorithms for extracting such an accurate histogram and we analyze their behavior and tradeoffs. The proposed algorithms are able to provide approximate guarantees about the quality of the estimation of the histograms they extract.We complement our analytical results with a thorough experimental evaluation using real data sets. SIGMOD Conference The XXL search engine: ranked retrieval of XML data using indexes and ontologies. 
Anja Theobald,Gerhard Weikum 2002 The XXL search engine: ranked retrieval of XML data using indexes and ontologies. SIGMOD Conference Rate-based query optimization for streaming information sources. Stratis Viglas,Jeffrey F. Naughton 2002 Relational query optimizers have traditionally relied upon table cardinalities when estimating the cost of the query plans they consider. While this approach has been and continues to be successful, the advent of the Internet and the need to execute queries over streaming sources requires a different approach, since for streaming inputs the cardinality may not be known or may not even be knowable (as is the case for an unbounded stream.) In view of this, we propose shifting from a cardinality-based approach to a rate-based approach, and give an optimization framework that aims at maximizing the output rate of query evaluation plans. This approach can be applied to cases where the cardinality-based approach cannot be used. It may also be useful for cases where cardinalities are known, because by focusing on rates we are able not only to optimize the time at which the last result tuple appears, but also to optimize for the number of answers computed at any specified time after the query evaluation commences. We present a preliminary validation of our rate-based optimization framework on a prototype XML query engine, though it is generic enough to be used in other database contexts. The results show that rate-based optimization is feasible and can indeed yield correct decisions. SIGMOD Conference Efficient k-NN search on vertically decomposed data. Arjen P. de Vries,Nikos Mamoulis,Niels Nes,Martin L. Kersten 2002 Applications like multimedia retrieval require efficient support for similarity search on large data collections. Yet, nearest neighbor search is a difficult problem in high dimensional spaces, rendering efficient applications hard to realize: index structures degrade rapidly with increasing dimensionality, while sequential search is not an attractive solution for repositories with millions of objects. This paper approaches the problem from a different angle. A solution is sought in an unconventional storage scheme, that opens up a new range of techniques for processing k-NN queries, especially suited for high dimensional spaces. The suggested (physical) database design accommodates well a novel variant of branch-and-bound search, that reduces the high dimensional space quickly to a small candidate set. The paper provides insight in applying this idea to k-NN search using two similarity metrics commonly encountered in image database applications, and discusses techniques for its implementation in relational database systems. The effectiveness of the proposed method is evaluated empirically on both real and synthetic data sets, reporting the significant improvements in response time yielded. SIGMOD Conference COMMIX: towards effective web information extraction, integration and query answering. Tengjiao Wang,Shiwei Tang,Dongqing Yang,Jun Gao,Yuqing Wu,Jian Pei 2002 As WWW becomes more and more popular and powerful, how to search information on the web in database way becomes an important research topic. COMMIX, which is developed in the DB group in Peking University (China), is a system towards building very large database using data from the Web for information extraction, integration and query answering. 
COMMIX has some innovative features, such as ontology-based wrapper generation, XML-based information integration, view-based query answering, and QBE-style XML query interface. SIGMOD Conference Clustering by pattern similarity in large data sets. Haixun Wang,Wei Wang,Jiong Yang,Philip S. Yu 2002 Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness. SIGMOD Conference Efficient execution of joins in a star schema. Andreas Weininger 2002 A star schema is very popular for modeling data warehouses and data marts. Therefore, it is important that a database system which is used for implementing such a data warehouse or data mart is able to efficiently handle operations on such a schema. In this paper we will describe how one of these operations, the join operation --- probably the most important operation --- is implemented in the IBM Informix Extended Parallel Server (XPS). SIGMOD Conference Mining long sequential patterns in a noisy environment. Jiong Yang,Wei Wang,Philip S. Yu,Jiawei Han 2002 "Pattern discovery in long sequences is of great importance in many applications including computational biology study, consumer behavior analysis, system performance analysis, etc. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact to the biological function of the protein. It would be desirable if the occurrence of D in the observation can be related to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of compatibility matrix as the means to provide a probabilistic connection from the observation to the underlying true value. A new metric match is also proposed to capture the ""real support"" of a pattern which would be expected if a noise-free environment is assumed. In addition, in the context we address, a pattern could be very long. The standard pruning technique developed for the market basket problem may not work efficiently. 
As a result, a novel algorithm that combines statistical sampling and a new technique (namely border collapsing) is devised to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model (with respect to the noise) and the efficiency of the probabilistic algorithm." SIGMOD Conference Efficient evaluation of queries in a mediator for WebSources. Vladimir Zadorozhny,Louiqa Raschid,Maria-Esther Vidal,Tolga Urhan,Laura Bright 2002 We consider an architecture of mediators and wrappers for Internet accessible WebSources of limited query capability. Each call to a source is a WebSource Implementation (WSI) and it is associated with both a capability and (a possibly dynamic) cost. The multiplicity of WSIs with varying costs and capabilities increases the complexity of a traditional optimizer that must assign WSIs for each remote relation in the query while generating an (optimal) plan. We present a two-phase Web Query Optimizer (WQO). In a pre-optimization phase, the WQO selects one or more WSIs for a pre-plan; a pre-plan represents a space of query evaluation plans (plans) based on this choice of WSIs. The WQO uses cost-based heuristics to evaluate the choice of WSI assignment in the pre-plan and to choose a good pre-plan. The WQO uses the pre-plan to drive the extended relational optimizer to obtain the best plan for a pre-plan. A prototype of the WQO has been developed. We compare the effectiveness of the WQO, i.e., its ability to efficiently search a large space of plans and obtain a low cost plan, in comparison to a traditional optimizer. We also validate the cost-based heuristics by experimental evaluation of queries in the noisy Internet environment. SIGMOD Conference Rainbow: mapping-driven XQuery processing system. Xin Zhang,Mukesh Mulchandani,Steffen Christ,Brian Murphy,Elke A. Rundensteiner 2002 Rainbow: mapping-driven XQuery processing system. SIGMOD Conference Implementing database operations using SIMD instructions. Jingren Zhou,Kenneth A. Ross 2002 Modern CPUs have instructions that allow basic operations to be performed on several data elements in parallel. These instructions are called SIMD instructions, since they apply a single instruction to multiple data elements. SIMD technology was initially built into commodity processors in order to accelerate the performance of multimedia applications. SIMD instructions provide new opportunities for database engine design and implementation. We study various kinds of operations in a database context, and show how the inner loop of the operations can be accelerated using SIMD instructions. The use of SIMD instructions has two immediate performance benefits: It allows a degree of parallelism, so that many operands can be processed at once. It also often leads to the elimination of conditional branch instructions, reducing branch mispredictions.We consider the most important database operations, including sequential scans, aggregation, index operations, and joins. We present techniques for implementing these using SIMD instructions. We show that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of SIMD technology. Our study shows that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms. 
Superlinear speedups are obtained as a result of the elimination of branch misprediction effects. SIGMOD Conference Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002 Michael J. Franklin,Bongki Moon,Anastassia Ailamaki 2002 Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 3-6, 2002 VLDB Watermarking Relational Databases. Rakesh Agrawal,Jerry Kiernan 2002 We enunciate the need for watermarking database relations to deter their piracy, identify the unique characteristics of relational data which pose new challenges for watermarking, and provide desirable properties of a watermarking system for relational data. A watermark can be applied to any database relation having attributes which are such that changes in a few of their values do not affect the applications. We then present an effective watermarking technique geared for relational data. This technique ensures that some bit positions of some of the attributes of some of the tuples contain specific values. The tuples, attributes within a tuple, bit positions in an attribute, and specific bit values are all algorithmically determined under the control of a private key known only to the owner of the data. This bit pattern constitutes the watermark. Only if one has access to the private key can the watermark be detected with high probability. Detecting the watermark neither requires access to the original data nor the watermark. The watermark can be detected even in a small subset of a watermarked relation as long as the sample contains some of the marks. Our extensive analysis shows that the proposed technique is robust against various forms of malicious attacks and updates to the data. Using an implementation running on DB2, we also show that the performance of the algorithms allows for their use in real world applications. VLDB Hippocratic Databases. Rakesh Agrawal,Jerry Kiernan,Ramakrishnan Srikant,Yirong Xu 2002 The Hippocratic Oath has guided the conduct of physicians for centuries. Inspired by its tenet of preserving privacy, we argue that future database systems must include responsibility for the privacy of data they manage as a founding tenet. We enunciate the key privacy principles for such Hippocratic database systems. We propose a strawman design for Hippocratic databases, identify the technical challenges and problems in designing such databases, and suggest some approaches that may lead to solutions. Our hope is that this paper will serve to catalyze a fruitful and exciting direction for future database research. VLDB BANKS: Browsing and Keyword Searching in Relational Databases. B. Aditya,Gaurav Bhalotia,Soumen Chakrabarti,Arvind Hulgeri,Charuta Nakhe,Parag,S. Sudarshan 2002 The BANKS system enables keyword-based search on databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. Extensive support for answer ranking forms a critical part of the BANKS system. VLDB COMA - A System for Flexible Combination of Schema Matching Approaches. Hong Hai Do,Erhard Rahm 2002 Schema matching is the task of finding semantic correspondences between elements of two schemas. 
It is needed in many database applications, such as integration of web data sources, data warehouse loading and XML message mapping. To reduce the amount of user effort as much as possible, automatic approaches combining several match techniques are required. While such match approaches have found considerable interest recently, the problem of how to best combine different match algorithms still requires further work. We have thus developed the COMA schema matching system as a platform to combine multiple matchers in a flexible way. We provide a large spectrum of individual matchers, in particular a novel approach aiming at reusing results from previous match operations, and several mechanisms to combine the results of matcher executions. We use COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas. The results obtained so far show the superiority of combined match approaches and indicate the high value of reuse-oriented strategies. VLDB Active XML: Peer-to-Peer Data and Web Services Integration. Serge Abiteboul,Omar Benjelloun,Ioana Manolescu,Tova Milo,Roger Weber 2002 Active XML: Peer-to-Peer Data and Web Services Integration. VLDB Database Technologies for Electronic Commerce. Rakesh Agrawal,Ramakrishnan Srikant,Yirong Xu 2002 Database Technologies for Electronic Commerce. VLDB Toward Recovery-Oriented Computing. Armando Fox 2002 Recovery Oriented Computing (ROC) is a joint research effort between Stanford University and the University of California, Berkeley. ROC takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. This perspective is supported both by historical evidence and by recent studies on the main sources of outages in production systems. By concentrating on reducing Mean Time to Repair (MTTR) rather than increasing Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. We describe the principles and philosophy behind the joint Stanford/Berkeley ROC effort and outline some of its research areas and current projects. VLDB An Automated System for Web Portal Personalization. Charu C. Aggarwal,Philip S. Yu 2002 This paper proposes a system for personalization of web portals. A specific implementation is discussed in reference to a web portal containing a news feed service. Techniques are proposed for effective categorization, management, and personalization of news feeds obtained from a live news wire service. The process consists of two steps: first, manual input is required to build the domain knowledge, which could be site-specific; then the automated component uses this domain knowledge in order to perform the personalization, categorization and presentation. Effective schemes for advertising are proposed, where the targeting is done using both the information about the user and the content of the web page on which the advertising icon appears. Automated techniques for identifying sudden variations in news patterns are described; these may be used for supporting news alerts. A description of a version of this software for our customer web site is provided. VLDB "OBK - An Online High Energy Physics' Meta-Data Repository." I. Alexandrov,Antonio Amorim,E. Badescu,M. Barczyk,D. Burckhart-Chromek,M. Caprini,M. Dobson,J. Flammer,R. Hart,R. Jones,A. Kazarov,S. Kolos,V. Kotov,Dietrich Liko,Levi Lucio,L. Mapelli,M. Mineev,L. Moneta,I. Papadopoulos,M. Nassiakou,N. Parrington,Luis Pedro,A.
Ribeiro,Yu. Ryabov,D. Schweiger,I. Soloviev,H. Wolters 2002 "OBK - An Online High Energy Physics' Meta-Data Repository." VLDB Eliminating Fuzzy Duplicates in Data Warehouses. Rohit Ananthakrishna,Surajit Chaudhuri,Venkatesh Ganti 2002 The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse. VLDB Adaptable Similarity Search using Non-Relevant Information. T. V. Ashwin,Rahul Gupta,Sugata Ghosal 2002 "Many modern database applications require content-based similarity search capability in numeric attribute space. Further, users' notion of similarity varies between search sessions. Therefore online techniques for adaptively refining the similarity metric based on relevance feedback from the user are necessary. Existing methods use retrieved items marked relevant by the user to refine the similarity metric, without taking into account the information about non-relevant (or unsatisfactory) items. Consequently items in database close to non-relevant ones continue to be retrieved in further iterations. In this paper a robust technique is proposed to incorporate non-relevant information to efficiently discover the feasible search region. A decision surface is determined to split the attribute space into relevant and nonrelevant regions. The decision surface is composed of hyperplanes, each of which is normal to the minimum distance vector from a nonrelevant point to the convex hull of the relevant points. A similarity metric, estimated using the relevant objects is used to rank and retrieve database objects in the relevant region. Experiments on simulated and benchmark datasets demonstrate robustness and superior performance of the proposed technique over existing adaptive similarity search techniques." VLDB Provisions and Obligations in Policy Management and Security Applications. Claudio Bettini,Sushil Jajodia,Xiaoyang Sean Wang,Duminda Wijesekera 2002 "Policies are widely used in many systems and applications. Recently, it has been recognized that a ""yes/no"" response to every scenario is just not enough for many modern systems and applications. Many policies require certain conditions to be satisfied and actions to be performed before or after a decision is made. To address this need, this paper introduces the notions of provisions and obligations. Provisions are those conditions that need to be satisfied or actions that must be performed before a decision is rendered, while obligations are those conditions or actions that must be fulfilled by either the users or the system after the decision. This paper formalizes a rule-based policy framework that includes provisions and obligations, and investigates a reasoning mechanism within this framework. 
A policy decision may be supported by more than one derivation, each associated with a potentially different set of provisions and obligations (called a global PO set). The reasoning mechanism can derive all the global PO sets for each specific policy decision, and facilitates the selection of the best one based on numerical weights assigned to provisions and obligations as well as on semantic relationships among them. The paper also shows the use of the proposed policy framework in a security application." VLDB DTD-Directed Publishing with Attribute Translation Grammars. Michael Benedikt,Chee Yong Chan,Wenfei Fan,Rajeev Rastogi,Shihui Zheng,Aoying Zhou 2002 We present a framework for publishing relational data in XML with respect to a fixed DTD. In data exchange on the Web, XML views of relational data are typically required to conform to a predefined DTD. The presence of recursion in a DTD as well as non-determinism makes it challenging to generate DTD-directed, efficient transformations. Our framework provides a language for defining views that are guaranteed to be DTD-conformant, as well as middleware for evaluating these views. It is based on a novel notion of attribute translation grammars (ATGs). An ATG extends a DTD by associating semantic rules via SQL queries. Directed by the DTD, it extracts data from a relational database, and constructs an XML document. We provide algorithms for efficiently evaluating ATGs, along with methods for statically analyzing them. This yields a systematic and effective approach to publishing data with respect to a predefined DTD. VLDB LegoDB: Customizing Relational Storage for XML Documents. Philip Bohannon,Juliana Freire,Jayant R. Haritsa,Maya Ramanath,Prasan Roy,Jérôme Siméon 2002 LegoDB: Customizing Relational Storage for XML Documents. VLDB An Almost-Serial Protocol for Transaction Execution in Main-Memory Database Systems. Stephen Blott,Henry F. Korth 2002 "Disk-based database systems benefit from concurrency among transactions - usually with marginal overhead. For main-memory database systems, however, locking overhead can have a serious impact on performance. This paper proposes SP, a serial protocol for the execution of transactions in main-memory systems, and evaluates its performance against that of strict two-phase locking. The novelty of SP lies in the use of timestamps and mutexes to allow one transaction to begin before its predecessors' commit records have been written to disk, while also ensuring that no committed transactions read uncommitted data. We demonstrate seven-fold and two-fold increases in maximum throughput for read-and update-intensive workloads, respectively. At fixed loads, we demonstrate ten-fold and two-fold improvements in response time for the same transaction mixes. We show that for a wide range of practical workloads, SP on a single processor outperforms locking on a multiprocessor, and then present a modified SP, that exploits multiprocessor systems." VLDB Optimizing View Queries in ROLEX to Support Navigable Result Trees. Philip Bohannon,Sumit Ganguly,Henry F. Korth,P. P. S. Narayan,Pradeep Shenoy 2002 An increasing number of applications use XML data published from relational databases. For speed and convenience, such applications routinely cache this XML data locally and access it through standard navigational interfaces such as DOM, sacrificing the consistency and integrity guarantees provided by a DBMS for speed. 
The ROLEX system is being built to extend the capabilities of relational database systems to deliver fast, consistent and navigable XML views of relational data to an application via a virtual DOM interface. This interface translates navigation operations on a DOM tree into execution-plan actions, allowing a spectrum of possibilities for lazy materialization. The ROLEX query optimizer uses a characterization of the navigation behavior of an application, and optimizes view queries to minimize the expected cost of that navigation. This paper presents the architecture of ROLEX, including its model of query execution and the query optimizer. We demonstrate with a performance study the advantages of the ROLEX approach and the importance of optimizing query execution for navigation. VLDB View Invalidation for Dynamic Content Caching in Multitiered Architectures. K. Selçuk Candan,Divyakant Agrawal,Wen-Syan Li,Oliver Po,Wang-Pin Hsiung 2002 "In today's multitiered application architectures, clients do not access data stored in the databases directly. Instead, they use applications which in turn invoke the DBMS to generate the relevant content. Since executing application programs may require significant time and other resources, it is more advantageous to cache application results in a result cache. Various view materialization and update management techniques have been proposed to deal with updates to the underlying data. These techniques guarantee that the cached results are always consistent with the underlying data. Several applications, including e-commerce sites, on the other hand, do not require the caches be consistent all the time. Instead, they require that all outdated pages in the caches are invalidated in a timely fashion. In this paper, we show that invalidation is inherently different from view maintenance. We develop algorithms that benefit from this difference in reducing the cost of update management in certain applications and we present an invalidation framework that benefits from these algorithms." VLDB Data Routing Rather than Databases: The Meaning of the Next Wave of the Web Revolution to Data Management. Adam Bosworth 2002 "What is going to be as important in the next 20 years as relational databases were in the prior 20 years is the management of self-describing extensible messages. The net is undergoing a profound change as it moves from an entirely pull-oriented model into a push model. This latter model is far more biological in nature with an increasing amount of information flowing asynchronously through the system to form an InformationBus. The key challenges for the next 20 years will be storing, routing, querying, filtering, managing, and interacting with this bus in a manner that doesn't lead to total systems degradation. Predictive intelligent filtering and rules engines will become more important than querying. Driving factors for this revolution will be the need for push for portable devices due to their poor latency and intermittent communication, an increasing demand for timely information on fully connected devices, a huge rise in application to application integration through asynchronous messaging based on web services and a concomitant requirement for an entirely new type of message broker, and an increasing desire for intelligent agents to cope with information overload as all information becomes available all the time. 
The key enabling technology will be XML messages and the various technologies that will develop for handling XML ranging from transformation to compression to indexing to storage to programming languages." VLDB Chip-Secured Data Access: Confidential Data on Untrusted Servers. Luc Bouganim,Philippe Pucheral 2002 "The democratization of ubiquitous computing (access data anywhere, anytime, anyhow), the increasing connection of corporate databases to the Internet and the today's natural resort to Web-hosting companies strongly emphasize the need for data confidentiality. Database servers arouse user's suspicion because no one can fully trust traditional security mechanisms against more and more frequent and malicious attacks and no one can be fully confident on an invisible DBA administering confidential data. This paper gives an in-depth analysis of existing security solutions and concludes on the intrinsic weakness of the traditional server-based approach to preserve data confidentiality. With this statement in mind, we propose a solution called C-SDA (Chip-Secured Data Access), which enforces data confidentiality and controls personal privileges thanks to a client-based security component acting as a mediator between a client and an encrypted database. This component is embedded in a smartcard to prevent any tampering to occur. This cooperation of hardware and software security components constitutes a strong guarantee against attacks threatening personal as well as business data." VLDB Monitoring Streams - A New Class of Data Management Applications. Donald Carney,Ugur Çetintemel,Mitch Cherniack,Christian Convey,Sangdon Lee,Greg Seidman,Michael Stonebraker,Nesime Tatbul,Stanley B. Zdonik 2002 This paper introduces monitoring applications, which we will show differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS that is currently under construction at Brandeis University, Brown University, and M.I.T. We describe the basic system architecture, a stream-oriented set of operators, optimization tactics, and support for real-time operation. VLDB Using Latency-Recency Profiles for Data Delivery on the Web. Laura Bright,Louiqa Raschid 2002 An important challenge to web technologies such as proxy caching, web portals, and application servers is keeping cached data up-to-date. Clients may have different preferences for the latency and recency of their data. Some prefer the most recent data, others will accept stale cached data that can be delivered quickly. Existing approaches to maintaining cache consistency do not consider this diversity and may increase the latency of requests, consume excessive bandwidth, or both. Further, this overhead may be unnecessary in cases where clients will tolerate stale data that can be delivered quickly. This paper introduces latency-recency profiles, a set of parameters that allow clients to express preferences for their different applications. A cache or portal uses profiles to determine whether to deliver a cached object to the client or to download a fresh object from a remote server. We present an architecture for profiles that is both scalable and straightforward to implement at a cache. 
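The profile-driven delivery decision described in the latency-recency abstract above can be pictured as a small per-request predicate. The Python sketch below is only an editorial illustration under assumed profile fields (max_staleness_s and max_latency_s are hypothetical names, not the paper's parameters): serve the cached copy when it is recent enough for the client, or when fetching a fresh copy would exceed the client's latency budget.

```python
def serve_from_cache(entry_age_s, est_refresh_latency_s, profile):
    # `profile` is a hypothetical latency-recency profile:
    #   max_staleness_s -- how stale an object this client tolerates
    #   max_latency_s   -- how long this client will wait for a fresh copy
    if entry_age_s <= profile["max_staleness_s"]:
        return True   # cached copy is recent enough for this client
    if est_refresh_latency_s > profile["max_latency_s"]:
        return True   # refreshing would violate the client's latency bound
    return False      # otherwise download a fresh object from the origin server

# two example profiles: one recency-sensitive client, one latency-sensitive client
recency_first = {"max_staleness_s": 5, "max_latency_s": 10.0}
latency_first = {"max_staleness_s": 3600, "max_latency_s": 0.5}
print(serve_from_cache(60, 1.2, recency_first))  # False: fetch a fresh copy
print(serve_from_cache(60, 1.2, latency_first))  # True: serve the stale copy quickly
```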
Experimental results using both synthetic and trace data show that profiles can reduce latency and bandwidth consumption compared to existing approaches, while still delivering fresh data in many cases. When there is insufficient bandwidth to answer all requests at once, profiles significantly reduce latencies for all clients. VLDB Searching and Mining Fine-Grained Semi-Structured Data. Soumen Chakrabarti 2002 Searching and Mining Fine-Grained Semi-Structured Data. VLDB Fast and Accurate Text Classification via Multiple Linear Discriminant Projections. Soumen Chakrabarti,Shourya Roy,Mahesh V. Soundalgekar 2002 "Abstract.Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats the running time of a popular SVM implementation by orders of magnitude. While describing SIMPL, we make a detailed experimental comparison of SVM-generated discriminants with Fisher's discriminants, and we also report on an analysis of the cache performance of a popular SVM implementation. Our analysis shows that SIMPL has the potential to be the method of choice for practitioners who want the accuracy of SVMs and the simplicity and speed of naive Bayes classifiers." VLDB Efficient Structural Joins on Indexed XML Documents. Shu-Yao Chien,Zografoula Vagena,Donghui Zhang,Vassilis J. Tsotras,Carlo Zaniolo 2002 Queries on XML documents typically combine selections on element contents, and, via path expressions, the structural relationships between tagged elements. Structural joins are used to find all pairs of elements satisfying the primitive structural relationships specified in the query, namely, parent-child and ancestor-descendant relationships. Efficient support for structural joins is thus the key to efficient implementations of XML queries. Recently proposed node numbering schemes enable the capturing of the XML document structure using traditional indices (such as B+-trees or R-trees). This paper proposes efficient structural join algorithms in the presence of tag indices. We first concentrate on using B+- trees and show how to expedite a structural join by avoiding collections of elements that do not participate in the join. We then introduce an enhancement (based on sibling pointers) that further improves performance. Such sibling pointers are easily implemented and dynamically maintainable. We also present a structural join algorithm that utilizes R-trees. 
An extensive experimental comparison shows that the B+-tree structural joins are more robust. Furthermore, they provide drastic improvement gains over the current state of the art. VLDB Tree Pattern Aggregation for Scalable XML Data Dissemination. Chee Yong Chan,Wenfei Fan,Pascal Felber,Minos N. Garofalakis,Rajeev Rastogi 2002 "With the rapid growth of XML-document traffic on the Internet, scalable content-based dissemination of XML documents to a large, dynamic group of consumers has become an important research challenge. To indicate the type of content that they are interested in, data consumers typically specify their subscriptions using some XML pattern specification language (e.g., XPath). Given the large volume of subscribers, system scalability and efficiency mandate the ability to aggregate the set of consumer subscriptions to a smaller set of content specifications, so as to both reduce their storage-space requirements as well as speed up the document-subscription matching process. In this paper, we provide the first systematic study of subscription aggregation where subscriptions are specified with tree patterns (an important subclass of XPath expressions). The main challenge is to aggregate an input set of tree patterns into a smaller set of generalized tree patterns such that: (1) a given space constraint on the total size of the subscriptions is met, and (2) the loss in precision (due to aggregation) during document filtering is minimized. We propose an efficient tree-pattern aggregation algorithm that makes effective use of document-distribution statistics in order to compute a precise set of aggregate tree patterns within the allotted space budget. As part of our solution, we also develop several novel algorithms for tree-pattern containment and minimization, as well as ""least-upper-bound"" computation for a set of tree patterns. These results are of interest in their own right, and can prove useful in other domains, such as XML query optimization. Extensive results from a prototype implementation validate our approach." VLDB RE-Tree: An Efficient Index Structure for Regular Expressions. Chee Yong Chan,Minos N. Garofalakis,Rajeev Rastogi 2002 "Abstract.Due to their expressive power, regular expressions (REs) are quickly becoming an integral part of language specifications for several important application scenarios. Many of these applications have to manage huge databases of RE specifications and need to provide an effective matching mechanism that, given an input string, quickly identifies the REs in the database that match it. In this paper, we propose the RE-tree, a novel index structure for large databases of RE specifications. Given an input query string, the RE-tree speeds up the retrieval of matching REs by focusing the search and comparing the input string with only a small fraction of REs in the database. Even though the RE-tree is similar in spirit to other tree-based structures that have been proposed for indexing multidimensional data, RE indexing is significantly more challenging since REs typically represent infinite sets of strings with no well-defined notion of spatial locality. To address these new challenges, our RE-tree index structure relies on novel measures for comparing the relative sizes of infinite regular languages. We also propose innovative solutions for the various RE-tree operations including the effective splitting of RE-tree nodes and computing a ""tight"" bounding RE for a collection of REs. 
Finally, we demonstrate how sampling-based approximation algorithms can be used to significantly speed up the performance of RE-tree operations. Preliminary experimental results with moderately large synthetic data sets indicate that the RE-tree is effective in pruning the search space and easily outperforms naive sequential search approaches." VLDB Optimizing the Secure Evaluation of Twig Queries. SungRan Cho,Sihem Amer-Yahia,Laks V. S. Lakshmanan,Divesh Srivastava 2002 "The rapid emergence of XML as a standard for data exchange over the Web has led to considerable interest in the problem of securing XML documents. In this context, query evaluation engines need to ensure that user queries only use and return XML data the user is allowed to access. These added access control checks can considerably increase query evaluation time. In this paper, we consider the problem of optimizing the secure evaluation of XML twig queries. We focus on the simple, but useful, multi-level access control model, where a security level can be either specified at an XML element, or inherited from its parent. For this model, secure query evaluation is possible by rewriting the query to use a recursive function that computes an element's security level. Based on security information in the DTD, we devise efficient algorithms that optimally determine when the recursive check can be eliminated, and when it can be simplified to just a local check on the element's attributes, without violating the access control policy. Finally, we experimentally evaluate the performance benefits of our techniques using a variety of XML data and queries." VLDB Streaming Queries over Streaming Data. Sirish Chandrasekaran,Michael J. Franklin 2002 Recent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real-time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also require the processing of data that arrives prior to query submission or during a period of disconnection. For such applications, we have developed PSoup, a system that combines the processing of ad-hoc and continuous queries by treating data and queries symmetrically, allowing new queries to be applied to old data and new data to be applied to old queries. PSoup also supports intermittent connectivity by separating the computation of query results from the delivery of those results. PSoup builds on adaptive query processing techniques developed in the Telegraph project at UC Berkeley. In this paper, we describe PSoup and present experiments that demonstrate the effectiveness of our approach. VLDB Effective Change Detection Using Sampling. Junghoo Cho,Alexandros Ntoulas 2002 For a large-scale data-intensive environment, such as the World-Wide Web or data warehousing, we often make local copies of remote data sources. Due to limited network and computational resources, however, it is often difficult to monitor the sources constantly to check for changes and to download changed data items to the copies. In this scenario, our goal is to detect as many changes as we can using the fixed download resources that we have. In this paper we propose three sampling-based download policies that can identify more changed data items effectively. In our sampling-based approach, we first sample a small number of data items from each data source and download more data items from the sources with more changed samples. 
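A minimal sketch of that sample-then-allocate idea is given below in Python. The source representation and the proportional allocation rule are assumptions made for illustration; the paper's policies differ in how samples are taken and how the download budget is distributed.

```python
import random

def allocate_downloads(sources, sample_size, budget):
    # `sources` maps a source name to (item_ids, has_changed), where has_changed(item_id)
    # is a hypothetical stand-in for probing the remote source for a single item.
    ratios = {}
    for name, (items, has_changed) in sources.items():
        sample = random.sample(items, min(sample_size, len(items)))
        # estimated fraction of changed items at this source
        ratios[name] = sum(has_changed(i) for i in sample) / max(len(sample), 1)
    total = sum(ratios.values()) or 1.0
    # split the fixed download budget in proportion to the estimated change ratios
    return {name: int(budget * r / total) for name, r in ratios.items()}

# usage with two synthetic sources: one volatile, one mostly static
volatile = (list(range(1000)), lambda i: i % 2 == 0)   # roughly 50% changed
static   = (list(range(1000)), lambda i: i % 50 == 0)  # roughly 2% changed
print(allocate_downloads({"volatile": volatile, "static": static},
                         sample_size=30, budget=200))
```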
We analyze the effectiveness of the sampling-based policies and compare our proposed policies to existing ones, including the state-of-the-art frequency-based policy in [8, 11]. Our experiments on synthetic and real-world data will show the relative merits of various policies and the great potential of our sampling-based policy. In certain cases, our sampling-based policy could download twice as many changed items as the best existing policy. VLDB A Bandwidth Model for Internet Search. Axel Uhl 2002 "In this paper a formal model for the domain of Internet search is presented that makes it possible to quantify the relations between important parameters of a distributed search architecture. Among these are physical network parameters, query frequency, required currency of search results, change rate of the data to be searched, logical network topology, and total bandwidth consumption for answering one query. The model is then used to compute many important relations between the various parameters. The results can be used to quantitatively assess, streamline, and optimize distributed Internet search architectures. The results back the general perception that a centralized approach to Internet-scale search will no longer be able to provide the desired coverage and currency, especially given that the Internet's content keeps growing much faster than the bandwidth available to index it. Using a hierarchical distribution approach and using change-based update notifications instead of polling for changes makes it possible to address sets of objects that are several orders of magnitude larger than what is possible with a centralized approach. Yet, using such an approach does not significantly increase the total bandwidth required for a single query per object reached by the search." VLDB Profiling and Internet Connectivity in Automotive Environments. Mariano Cilia,P. Hasselmayer,Alejandro P. Buchmann 2002 This demo combines active DB technology in open, heterogeneous environments with the Web presence requirements of nomadic users. It illustrates these through profiling of users and Internet-enabled vehicles. A scenario is developed in which useful functionality is provided, such as instrument adjustments, maintenance and diagnostic information handling with the corresponding workflows, and convenience features, such as position-dependent language translation support and traffic information. The customization mechanism relies on an active functionality service. VLDB Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment. Jack G. Conrad,Xi S. Guo,Peter Jackson,Monem Meziou 2002 "The continued growth of very large data environments such as Westlaw, Dialog, and the World Wide Web increases the importance of effective and efficient database selection and searching. Recent research has focused on autonomous and automatic collection selection, searching, and results merging in distributed environments. These studies often rely on TREC data and queries for experimentation. We have extended this work to West's on-line production environment where thousands of legal, financial and news databases are accessed by up to a quarter-million professional users each day. Using the WIN natural language search engine, a cousin to UMass's INQUERY, along with a collection retrieval inference network (CORI) to provide database scoring, we examine the effect that a set of optimized parameters has on database selection performance.
We also compare current language modeling techniques to this approach. Traditionally, West's information has been structured over 15,000 online databases, representing roughly 6 terabytes of textual data. Given the expense of running global searches in this environment, it is usually not practical to perform full document retrieval over the entire collection. It is therefore necessary to create a new infrastructure to support automatic database selection in the service of broader searching. In this research, we represent our operational environment in two distinct ways. First, we characterize the underlying physical databases that serve as a foundation for the entire Westlaw search system. Second, we create a rearchitected set of logical document collections that corresponds to classes of high level organizational concepts such as jurisdiction, practice area, and document-type. Keeping the end-user in mind, we focus on performance issues relating to optimal database selection, where domain experts have provided complete pre-hoc relevance judgments for collections characterized under each of our physical and logical database models." VLDB Multi-Dimensional Regression Analysis of Time-Series Data Streams. Yixin Chen,Guozhu Dong,Jiawei Han,Benjamin W. Wah,Jianyong Wang 2002 Real-time production systems and other dynamic environments often generate tremendous (potentially infinite) amount of stream data; the volume of data is too huge to be stored on disks or scanned multiple times. Can we perform on-line, multi-dimensional analysis and data mining of such data to alert people about dramatic changes of situations and to initiate timely, high-quality responses? This is a challenging task. In this paper, we investigate methods for on-line, multi-dimensional regression analysis of time-series stream data, with the following contributions: (1) our analysis shows that only a small number of compressed regression measures instead of the complete stream of data need to be registered for multi-dimensional linear regression analysis, (2) to facilitate on-line stream data analysis, a partially materialized data cube model, with regression as measure, and a tilt time frame as its time dimension, is proposed to minimize the amount of data to be retained in memory or stored on disks, and (3) an exception-guided drilling approach is developed for on-line, multi-dimensional exception-based regression analysis. Based on this design, algorithms are proposed for efficient analysis of time-series data streams. Our performance study compares the proposed algorithms and identifies the most memory- and time- efficient one for multi-dimensional stream data analysis. VLDB Comparing Data Streams Using Hamming Norms (How to Zero In). Graham Cormode,Mayur Datar,Piotr Indyk,S. Muthukrishnan 2002 "Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases, and instead must be processed ""on the fly"" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams, and hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalises ideas that are used throughout data processing. 
When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the ""l0 sketch"" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points." VLDB REFEREE: An Open Framework for Practical Testing of Recommender Systems using ResearchIndex. Dan Cosley,Steve Lawrence,David M. Pennock 2002 Automated recommendation (e.g., personalized product recommendation on an ecommerce web site) is an increasingly valuable service associated with many databases--typically online retail catalogs and web logs. Currently, a major obstacle for evaluating recommendation algorithms is the lack of any standard, public, real-world testbed appropriate for the task. In an attempt to fill this gap, we have created REFEREE, a framework for building recommender systems using ResearchIndex--a huge online digital library of computer science research papers--so that anyone in the research community can develop, deploy, and evaluate recommender systems relatively easily and quickly. Research Index is in many ways ideal for evaluating recommender systems, especially so-called hybrid recommenders that combine information filtering and collaborative filtering techniques. The documents in the database are associated with a wealth of content information (author, title, abstract, full text) and collaborative information (user behaviors), as well as linkage information via the citation structure. Our framework supports more realistic evaluation metrics that assess user buy-in directly, rather than resorting to offline metrics like prediction accuracy that may have little to do with end user utility. The sheer scale of ResearchIndex (over 500,000 documents with thousands of user accesses per hour) will force algorithm designers to make real-world trade-offs that consider performance, not just accuracy. We present our own tradeoff decisions in building an example hybrid recommender called PD-Live. The algorithm uses content-based similarity information to select a set of documents from which to recommend, and collaborative information to rank the documents. PD-Live performs reasonably well compared to other recommenders in ResearchIndex. VLDB SQL Memory Management in Oracle9i. Benoît Dageville,Mohamed Zaït 2002 Complex database queries require the use of memory-intensive operators like sort and hash-join. Those operators need memory, also referred to as SQL memory, to process their input data. For example, a sort operator uses a work area to perform the in-memory sort of a set of rows. The amount of memory allocated by these operators greatly affects their performance. However, there is only a finite amount of memory available in the system, shared by all concurrent operators. The challenge for database systems is to design a fair and efficient strategy to manage this memory. 
Commercial database systems rely on database administrators (DBAs) to supply an optimal setting for configuration parameters that are internally used to decide how much memory to allocate to a given database operator. However, database systems continue to be deployed in new areas, e.g., e-commerce, and the database applications are increasingly complex, e.g., to provide more functionality and support more users. One important consequence is that the application workload is very hard, if not impossible, to predict. So, expecting a DBA to find an optimal value for memory configuration parameters is not realistic. The values can only be optimal for a limited period of time while the workload is within the assumed range. Ideally, the optimal value should adapt in response to variations in the application workload. Several research projects addressed this problem in the past, but very few commercial systems proposed a comprehensive solution to managing memory used by SQL operators in a database application with a variable workload. This paper presents a new model used in Oracle9i to manage memory for database operators. This approach is automatic, adaptive and robust. We will present the architecture of the memory manager, the internal algorithms, and a performance study showing its superiority. VLDB Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm. Jens-Peter Dittrich,Bernhard Seeger,David Scot Taylor,Peter Widmayer 2002 "Many state-of-the-art join techniques require the input relations to be almost fully sorted before the actual join processing starts. Thus, these techniques start producing first results only after a considerable time period has passed. This blocking behaviour is a serious problem when subsequent operators have to stop processing in order to wait for first results of the join. Furthermore, this behaviour is not acceptable if the result of the join is visualized and/or requires user interaction. These are typical scenarios for data mining applications. The 'off-time' of existing techniques even increases with growing problem sizes. In this paper, we propose a generic technique called Progressive Merge Join (PMJ) that eliminates the blocking behaviour of sort-based join algorithms. The basic idea behind PMJ is to have the join produce results as early as the external mergesort generates initial runs. Hence, it is possible for PMJ to return first results very early. This paper provides the basic algorithms and the generic framework of PMJ, as well as use-cases for different types of joins. Moreover, we provide a generic online selectivity estimator with probabilistic quality guarantees. For similarity joins in particular, the first non-blocking join algorithms are derived from applying PMJ to the state-of-the-art techniques. We have implemented PMJ as part of an object-relational cursor algebra. A set of experiments shows that a substantial amount of results are produced even before the input relations would have been sorted. We observed only a moderate increase in the total runtime compared to the blocking counterparts." VLDB Foundation Matters. C. J. Date 2002 "This talk is meant as a wake-up call ... The foundation of the database field is, of course, the relational model.
Sad to say, however, there are some in the database community--certainly in industry, and to some extent in academia also--who do not seem to be as familiar with that model as they ought to be; there are others who seem to think it is not very interesting or relevant to the day-today business of earning a living; and there are still others who seem to think all of the foundation-level problems have been solved. Indeed, there seems to be a widespread feeling that ""the world has moved on,"" so to speak, and the relational model as such is somehow passé. In my opinion, nothing could be further from the truth! In this talk, I want to sketch the results of some of my own investigations into database foundations over the past twenty years or so; my aim is to convey some of the excitement and abiding interest that is still to be found in those investigations, with a view--I hope--to inspiring others in the field to become involved in such activities. First of all, almost all of the ideas I will be covering either are part of, or else build on top of, The Third Manifesto [1]. The Third Manifesto is a detailed proposal for the future direction of data and DBMSs. Like Codd's original papers on the relational model, it can be seen as an abstract blueprint for the design of a DBMS and the language interface to such a DBMS. Among many other things: • It shows that the relational model--and I do mean the relational model, not SQL--is a necessary and sufficient foundation on which to build ""object/relational"" DBMSs (sometimes called universal servers). • It also points out certain blunders that can unfortunately be observed in some of today's products (not to mention the SQL:1999 standard). • And it explores in depth the idea that a relational database, along with the relational operators, is really a logical system and shows how that idea leads to a solution to the view updating problem, among other things." VLDB Lightweight Flexible Isolation for Language-based Extensible Systems. Laurent Daynès,Grzegorz Czajkowski 2002 Safe programming languages encourage the development of dynamically extensible systems, such as extensible Web servers and mobile agent platforms. Although protection is of utmost importance in these settings, current solutions do not adequately address fault containment. This paper advocates an approach to protection where transactions act as protection domains. This enables direct sharing of objects while protecting against unauthorized accesses and failures of authorized components. The main questions about this approach are what transaction models translate best into protection mechanisms suited for extensible language-based systems and what is the impact of transaction-based protection on performance. A programmable isolation engine has been integrated with the runtime of a safe programming language in order to allow quick experimentation with a variety of isolation models and to answer both questions. This paper reports on the techniques for flexible fine-grained locking and undo devised to meet the functional and performance requirements of transaction-based protection. Performance analysis of a prototype implementation shows that (i) sophisticated concurrency controls do not translate into higher overheads, and (ii) the ability to memoize locking operations is crucial to performance. VLDB Sensor Data Mining: Similarity Search and Pattern Analysis. Christos Faloutsos 2002 Sensor Data Mining: Similarity Search and Pattern Analysis. VLDB Plan Selection Based on Query Clustering. 
Antara Ghosh,Jignashu Parikh,Vibhuti S. Sengar,Jayant R. Haritsa 2002 Query optimization is a computationally intensive process, especially for complex queries. We present here a tool, called PLASTIC, that can be used by query optimizers to amortize the optimization cost. Our scheme groups similar queries into clusters and uses the optimizer-generated plan for the cluster representative to execute all future queries assigned to the cluster. Query similarity is evaluated based on a comparison of query structures and the associated table schemas and statistics, and a classifier is employed for efficient cluster assignments. Experiments with a variety of queries on a commercial optimizer show that PLASTIC predicts the correct plan choice in most cases, thereby providing significantly improved query optimization times. Further, when errors are made, the additional execution cost incurred due to the sub-optimal plan choices is marginal. VLDB How to Summarize the Universe: Dynamic Maintenance of Quantiles. Anna C. Gilbert,Yannis Kotidis,S. Muthukrishnan,Martin Strauss 2002 Order statistics, i.e., quantiles, are frequently used in databases both at the database server as well as the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data mining. We present a new algorithm for dynamically computing quantiles of a relation subject to insert as well as delete operations. The algorithm monitors the operations and maintains a simple, small-space representation (based on random subset sums or RSSs) of the underlying data distribution. Using these RSSs, we can quickly estimate, without having to access the data, all the quantiles, each guaranteed to be accurate to within user-specified precision. Previously-known one-pass quantile estimation algorithms that provide similar quality and performance guarantees can not handle deletions. Other algorithms that can handle delete operations cannot guarantee performance without rescanning the entire database. We present the algorithm, its theoretical performance analysis and extensive experimental results with synthetic and real datasets. Independent of the rates of insertions and deletions, our algorithm is remarkably precise at estimating quantiles in small space, as our experiments demonstrate. VLDB Querying and Mining Data Streams: You Only Get One Look. Minos N. Garofalakis,Johannes Gehrke 2002 Querying and Mining Data Streams: You Only Get One Look. VLDB A New Passenger Support System for Public Transport using Mobile Database Access. Koichi Goto,Yahiko Kambayashi 2002 "We have been developing a mobile passenger support system for public transport. Passengers can make their travel plans and purchase necessary tickets by accessing databases via the system. After starting the travel, a mobile terminal checks the travel schedule of its user by accessing several databases and gathering various kinds of information. In this application field, many kinds of data must be handled. Examples of such data are route information, fare information, area map, station map, planned operation schedule, real-time operation schedule, vehicle facilities and so on. Depending on the user's situation, different information should be supplied and personalized. 
In this paper, we propose a new mechanism to support passengers using multi-channel data communication environments. On the other hand, transport systems can gather information about situations and demands of users and modify the services offered to users. We also describe a prototype system developed for visually handicapped passengers and the results of tests in an actual railway station." VLDB Efficient Algorithms for Processing XPath Queries. Georg Gottlob,Christoph Koch,Reinhard Pichler 2002 Our experimental analysis of several popular XPath processors reveals a striking fact: Query evaluation in each of the systems requires time exponential in the size of queries in the worst case. We show that XPath can be processed much more efficiently, and propose main-memory algorithms for this problem with polynomial-time combined query evaluation complexity. Moreover, we show how the main ideas of our algorithm can be profitably integrated into existing XPath processors. Finally, we present two fragments of XPath for which linear-time query processing algorithms exist and another fragment with linear-space/quadratic-time query processing. VLDB Searching on the Secondary Structure of Protein Sequences. Laurie Hammel,Jignesh M. Patel 2002 In spite of the many decades of progress in database research, surprisingly, scientists in the life sciences community still struggle with inefficient and awkward tools for querying biological data sets. This work highlights a specific problem involving searching large volumes of protein data sets based on their secondary structure. In this paper we define an intuitive query language that can be used to express queries on secondary structure and develop several algorithms for evaluating these queries. We implement these algorithms both in Periscope, a native system that we have built, and in a commercial ORDBMS. We show that the choice of algorithms can have a significant impact on query performance. As part of the Periscope implementation we have also developed a framework for optimizing these queries and for accurately estimating the costs of the various query evaluation plans. Our performance studies show that the proposed techniques are very efficient in the Periscope system and can provide scientists with interactive secondary structure querying options even on large protein data sets. VLDB enTrans: A System for Flexible Consistency Maintenance in Directory Applications. Anandi Herlekar,Atul Deopujari,Krithi Ramamritham,Shaymsunder Gopale,Shridhar Shukla 2002 enTrans: A System for Flexible Consistency Maintenance in Directory Applications. VLDB Viator - A Tool Family for Graphical Networking and Data View Creation. Stephan Heymann,Katja Tham,Axel Kilian,Gunnar Wegner,Peter Rieger,Dieter Merkel,Johann Christoph Freytag 2002 "Web-based data sources, particularly in Life Sciences, grow in diversity and volume. Most of the data collections are equipped with common document search, hyperlink and retrieval utilities. However, users' wishes often exceed simple document-oriented inquiries. With respect to complex scientific issues it becomes imperative to aid knowledge gain from huge interdependent and thus hard-to-comprehend data collections more efficiently. Especially data categories that constitute relationships between two or more items require potent set-oriented content management, visualization and navigation utilities. Moreover, strategies are needed to discover correlations within and between data sets of independent origin.
Wherever data sets possess intrinsic graph structure (e.g. of tree, forest or network type) or can be transposed into such, graphical support is considered indispensable. The Viator tool family presented during this demo depicts large graphs on the whole in a hyperbolic geometry and provides means for set-oriented context mining as well as for correlation discovery across distinct data sets at once. Its utility is proven for but not restricted to data from functional genome, transcriptome and proteome research. Viator versions are being operated either as user-end database applications or as template-fed stand-alone solutions for graphical networking." VLDB DISCOVER: Keyword Search in Relational Databases. Vagelis Hristidis,Yannis Papakonstantinou 2002 DISCOVER operates on relational databases and facilitates information discovery on them by allowing its user to issue keyword queries without any knowledge of the database schema or of SQL. DISCOVER returns qualified joining networks of tuples, that is, sets of tuples that are associated because they join on their primary and foreign keys and collectively contain all the keywords of the query. DISCOVER proceeds in two steps. First the Candidate Network Generator generates all candidate networks of relations, that is, join expressions that generate the joining networks of tuples. Then the Plan Generator builds plans for the efficient evaluation of the set of candidate networks, exploiting the opportunities to reuse common subexpressions of the candidate networks. We prove that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema. We prove that the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete. We provide a greedy algorithm and we show that it provides near-optimal plan execution time cost. Our experimentation also provides hints on tuning the greedy algorithm. VLDB Advanced Database Technologies in a Diabetic Healthcare System. Wynne Hsu,Mong-Li Lee,Beng Chin Ooi,Pranab Kumar Mohanty,Keng Lik Teo,Chenyi Xia 2002 With the increased emphasis on healthcare worldwide, the issue of being able to efficiently and effectively manage large amount of patient information in diverse medium becomes critical. In this work, we will demonstrate how advanced database technologies are used in RETINA, an integrated system for the screening and management of diabetic patients. RETINA captures the profile and retinal images of diabetic patients and automatically processes the retina fundus images to extract interesting features. Given the wealth of information acquired, we employ novel techniques to determine the risk profile of patients for better patient care management and to target significant subpopulations for more detailed studies. The results of such studies can be used to introduce effective preventive measures for the targeted sub-populations. VLDB Parametric Query Optimization for Linear and Piecewise Linear Cost Functions. Arvind Hulgeri,S. Sudarshan 2002 The cost of a query plan depends on many parameters, such as predicate selectivities and available memory, whose values may not be known at optimization time. Parametric query optimization (PQO) optimizes a query into a number of candidate plans, each optimal for some region of the parameter space. We first propose a solution for the PQO problem for the case when the cost functions are linear in the given parameters. 
This solution is minimally intrusive in the sense that an existing query optimizer can be used with minor modifications: the solution invokes the conventional query optimizer multiple times, with different parameter values. We then propose a solution for the PQO problem for the case when the cost functions are piecewise-linear in the given parameters. The solution is based on modification of an existing query optimizer. This solution is quite general, since arbitrary cost functions can be approximated to piecewise linear form. Both the solutions work for an arbitrary number of parameters. VLDB Joining Ranked Inputs in Practice. Ihab F. Ilyas,Walid G. Aref,Ahmed K. Elmagarmid 2002 "Joining ranked inputs is an essential requirement for many database applications, such as ranking search results from multiple search engines and answering multi-feature queries for multimedia retrieval systems. We introduce a new practical pipelined query operator, termed NRA-RJ, that produces a global rank from input ranked streams based on a score function. The output of NRA-RJ can serve as a valid input to other NRA-RJ operators in the query pipeline. Hence, the NRA-RJ operator can support a hierarchy of join operations and can be easily integrated in query processing engines of commercial database systems. The NRA-RJ operator bridges Fagin's optimal aggregation algorithm into a practical implementation and contains several optimizations that address performance issues. We compare the performance of NRA-RJ against recent rank join algorithms. Experimental results demonstrate the performance trade-offs among these algorithms. The experimental results are based on an empirical study applied to a medical video application on top of a prototype database system. The study reveals important design options and shows that the NRA-RJ operator outperforms other pipelined rank join operators when the join condition is an equi-join on key attributes." VLDB Wireless Graffiti - Data, Data Everywhere Matters. Tomasz Imielinski,B. R. Badrinath 2002 Wireless Graffiti - Data, Data Everywhere Matters. VLDB Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. Panagiotis G. Ipeirotis,Luis Gravano 2002 "Many valuable text databases on the web have non-crawlable contents that are ""hidden"" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from ""uncooperative"" databases by using ""focused query probes,"" which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. 
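The probe-and-summarize skeleton behind the content summaries described above can be sketched as follows in Python. Here search_interface is a hypothetical stand-in for a database's public search form, and the word-level document-frequency counting is a simplification made for illustration; the actual technique adapts the probes along a topic hierarchy and also estimates absolute document frequencies.

```python
from collections import Counter

def build_content_summary(search_interface, probe_queries, docs_per_probe=10):
    # search_interface(query, k) is assumed to return up to k matching documents
    # as plain strings, mimicking an "uncooperative" database's search form.
    df = Counter()
    sampled_docs = 0
    for q in probe_queries:
        for doc in search_interface(q, docs_per_probe):
            sampled_docs += 1
            df.update(set(doc.lower().split()))  # document frequency, not term frequency
    return {"sampled_docs": sampled_docs, "df": df}

# usage against a toy in-memory "database"
corpus = ["diabetes insulin therapy", "hypertension drug trial", "insulin pump study"]
toy_interface = lambda q, k: [d for d in corpus if q in d][:k]
print(build_content_summary(toy_interface, ["insulin", "trial"]))
```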
Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts." VLDB Exploiting Versions for On-line Data Warehouse Maintenance in MOLAP Servers. Heum-Geun Kang,Chin-Wan Chung 2002 A data warehouse is an integrated database whose data is collected from several data sources, and supports on-line analytical processing (OLAP). Typically, a query to the data warehouse tends to be complex and involves a large volume of data. To keep the data at the warehouse consistent with the source data, changes to the data sources should be propagated to the data warehouse periodically. Because the propagation of the changes (maintenance) is batch processing, it takes long time. Since both query transactions and maintenance transactions are long and involve large volumes of data, traditional concurrency control mechanisms such as two-phase locking are not adequate for a data warehouse environment. We propose a multi-version concurrency control mechanism suited for data warehouses which use multi-dimensional OLAP (MOLAP) servers. We call the mechanism multiversion concurrency control for data warehouses (MVCCDW). To our knowledge, our work is the first attempt to exploit versions for online data warehouse maintenance in a MOLAP environment. MVCC-DW guarantees the serializability of concurrent transactions. Transactions running under the mechanism do not block each other and do not need to place locks. VLDB Processing Star Queries on Hierarchically-Clustered Fact Tables. Nikos Karayannidis,Aris Tsois,Timos K. Sellis,Roland Pieringer,Volker Markl,Frank Ramsak,Robert Fenk,Klaus Elhardt,Rudolf Bayer 2002 Star queries are the most prevalent kind of queries in data warehousing, OLAP and business intelligence applications. Thus, there is an imperative need for efficiently processing star queries. To this end, a new class of fact table organizations has emerged that exploits path-based surrogate keys in order to hierarchically cluster the fact table data of a star schema [DRSN98, MRB99, KS01]. In the context of these new organizations, star query processing changes radically. In this paper, we present a complete abstract processing plan that captures all the necessary steps in evaluating such queries over hierarchically clustered fact tables. Furthermore, we present optimizations for surrogate key processing and a novel early grouping transformation for grouping on the dimension hierarchies. Our algorithms have been already implemented in a commercial relational database management system (RDBMS) and the experimental evaluation, as well as customer feedback, indicates speedups of orders of magnitude for typical star queries in real world applications. VLDB Updates for Structure Indexes. Raghav Kaushik,Philip Bohannon,Jeffrey F. Naughton,Pradeep Shenoy 2002 The problem of indexing path queries in semistructured/XML databases has received considerable attention recently, and several proposals have advocated the use of structure indexes as supporting data structures for this problem. In this paper, we investigate efficient update algorithms for structure indexes. We study two kinds of updates -- the addition of a subgraph, intended to represent the addition of a new file to the database, and the addition of an edge, to represent a small incremental change. 
We focus on three instances of structure indexes that are based on the notion of graph bisimilarity. We propose algorithms to update the bisimulation partition for both kinds of updates and show how they extend to these indexes. Our experiments on two real world data sets show that our update algorithms are an order of magnitude faster than dropping and rebuilding the index. To the best of our knowledge, no previous work has addressed updates for structure indexes based on graph bisimilarity. VLDB ServiceGlobe: Distributing E-Services Across the Internet. Markus Keidl,Stefan Seltzsam,Konrad Stocker,Alfons Kemper 2002 ServiceGlobe: Distributing E-Services Across the Internet. VLDB Exact Indexing of Dynamic Time Warping. Eamonn J. Keogh 2002 The problem of indexing time series has attracted much research interest in the database community. Most algorithms used to index time series utilize the Euclidean distance or some variation thereof. However, it has been forcefully shown that the Euclidean distance is a very brittle distance measure. Dynamic Time Warping (DTW) is a much more robust distance measure for time series, allowing similar shapes to match even if they are out of phase in the time axis. Because of this flexibility, DTW is widely used in science, medicine, industry and finance. Unfortunately, however, DTW does not obey the triangular inequality, and thus has resisted attempts at exact indexing. Instead, many researchers have introduced approximate indexing techniques, or abandoned the idea of indexing and concentrated on speeding up sequential search. In this work we introduce a novel technique for the exact indexing of DTW. We prove that our method guarantees no false dismissals and we demonstrate its vast superiority over all competing approaches in the largest and most comprehensive set of time series indexing experiments ever undertaken. VLDB Foundations of Preferences in Database Systems. Werner Kießling 2002 "Personalization of e-services poses new challenges to database technology, demanding a powerful and flexible modeling technique for complex preferences. Preference queries have to be answered cooperatively by treating preferences as soft constraints, attempting a best possible match-making. We propose a strict partial order semantics for preferences, which closely matches people's intuition. A variety of natural and sophisticated preferences are covered by this model. We show how to inductively construct complex preferences by means of various preference constructors. This model is the key to a new discipline called preference engineering and to a preference algebra. Given the Best-Matches-Only (BMO) query model, we investigate how complex preference queries can be decomposed into simpler ones, preparing the ground for divide & conquer algorithms. Standard SQL and XPATH can be extended seamlessly by such preferences (presented in detail in the companion paper [15]). We believe that this model is appropriate to extend database technology towards effective support of personalization." VLDB Preference SQL - Design, Implementation, Experiences. Werner Kießling,Gerhard Köstler 2002 Current search engines can hardly cope adequately with fuzzy predicates defined by complex preferences. The biggest problem of search engines implemented with standard SQL is that SQL does not directly understand the notion of preferences.
Preference SQL extends SQL by a preference model based on strict partial orders (presented in more detail in the companion paper [Kie02]), where preference queries behave like soft selection constraints. Several built-in base preference types and the powerful Pareto operator, combined with the adherence to declarative SQL programming style, guarantee great programming productivity. The Preference SQL optimizer performs an efficient rewriting into standard SQL, including a high-level implementation of the skyline operator for Pareto-optimal sets. This pre-processor approach enables seamless application integration, making Preference SQL available on all major SQL platforms. Several commercial B2C portals are powered by Preference SQL. Its benefits comprise cooperative query answering and smart customer advice, leading to higher e-customer satisfaction and shorter development times of personalized search engines. We report practical experiences ranging from m-commerce and comparison shopping to a large-scale performance test for a job portal. VLDB The Rubicon of Smart Data. Roger King 2002 The Rubicon of Smart Data. VLDB Reverse Nearest Neighbor Aggregates Over Data Streams. Flip Korn,S. Muthukrishnan,Divesh Srivastava 2002 "Reverse Nearest Neighbor (RNN) queries have been studied for finite, stored data sets and are of interest for decision support. However, in many applications such as fixed wireless telephony access and sensor-based highway traffic monitoring, the data arrives in a stream and cannot be stored. Exploratory analysis on this data stream can be formalized naturally using the notion of RNN aggregates (RNNAs), which involve the computation of some aggregate (such as COUNT or MAX DISTANCE) over the set of reverse nearest neighbor "clients" associated with each "server". In this paper, we introduce and investigate the problem of computing three types of RNNA queries over data streams of "client" locations: (i) Max-RNNA: given K servers, return the maximum RNNA over all clients to their closest servers; (ii) List-RNNA: given K servers, return a list of RNNAs over all clients to each of the K servers; and (iii) Opt-RNNA: find a subset of at most K servers whose RNNAs are below a given threshold. While exact computation of these queries is not possible in the data stream model, we present efficient algorithms to approximately answer these RNNA queries over data streams with error guarantees. We provide analytical proofs of constant factor approximations for many RNNA queries, and complement our analyses with experimental evidence of the accuracy of our techniques." VLDB Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. Donald Kossmann,Frank Ramsak,Steffen Rost 2002 Skyline queries ask for a set of interesting points from a potentially large set of data points. If we are traveling, for instance, a restaurant might be interesting if there is no other restaurant which is nearer, cheaper, and has better food. Skyline queries retrieve all such interesting restaurants so that the user can choose the most promising one. In this paper, we present a new online algorithm that computes the Skyline. Unlike most existing algorithms that compute the Skyline in a batch, this algorithm returns the first results immediately, produces more and more results continuously, and allows the user to give preferences during the running time of the algorithm so that the user can control what kind of results are produced next (e.g., rather cheap or rather near restaurants).
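For readers unfamiliar with skyline queries, the dominance test and a basic batch (block-nested-loops style) skyline computation are sketched below in Python. This is only the classic formulation, shown to make the notion concrete; the paper's contribution is an online, nearest-neighbour-based algorithm that emits results progressively and lets the user steer which part of the skyline appears next.

```python
def dominates(a, b):
    # a dominates b if a is at least as good in every dimension and strictly
    # better in at least one (here, smaller values are better, e.g. price and distance)
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    # block-nested-loops style: keep a window of current skyline candidates
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue                                   # p is dominated, discard it
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window

# restaurants as (price, distance_km): cheaper and nearer is better
restaurants = [(30, 2.0), (20, 3.5), (50, 0.5), (25, 2.0), (40, 4.0)]
print(skyline(restaurants))   # [(20, 3.5), (50, 0.5), (25, 2.0)]
```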
VLDB Improving Data Access of J2EE Applications by Exploiting Asynchronous Messaging and Caching Services. Samuel Kounev,Alejandro P. Buchmann 2002 The J2EE platform provides a variety of options for making business data persistent using DBMS technology. However, the integration with existing backend database systems has proven to be of crucial importance for the scalability and performance of J2EE applications, because modern e-business systems are extremely data-intensive. As a result, the data access layer, and the link between the application server and the database server in particular, are very susceptible to turning into a system bottleneck. In this paper we use the ECperf benchmark as an example of a realistic application in order to illustrate the problems mentioned above and discuss how they could be approached and eliminated. In particular, we show how asynchronous, message-based processing could be exploited to reduce the load on the DBMS and improve system performance, scalability and reliability. Furthermore, we discuss the major issues related to the correct use of entity beans (the components provided by J2EE for modelling persistent data) and present a number of methods to optimize their performance utilizing caching mechanisms. We have evaluated the proposed techniques through measurements and have documented the performance gains that they provide. VLDB The Generalized MDL Approach for Summarization. Laks V. S. Lakshmanan,Raymond T. Ng,Christine Xing Wang,Xiaodong Zhou,Theodore Johnson 2002 "There are many applications in OLAP and data analysis where we identify regions of interest. For example, in OLAP, an analysis query involving aggregate sales performance of various products in different locations and seasons could help identify interesting cells, such as cells of a data cube having aggregate sales higher than a threshold. While a normal answer to such a query merely returns all interesting cells, it may be far more informative to the user if the system returns summaries or descriptions of regions formed from the identified cells. The Minimum Description Length (MDL) principle is a well-known strategy for finding such region descriptions. In this paper, we propose a generalization of the MDL principle, called GMDL, and show that GMDL leads to fewer regions than MDL, and hence more concise ""answers"" returned to the user. The key idea is that a region may contain ""don't care"" cells (up to a global maximum), if these ""don't care"" cells help to form bigger summary regions, leading to a more concise overall summary. We study the problem of generating minimal region descriptions under the GMDL principle for two different scenarios. In the first, all dimensions of the data space are spatial. In the second scenario, all dimensions are categorical and organized in hierarchies. We propose region finding algorithms for both scenarios and evaluate their run time and compression performance using detailed experimentation. Our results show the effectiveness of the GMDL principle and the proposed algorithms." VLDB Quotient Cube: How to Summarize the Semantics of a Data Cube. Laks V. S. Lakshmanan,Jian Pei,Jiawei Han 2002 "Partitioning a data cube into sets of cells with ""similar behavior"" often better exposes the semantics in the cube. E.g., if we find that average boot sales in the West 10th store of Walmart was the same for winter as for the whole year, it signifies something interesting about the trend of boot sales in that location in that year.
In this paper, we are interested in finding succinct summaries of the data cube, exploiting regularities present in the cube, with a clear basis. We would like the summary: (i) to be as concise as possible, (ii) to itself form a lattice preserving the rollup/drilldown semantics of the cube, and (iii) to allow the original cube to be fully recovered. We illustrate the utility of solving this problem and discuss the inherent challenges. We develop techniques for partitioning cube cells for obtaining succinct summaries, and introduce the quotient cube. We give efficient algorithms for computing it from a base table. For monotone aggregate functions (e.g., COUNT, MIN, MAX, SUM on non-negative measures, etc.), our solution is optimal (i.e., quotient cube of the least size). For nonmonotone functions (e.g., AVG), we obtain a locally optimal solution. We experimentally demonstrate the efficacy of our ideas and techniques and the scalability of our algorithms." VLDB RTMonitor: Real-Time Data Monitoring Using Mobile Agent Technologies. Kam-yiu Lam,Alan Kwan,Krithi Ramamritham 2002 RTMonitor is a real-time data management system for traffic navigation applications. In our system, mobile vehicles initiate time-constrained navigation requests and RTMonitor calculates and communicates the best paths for the clients based on the road network and real-time traffic data. The correctness of the suggested routes highly depends on how well the system can maintain temporal consistency of the traffic data. To minimize the overheads of maintaining the real-time data, RTMonitor adopts a cooperative and distributed approach using mobile agents, which can greatly reduce the amount of communication and improve the scalability of the system. To minimize the space and message overheads, we have designed a two-level traffic graph scheme to organize the real-time traffic data to support navigation requests. In the framework, the agents use an Adaptive PUSH OR PULL (APoP) scheme to maintain the temporal consistency of the traffic data. Our experiments using synthetic traffic data show that RTMonitor can provide efficient support to serve navigation requests in a timely fashion. Although several agents may be needed to serve a request, the size of each agent is very small (only a few kilobytes) and the resulting communication and processing overheads for data monitoring can be maintained within a reasonable level. VLDB The gRNA: A Highly Programmable Infrastructure for Prototyping, Developing and Deploying Genomics-Centric Applications. Amey V. Laud,Sourav S. Bhowmick,Pedro Cruz,Dadabhai T. Singh,George Rajesh 2002 The evolving challenges in life sciences research cannot all be addressed by off-the-shelf bioinformatics applications. Life scientists need to analyze their data using novel or context-sensitive approaches that might be published in recent journals and publications, or based on their own hypotheses and assumptions. The genomics Research Network Architecture (gRNA) is a highly programmable, modular environment specially designed to invigorate the development of genomics-centric tools for life sciences research. The gRNA provides the development environment in which new applications can be quickly written, and the deployment environment in which they can systematically avail of computing resources and integrate information from distributed biological data sources. VLDB A One-Pass Aggregation Algorithm with the Optimal Buffer Size in Multidimensional OLAP.
Young-Koo Lee,Kyu-Young Whang,Yang-Sae Moon,Il-Yeol Song 2002 "Aggregation is an operation that plays a key role in multidimensional OLAP (MOLAP). Existing aggregation methods in MOLAP have been proposed for file structures such as multidimensional arrays. These file structures are suitable for data with uniform distributions, but do not work well with skewed distributions. In this paper, we consider an aggregation method that uses dynamic multidimensional files adapting to skewed distributions. In these multidimensional files, the sizes of page regions vary according to the data density in these regions, and the pages that belong to a larger region are accessed multiple times while computing aggregations. To solve this problem, we first present an aggregation computation model, called the Disjoint-Inclusive Partition (DIP) computation model, that is the formal basis of our approach. Based on this model, we then present the one-pass aggregation algorithm. This algorithm computes aggregations using the one-pass buffer size, which is the minimum buffer size required for guaranteeing one disk access per page. We prove that our aggregation algorithm is optimal with respect to the one-pass buffer size under our aggregation computation model. Using the DIP computation model allows us to correctly predict the order of accessing data pages in advance. Thus, our algorithm achieves the optimal one-pass buffer size by using a buffer replacement policy, such as Belady's B0 or Toss-Immediate policies, that exploits the page access order computed in advance. Since the page access order is not known a priori in general, these policies have been known to lack practicality despite their theoretical significance. Nevertheless, in this paper, we show that these policies can be effectively used for aggregation computation. We have conducted extensive experiments. We first demonstrate that the one-pass buffer size theoretically derived is indeed correct in real environments. We then compare the performance of the one-pass algorithm with that of other algorithms. Experimental results for a real data set show that the one-pass algorithm reduces the number of disk accesses by up to 7.31 times compared with a naive algorithm. We also show that the memory requirement of our algorithm for processing the aggregation in one pass is very small, being 0.05%-0.6% of the size of the database. These results indicate that our algorithm is practically usable even for a fairly large database. We believe our work provides an excellent formal basis for investigating further issues in computing aggregations in MOLAP." VLDB Optimizing Result Prefetching in Web Search Engines with Segmented Indices. Ronny Lempel,Shlomo Moran 2002 We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages that search engines should prepare during the query processing phase. Search engine users have been observed to browse through very few pages of results for queries that they submit. This behavior of users suggests that prefetching many results upon processing an initial query is not efficient, since most of the prefetched results will not be requested by the user who initiated the search. However, a policy that abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources either. We argue that for a certain behavior of users, engines should prefetch a constant number of result pages per query.
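Editor's note: a hedged back-of-the-envelope illustration of the claim just made in the result-prefetching abstract (the abstract itself continues below). The cost model and all numbers here are assumptions made up for the example, not the paper's model: a user requests the next result page with probability p, each engine invocation pays a fixed overhead F plus a cost c per page prepared, and we look for the prefetch count k that minimizes expected cost.

    def expected_cost(k, p=0.6, F=10.0, c=1.0, max_pages=200):
        # An invocation happens whenever the requested page was not covered
        # by the previous batch of k prefetched pages.
        total, prob_reach, page = 0.0, 1.0, 1
        while page <= max_pages and prob_reach > 1e-12:
            if (page - 1) % k == 0:               # a new batch of k pages is prepared
                total += prob_reach * (F + c * k)
            prob_reach *= p                       # user asks for the next page with prob. p
            page += 1
        return total

    best = min(range(1, 21), key=expected_cost)
    print(best, round(expected_cost(best), 2))    # a small constant k wins under these numbers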
We define a concrete query processing model for search engines with segmented indices, and analyze the cost of such prefetching policies. Based on these costs, we show how to determine the constant that optimizes the prefetching policy. Our results are mostly applicable to local index partitions of the inverted files, but are also applicable to processing short queries in global index architectures. VLDB Issues and Evaluations of Caching Solutions for Web Application Acceleration. Wen-Syan Li,Wang-Pin Hsiung,Dmitri V. Kalashnikov,Radu Sion,Oliver Po,Divyakant Agrawal,K. Selçuk Candan 2002 Response time is a key differentiation among electronic commerce (e-commerce) applications. For many e-commerce applications, Web pages are created dynamically based on the current state of a business stored in database systems. Recently, the topic of Web acceleration for database-driven Web applications has drawn a lot of attention in both the research community and the commercial arena. In this paper, we analyze the factors that have an impact on the performance and scalability of Web applications. We discuss system architecture issues and describe approaches to deploying caching solutions for accelerating Web applications. We give performance measurements for network latency and various system architectures. The paper is summarized with a road map for creating high performance Web applications. VLDB I/O-Conscious Data Preparation for Large-Scale Web Search Engines. Maxim Lifantsev,Tzi-cker Chiueh 2002 Given that commercial search engines cover billions of web pages, efficiently managing the corresponding volumes of disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of performed disk accesses and seeks by maximizing the temporal locality of data access and organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory. This technique is based on our experience from building a functionally complete and fully operational web search engine called Yuntis. As such, it is in particular well suited for most data manipulation tasks in a modern web search engine and is employed throughout Yuntis. The key idea of this technique is co-ordinated partitioning of related data tables and corresponding partitioning and delayed batched execution of the transformation and querying operations that work with the data. This data and processing partitioning is naturally compatible with distributed data storage and parallel execution on a cluster of workstations. Empirical measurements on the Yuntis prototype demonstrate that our technique can improve the performance of external-memory data preparation runs by a factor of 100 versus a straightforward implementation. VLDB XPathLearner: An On-line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. Lipyeow Lim,Min Wang,Sriram Padmanabhan,Jeffrey Scott Vitter,Ronald Parr 2002 The extensible mark-up language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually done based on summary statistics about the structure of the underlying XML repository.
All previous methods require an off-line scan of the XML repository to collect the statistics. In this paper, we propose XPathLearner, a method for estimating selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner and is especially suited to queries in Internet scale applications since the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method also has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and thus can be more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using several real data sets. VLDB A-TOPSS - A Publish/Subscribe System Supporting Approximate Matching. Haifeng Liu,Hans-Arno Jacobsen 2002 A-TOPSS - A Publish/Subscribe System Supporting Approximate Matching. VLDB SMART: Making DB2 (More) Autonomic. Guy M. Lohman,Sam Lightstone 2002 "IBM's SMART (Self-Managing And Resource Tuning) project aims to make DB2 self-managing, i.e. autonomic, to decrease the total cost of ownership and penetrate new markets. Over several releases, increasingly sophisticated SMART features will ease administrative tasks such as initial deployment, database design, system maintenance, problem determination, and ensuring system availability and recovery." VLDB A Transducer-Based XML Query Processor. Bertram Ludäscher,Pratik Mukhopadhyay,Yannis Papakonstantinou 2002 The XML Stream Machine (XSM) system is a novel XQuery processing paradigm that is tuned to the efficient processing of sequentially accessed XML data (streams). The system compiles a given XQuery into an XSM, which is an XML stream transducer, i.e., an abstract device that takes as input one or more XML data streams and produces one or more output streams, potentially using internal buffers. We present a systematic way to translate XQueries into efficient XSMs: First the XQuery is translated into a network of XSMs that correspond to the basic operators of the XQuery language and exchange streams. The network is reduced to a single XSM by repeated application of an XSM composition operation that is optimized to reduce the number of tests and actions that the XSM performs as well as the number of intermediate buffers that it uses. Finally, the optimized XSM is compiled into a C program. First empirical results illustrate the performance benefits of the XSM-based processor. VLDB Extending an ORDBMS: The StateMachine Module. Wolfgang Mahnke,Christian Mathis,Hans-Peter Steiert 2002 Extensibility is one of the major benefits of object-relational database management systems. We have used this system property to implement a StateMachine Module inside an object-relational database management system. The module allows the checking of dynamic integrity constraints as well as the execution of active behavior specified with the UML. Our approach demonstrates that extensibility can effectively be applied to integrate such dynamic aspects specified with UML statecharts into an object-relational database management system. VLDB Generic Database Cost Models for Hierarchical Memory Systems. Stefan Manegold,Peter A. Boncz,Martin L.
Kersten 2002 "Accurate prediction of operator execution time is a prerequisite for database query optimization. Although extensively studied for conventional disk-based DBMSs, cost modeling in main-memory DBMSs is still an open issue. Recent database research has demonstrated that memory access is more and more becoming a significant-- if not the major--cost component of database operations. If used properly, fast but small cache memories--usually organized in cascading hierarchy between CPU and main memory--can help to reduce memory access costs. However, they make the cost estimation problem more complex. In this article, we propose a generic technique to create accurate cost functions for database operations. We identify a few basic memory access patterns and provide cost functions that estimate their access costs for each level of the memory hierarchy. The cost functions are parameterized to accommodate various hardware characteristics appropriately. Combining the basic patterns, we can describe the memory access patterns of database operations. The cost functions of database operations can automatically be derived by combining the basic patterns' cost functions accordingly. To validate our approach, we performed experiments using our DBMS prototype Monet. The results presented here confirm the accuracy of our cost models for different operations. Aside from being useful for query optimization, our models provide insight to tune algorithms not only in a main-memory DBMS, but also in a disk-based DBMS with a large main-memory buffer cache." VLDB Approximate Frequency Counts over Data Streams. Gurmeet Singh Manku,Rajeev Motwani 2002 We present algorithms for computing frequency counts exceeding a user-specified threshold over data streams. Our algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. Our algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring. We can also handle streams of variable sized sets of items exemplified by a sequence of market basket transactions at a retail store. For such streams, we describe an optimized implementation to compute frequent itemsets in a single pass. VLDB Incorporating XSL Processing into Database Engines. Guido Moerkotte 2002 The two observations that 1) many XML documents are stored in a database or generated from data stored in a database and 2) processing these documents with XSL stylesheet processors is an important, often recurring task justify a closer look at the current situation. Typically, the XML document is retrieved or constructed from the database, exported, parsed, and then processed by a special XSL processor. This cumbersome process clearly sets the goal to incorporate XSL stylesheet processing into the database engine. We describe one way to reach this goal by translating XSL stylesheets into algebraic expressions. Further, we present algorithms to optimize the template rule selection process and the algebraic expression resulting from the translation. Along the way, we present several undecidability results hinting at the complexity of the problem on hand. VLDB Application Servers and Associated Technologies. C. Mohan 2002 Application Servers and Associated Technologies. VLDB An Efficient Method for Performing Record Deletions and Updates Using Index Scans. C. 
Mohan 2002 "We present a method for efficiently performing deletions and updates of records when the records to be deleted or updated are chosen by a range scan on an index. The traditional method involves numerous unnecessary lock calls and traversals of the index from root to leaves, especially when the qualifying records' keys span more than one leaf page of the index. Customers have suffered performance losses from these inefficiencies and have complained about them. Our goal was to minimize the number of interactions with the lock manager, and the number of page fixes, comparison operations and, possibly, I/Os. Some of our improvements come from increased synergy between the query planning and data manager components of a DBMS. Our patented method has been implemented in DB2 V7 to address specific customer requirements. It has also been used to improve performance on the TPC-H benchmark." VLDB ProTDB: Probabilistic Data in XML. Andrew Nierman,H. V. Jagadish 2002 Whereas traditional databases manage only deterministic information, many applications that use databases involve uncertain data. This paper presents a Probabilistic Tree Data Base (ProTDB) to manage probabilistic data, represented in XML. Our approach differs from previous efforts to develop probabilistic relational systems in that we build a probabilistic XML database. This design is driven by application needs that involve data not readily amenable to a relational representation. XML data poses several modeling challenges: due to its structure, due to the possibility of uncertainty association at multiple granularities, and due to the possibility of missing and repeated sub-elements. We present a probabilistic XML model that addresses all of these challenges. We devise an implementation of XML query operations using our probability model, and demonstrate the efficiency of our implementation experimentally. We have used ProTDB to manage data from two application areas: protein chemistry data from the bioinformatics domain, and information extraction data obtained from the web using a natural language analysis system. We present a brief case study of the latter to demonstrate the value of probabilistic XML data management. VLDB eBusiness Standards and Architectures. Anil Nori 2002 eBusiness Standards and Architectures. VLDB Experiments on Query Expansion for Internet Yellow Page Services Using Web Log Mining. Yusuke Ohura,Katsumi Takahashi,Iko Pramudiono,Masaru Kitsuregawa 2002 "A tremendous amount of access log data is accumulated at many web sites. Several efforts to mine the data and apply the results to support end-users or to re-design the Web site's structure have been proposed. This paper describes our trial on access log utilization from a commercial yellow page service called ""iTOWNPAGE"". Our initial statistical analysis reveals that many users search various categories - even non-sibling ones in the provided hierarchy - together, or finish their search without any results that match their queries. To solve these problems, we first cluster user requests from the access logs using an enhanced K-means clustering algorithm and then apply them for query expansion. Our method includes two-step expansion that 1) recommends similar categories to the request, and 2) suggests related categories although they are nonsimilar in the existing category hierarchy. We also report some evaluations that show the effectiveness of the prototype system." VLDB Sideway Value Algebra for Object-Relational Databases.
Gultekin Özsoyoglu,Abdullah Al-Hamdani,Ismail Sengör Altingövde,Selma Ayse Özel,Özgür Ulusoy,Z. Meral Özsoyoglu 2002 Sideway Value Algebra for Object-Relational Databases. VLDB Incremental Maintenance for Non-Distributive Aggregate Functions. Themistoklis Palpanas,Richard Sidle,Roberta Cochrane,Hamid Pirahesh 2002 Incremental view maintenance is a well-known topic that has been addressed in the literature as well as implemented in database products. Yet, incremental refresh has been studied in depth only for a subset of the aggregate functions. In this paper we propose a general incremental maintenance mechanism that applies to all aggregate functions, including those that are not distributive over all operations. This class of functions is of great interest, and includes MIN/MAX, STDDEV, correlation, regression, XML constructor, and user defined functions. We optimize the maintenance of such views in two ways. First, by only recomputing the set of affected groups. Second, we extend the incremental infrastructure with work areas to support the maintenance of functions that are algebraic. We further optimize computation when multiple dissimilar aggregate functions are computed in the same view, and for special cases such as the maintenance of MIN/MAX, which are incrementally maintainable over insertions. We also address the important problem of incremental maintenance of views containing super-aggregates, including materialized OLAP cubes. We have implemented our algorithm on a prototype version of IBM DB2 UDB, and an experimental evaluation proves the validity of our approach. VLDB The Denodo Data Integration Platform. Alberto Pan,Juan Raposo,Manuel Álvarez,Paula Montoto,Vicente Orjales,Justo Hidalgo,Lucía Ardao,Anastasio Molano,Ángel Viña 2002 The world today is characterised by the proliferation of information sources available through media such as the WWW, databases, semi-structured files (e.g. XML documents), etc. Nevertheless, this information is usually scattered, heterogeneous and weakly structured, so it is difficult to process it automatically. DENODO Corporation has developed a mediator system for the construction of semi-structured and structured data integration applications. This system has already been used in the construction of several applications on the Internet and in corporate environments, which are currently deployed at several important Internet audience sites and large-sized business corporations. In this extended abstract, we present an overview of the system and we put forward some conclusions arising from our experience in building real-world data integration applications, focusing on some challenges that we believe require more attention from the research community. VLDB Structural Function Inlining Technique for Structurally Recursive XML Queries. Chang-Won Park,Jun-Ki Min,Chin-Wan Chung 2002 "Structurally recursive XML queries are an important query class that follows the structure of XML data. At present, it is difficult for XQuery to type and optimize structurally recursive queries because of polymorphic recursive functions involved in the queries. In this paper, we propose a new technique called structural function inlining which inlines recursive functions used in a query by making good use of available type information. Based on the technique, we develop a new approach to typing and optimizing structurally recursive queries. The new approach yields a more precise result type for a query.
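Editor's note: a minimal sketch of the "work area" idea for algebraic aggregates from the incremental-view-maintenance entry earlier on this line (the structural-function-inlining abstract interrupted here continues below). AVG is not distributive, but it becomes incrementally maintainable under inserts and deletes if a (sum, count) work area is stored per group. The class and names are assumed for illustration, not the DB2 prototype described in the paper.

    class AvgWorkArea:
        """Keep (sum, count) so AVG can be refreshed without rescanning base data."""
        def __init__(self):
            self.total = 0.0
            self.count = 0

        def insert(self, v):
            self.total += v
            self.count += 1

        def delete(self, v):
            self.total -= v
            self.count -= 1

        def value(self):
            return self.total / self.count if self.count else None

    wa = AvgWorkArea()
    for v in (10, 20, 30):
        wa.insert(v)
    wa.delete(20)
    print(wa.value())   # 20.0, maintained incrementally from the work area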
Furthermore, it produces an optimal algebraic expression for the query with respect to the type information. When a structurally recursive query is applied to non-recursive XML data, our approach translates the query into finitely nested iterations. We conducted several experiments with commonly used real-life and synthetic datasets. The experimental results show that the number of node lookups by our approach is on average 3.7 times and up to 279.8 times smaller than that by the XQuery core's current approach in evaluating structurally recursive queries." VLDB Structure and Value Synopses for XML Data Graphs. Neoklis Polyzotis,Minos N. Garofalakis 2002 All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows (1) path navigation and branching through the label structure of the XML data graph, and (2) predicates on the values of specific path/branch nodes, in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, we extend our earlier work on structural XSKETCH synopses and we propose an (augmented) XSKETCH synopsis model that exploits localized stability and value-distribution summaries (e.g., histograms) to accurately capture the complex correlation patterns that can exist between and across path structure and element values in the data graph. We develop a systematic XSKETCH estimation framework for complex path expressions with value predicates and we propose an efficient heuristic algorithm based on greedy forward selection for building an effective XSKETCH for a given amount of space (which is, in general, an NP-hard optimization problem). Implementation results with both synthetic and real-life data sets verify the effectiveness of our approach. VLDB Translating Web Data. Lucian Popa,Yannis Velegrakis,Renée J. Miller,Mauricio A. Hernández,Ronald Fagin 2002 Translating Web Data. VLDB A Case for Fractured Mirrors. Ravishankar Ramamurthy,David J. DeWitt,Qi Su 2002 The decomposition storage model (DSM) vertically partitions all attributes of a table and has excellent I/O behavior when the number of attributes accessed by a query is small. It also has a better cache footprint than the standard storage model (NSM) used by most database systems. However, DSM incurs a high cost in reconstructing the original tuple from its partitions. We first revisit some of the performance problems associated with DSM and suggest a simple indexing strategy and compare different reconstruction algorithms. Then we propose a new mirroring scheme, termed fractured mirrors, using both NSM and DSM models. This scheme combines the best aspects of both models, along with the added benefit of mirroring to better serve an ad hoc query workload. A prototype system has been built using the Shore storage manager, and performance is evaluated using queries from the TPC-H workload. VLDB Champagne: Data Change Propagation for Heterogeneous Information Systems. Ralf Rantzau,Carmen Constantinescu,Uwe Heinkel,Holger Meinecke 2002 "Flexible methods supporting the data interchange between autonomous information systems are important for today's increasingly heterogeneous enterprise IT infrastructures.
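Editor's note: a toy illustration of the NSM vs. DSM layouts contrasted in the fractured-mirrors entry earlier on this line (the Champagne abstract interrupted here continues below). The table, attribute names, and positional surrogates are invented for the example; this is not the paper's Shore-based prototype.

    # NSM: whole tuples stored together.
    rows_nsm = [(1, "Alice", 300), (2, "Bob", 120), (3, "Carol", 540)]

    # DSM: one vertical partition per attribute, aligned by position (the surrogate).
    dsm = {
        "id":    [1, 2, 3],
        "name":  ["Alice", "Bob", "Carol"],
        "sales": [300, 120, 540],
    }

    def reconstruct(dsm_table, pos, attrs):
        """Rebuild (part of) a tuple from its vertical partitions."""
        return tuple(dsm_table[a][pos] for a in attrs)

    # A query touching one attribute reads only that partition ...
    print(max(dsm["sales"]))
    # ... while full-tuple access pays the reconstruction cost DSM is criticized for.
    print(reconstruct(dsm, 1, ["id", "name", "sales"]))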
Updates, insertions, and deletions of data objects in autonomous information systems often have to trigger data changes in other autonomous systems, even if the distributed systems are not integrated into a global schema. We suggest a solution to this problem based on the propagation and transformation of data using several XML technologies. Our prototype manages dependencies between the schemas of distributed data sources and allows users to define and process arbitrary actions on changed data by manipulating all dependent data sources. The prototype comprises a propagation engine that interprets scripts based on a workflow specification language, a data dependency specification tool, a system administration tool, and a repository that stores all relevant information for these tools." VLDB Maintaining Data Privacy in Association Rule Mining. Shariq Rizvi,Jayant R. Haritsa 2002 Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. We investigate here, with respect to mining association rules, whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy. We present a scheme, based on probabilistic distortion of user data, that can simultaneously provide a high degree of privacy to the user and retain a high level of accuracy in the mining results. The performance of the scheme is validated against representative real and synthetic datasets. VLDB FAS - A Freshness-Sensitive Coordination Middleware for a Cluster of OLAP Components. Uwe Röhm,Klemens Böhm,Hans-Jörg Schek,Heiko Schuldt 2002 "Data warehouses offer a compromise between freshness of data and query evaluation times. However, a fixed preference ratio between these two variables is too undifferentiated. With our approach, clients submit a query together with an explicit freshness limit as a new Quality-of-Service parameter. Our architecture is a cluster of databases. The contribution of this article is the design, implementation, and evaluation of a coordination middleware. It schedules and routes updates and queries to cluster nodes, aiming at a high throughput of OLAP queries. The core of the middleware is a new protocol called FAS (Freshness-Aware Scheduling) with the following qualitative characteristics: (1) The requested freshness limit of queries is always met, and (2) data accessed within a transaction is consistent, independent of its freshness. Our evaluation shows that FAS has the following nice properties: OLAP query-evaluation times are close (within 10%) to the ones of an idealistic setup with no updates. FAS allows one to effectively trade 'up-to-dateness' for query performance. Even when all queries request fresh data, FAS clearly outperforms synchronous replication. Finally, mean response times are independent of the cluster size (up to 128 nodes)." VLDB GeMBASE: A Geometric Mediator for Brain Analysis with Surface Ensembles. Simone Santini,Amarnath Gupta 2002 GeMBASE: A Geometric Mediator for Brain Analysis with Surface Ensembles. VLDB Automation in Information Extraction and Data Integration. Sunita Sarawagi 2002 Automation in Information Extraction and Data Integration. VLDB ALIAS: An Active Learning led Interactive Deduplication System.
Sunita Sarawagi,Anuradha Bhamidipaty,Alok Kirpal,Chandra Mouli 2002 Deduplication, a key operation in integrating data from multiple sources, is a time-consuming, labor-intensive and domain-specific operation. We present our design of ALIAS, which uses a novel approach to ease this task by limiting the manual effort to inputting simple, domain-specific attribute similarity functions and interactively labeling a small number of record pairs. We describe how active learning is useful in selecting informative examples of duplicates and nonduplicates that can be used to train a deduplication function. ALIAS provides a mechanism for efficiently applying the function on large lists of records using a novel cluster-based execution model. VLDB Business Process Cockpit. Mehmet Sayal,Fabio Casati,Umeshwar Dayal,Ming-Chien Shan 2002 Business Process Cockpit. VLDB XMark: A Benchmark for XML Data Management. Albrecht Schmidt,Florian Waas,Martin L. Kersten,Michael J. Carey,Ioana Manolescu,Ralph Busse 2002 While standardization efforts for XML query languages have been progressing, researchers and users increasingly focus on the database technology that has to deliver on the new challenges that the abundance of XML documents poses to data management: validation, performance evaluation and optimization of XML query processors are the upcoming issues. Following a long tradition in database research, we provide a framework to assess the abilities of an XML database to cope with a broad range of different query types typically encountered in real-world scenarios. The benchmark can help both implementors and users to compare XML databases in a standardized application scenario. To this end, we offer a set of queries where each query is intended to challenge a particular aspect of the query processor. The overall workload we propose consists of a scalable document database and a concise, yet comprehensive set of queries which covers the major aspects of XML query processing ranging from textual features to data analysis queries and ad hoc queries. We complement our research with results we obtained from running the benchmark on several XML database platforms. These results are intended to give a first baseline and illustrate the state of the art. VLDB A Multi-version Cache Replacement and Prefetching Policy for Hybrid Data Delivery Environments. André Seifert,Marc H. Scholl 2002 "This paper introduces MICP, a novel multiversion integrated cache replacement and prefetching algorithm designed for efficient cache and transaction management in hybrid data delivery networks. MICP takes into account the dynamically and sporadically changing cost/benefit ratios of cached and/or disseminated object versions by making cache replacement and prefetching decisions sensitive to the objects' access probabilities, their position in the broadcast cycle, and their update frequency. Additionally, to eliminate the issue of a newly created or outdated, but re-cacheable, object version replacing a version that may not be reacquired from the server, MICP logically divides the client cache into two variable-sized partitions, namely the REC and the NON-REC partitions for maintaining re-cacheable and nonre-cacheable object versions, respectively. Besides judiciously selecting replacement victims, MICP selectively prefetches popular object versions from the broadcast channel in order to further improve transaction response time. A simulation study compares MICP with one offline and two online cache replacement and prefetching algorithms.
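Editor's note: an assumed sketch (not the ALIAS system itself) of the active-learning selection step described in the ALIAS entry earlier on this line: among unlabeled candidate record pairs, the user is asked to label the pair the current model is least certain about. Similarity functions, weights, and records are all made up; the MICP abstract interrupted here continues below.

    def similarity_features(a, b):
        # Toy similarity functions on (name, city) records.
        name_sim = 1.0 if a[0].lower() == b[0].lower() else 0.0
        city_sim = 1.0 if a[1].lower() == b[1].lower() else 0.0
        return (name_sim, city_sim)

    def score(features, weights=(0.7, 0.3)):
        # Stand-in for the learned deduplication function: weighted sum in [0, 1].
        return sum(w * f for w, f in zip(weights, features))

    def most_informative(pairs):
        # Uncertainty sampling: pick the pair whose score is closest to 0.5.
        return min(pairs, key=lambda ab: abs(score(similarity_features(*ab)) - 0.5))

    candidates = [(("Ann Smith", "Oslo"), ("Ann Smith", "Oslo")),
                  (("Ann Smith", "Oslo"), ("A. Smith", "Oslo")),
                  (("Bob Lee", "Rome"), ("Ann Smith", "Oslo"))]
    print(most_informative(candidates))   # the ambiguous pair is routed to the user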
Performance results for the workloads and system settings considered demonstrate that MICP improves transaction throughput rates by about 18.9% compared to the best performing online algorithm and it performs only 40.8% worse than an adapted version of the offline algorithm P." VLDB "Information Integration and XML in IBM's DB2." Patricia G. Selinger 2002 "Information Integration and XML in IBM's DB2." VLDB A Logical Framework for Scheduling Workflows under Resource Allocation Constraints. Pinar Senkul,Michael Kifer,Ismail Hakki Toroslu 2002 A workflow consists of a collection of coordinated tasks designed to carry out a well-defined complex process, such as catalog ordering, trip planning, or a business process in an enterprise. Scheduling of workflows is a problem of finding a correct execution sequence for the workflow tasks, i.e., execution that obeys the constraints that embody the business logic of the workflow. Research on workflow scheduling has largely concentrated on temporal constraints, which specify correct ordering of tasks. Another important class of constraints -- those that arise from resource allocation -- has received relatively little attention in workflow modeling. Since typically resources are not limitless and cannot be shared, scheduling of a workflow execution involves decisions as to which resources to use and when. In this work, we present a framework for workflows whose correctness is given by a set of resource allocation constraints and develop techniques for scheduling such systems. Our framework integrates Concurrent Transaction Logic (CTR) with constraint logic programming (CLP), yielding a new logical formalism, which we call Concurrent Constraint Transaction Logic, or CCTR. VLDB Maintaining Coherency of Dynamic Data in Cooperating Repositories. Shetal Shah,Krithi Ramamritham,Prashant J. Shenoy 2002 "In this paper, we consider techniques for disseminating dynamic data--such as stock prices and real-time weather information--from sources to a set of repositories. We focus on the problem of maintaining coherency of dynamic data items in a network of cooperating repositories. We show that cooperation among repositories-- where each repository pushes updates of data items to other repositories--helps reduce system-wide communication and computation overheads for coherency maintenance. However, contrary to intuition, we also show that increasing the degree of cooperation beyond a certain point can, in fact, be detrimental to the goal of maintaining coherency at low communication and computational overheads. We present techniques (i) to derive the ""optimal"" degree of cooperation among repositories, (ii) to construct an efficient dissemination tree for propagating changes from sources to cooperating repositories, and (iii) to determine when to push an update from one repository to another for coherency maintenance. We evaluate the efficacy of our techniques using real-world traces of dynamically changing data items (specifically, stock prices) and show that careful dissemination of updates through a network of cooperating repositories can substantially lower the cost of coherency maintenance." VLDB Database Tuning: Principles, Experiments, and Troubleshooting Techniques. Dennis Shasha,Philippe Bonnet 2002 Tuning your database for optimal performance means more than following a few short steps in a vendor-specific guide. 
For maximum improvement, you need a broad and deep knowledge of basic tuning principles, the ability to gather data in a systematic way, and the skill to make your system run faster. This is an art as well as a science, and Database Tuning: Principles, Experiments, and Troubleshooting Techniques will help you develop portable skills that will allow you to tune a wide variety of database systems on a multitude of hardware and operating systems. Further, these skills, combined with the scripts provided for validating results, are exactly what you need to evaluate competing database products and to choose the right one. VLDB EOS: Exactly-Once E-Service Middleware. German Shegalov,Gerhard Weikum,Roger S. Barga,David B. Lomet 2002 EOS: Exactly-Once E-Service Middleware. VLDB SELF-SERV: A Platform for Rapid Composition of Web Services in a Peer-to-Peer Environment. Quan Z. Sheng,Boualem Benatallah,Marlon Dumas,Eileen Oi-Yan Mak 2002 SELF-SERV: A Platform for Rapid Composition of Web Services in a Peer-to-Peer Environment. VLDB Information Management Challenges from the Aerospace Industry. Suryanarayana M. Sripada 2002 The aerospace industry poses significant challenges to information management unlike any other industry. Data management challenges arising from different segments of the aerospace business are identified through illustrative scenarios. These examples and challenges could provide focus and stimulus to further research in information management. VLDB Efficient Exploration of Large Scientific Databases. Etzard Stolte,Gustavo Alonso 2002 One of the challenging aspects of scientific data repositories is how to efficiently explore the catalogues that describe the data. We have encountered such a problem while developing HEDC, HESSI Experimental data center, a multi-terabyte repository built for the recently launched HESSI satellite. In HEDC, scientific users will soon be confronted with a catalogue of many million tuples. In this paper we present a novel technique that allows users to efficiently explore such a large data space in an interactive manner. Our approach is to store a copy of relevant fields in segmented and wavelet encoded views that are streamed to specialized clients. These clients use approximated data and adaptive decoding techniques to allow users to quickly visualize the search space. In the paper we describe how this approach reduces from hours to seconds the time needed to generate meaningful visualizations of millions of tuples. VLDB Adaptive Index Structures. Yufei Tao,Dimitris Papadias 2002 Traditional indexes aim at optimizing the node accesses during query processing, which, however, does not necessarily minimize the total cost due to the possibly large number of random accesses. In this paper, we propose a general framework for adaptive indexes that improve overall query cost. The performance gain is achieved by allowing index nodes to contain a variable number of disk pages. Update algorithms dynamically re-structure adaptive indexes depending on the data and query characteristics. Extensive experiments show that adaptive B- and R-trees significantly outperform their conventional counterparts, while incurring minimal update overhead. VLDB Continuous Nearest Neighbor Search. Yufei Tao,Dimitris Papadias,Qiongmao Shen 2002 "A continuous nearest neighbor query retrieves the nearest neighbor (NN) of every point on a line segment (e.g., ""find all my nearest gas stations during my route from point s to point e"").
The result contains a set of (point, interval) tuples, such that point is the NN of all points in the corresponding interval. Existing methods for continuous nearest neighbor search are based on the repetitive application of simple NN algorithms, which incurs significant overhead. In this paper we propose techniques that solve the problem by performing a single query for the whole input segment. As a result, the cost, depending on the query and dataset characteristics, may drop by orders of magnitude. In addition, we propose analytical models for the expected size of the output, as well as the cost of query processing, and extend our techniques to several variations of the problem." VLDB GnatDb: A Small-Footprint, Secure Database System. Radek Vingralek 2002 This paper describes GnatDb, which is an embedded database system that provides protection against both accidental and malicious corruption of data. GnatDb is designed to run on a wide range of appliances, some of which have very limited resources. Therefore, its design is heavily driven by the need to reduce resource consumption. GnatDb employs atomic and durable updates to protect the data against accidental corruption. It prevents malicious corruption of the data using standard cryptographic techniques that leverage the underlying log-structured storage model. We show that the total memory consumption of GnatDb, which includes the code footprint, the stack and the heap, does not exceed 11 KB, while its performance on a typical appliance platform remains at an acceptable level. VLDB Self-tuning Database Technology and Information Services: from Wishful Thinking to Viable Engineering. Gerhard Weikum,Axel Mönkeberg,Christof Hasse,Peter Zabback 2002 "Automatic tuning has been an elusive goal for database technology for a long time and is becoming a pressing issue for modern E-services. This paper reviews and assesses the advances that have been made on this important subject during the last ten years. A major conclusion is that self-tuning database technology should be based on the paradigm of a feedback control loop, but is also bound to build on mathematical models and their proper engineering into system components. In addition, the composition of information services into truly self-tuning, higher-level E-services may require a radical departure towards simpler, highly componentized software architectures with narrow interfaces between RISC-style ""autonomic"" components." VLDB StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. Yunyue Zhu,Dennis Shasha 2002 Consider the problem of monitoring tens of thousands of time series data streams in an online fashion and making decisions based on them. In addition to single stream statistics such as average and standard deviation, we also want to find high correlations among all pairs of streams. A stock market trader might use such a tool to spot arbitrage opportunities. This paper proposes efficient methods for solving this problem based on Discrete Fourier Transforms and a three level time interval hierarchy. Extensive experiments on synthetic data and real world financial trading data show that our algorithm beats the direct computation approach by several orders of magnitude. It also improves on previous Fourier Transform approaches by allowing the efficient computation of time-delayed correlation over any size sliding window and any time delay. Correlation also lends itself to an efficient grid-based data structure.
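Editor's note: a small sketch of the basic statistic the StatStream entry above monitors, namely Pearson correlation over a sliding window maintained incrementally from running sums (the abstract continues below). The paper's DFT-based approximation and grid structure are not shown here; class name and window size are assumptions for the example.

    from collections import deque
    from math import sqrt

    class WindowedCorrelation:
        def __init__(self, size):
            self.size = size
            self.pairs = deque()
            self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

        def update(self, x, y):
            self.pairs.append((x, y))
            self.sx += x; self.sy += y
            self.sxx += x * x; self.syy += y * y; self.sxy += x * y
            if len(self.pairs) > self.size:            # slide the window
                ox, oy = self.pairs.popleft()
                self.sx -= ox; self.sy -= oy
                self.sxx -= ox * ox; self.syy -= oy * oy; self.sxy -= ox * oy

        def correlation(self):
            n = len(self.pairs)
            cov = n * self.sxy - self.sx * self.sy
            vx = n * self.sxx - self.sx ** 2
            vy = n * self.syy - self.sy ** 2
            return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else None

    w = WindowedCorrelation(size=3)
    for x, y in [(1, 2), (2, 4), (3, 6), (4, 9)]:
        w.update(x, y)
    print(w.correlation())    # correlation over the most recent 3 observations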
The result is the first algorithm that we know of to compute correlations over thousands of data streams in real time. The algorithm is incremental, has fixed response time, and can monitor the pairwise correlations of 10,000 streams on a single PC. The algorithm is embarrassingly parallelizable. VLDB Compressed Accessibility Map: Efficient Access Control for XML. Ting Yu,Divesh Srivastava,Laks V. S. Lakshmanan,H. V. Jagadish 2002 XML is widely regarded as a promising means for data representation, integration, and exchange. As companies transact business over the Internet, the sensitive nature of the information mandates that access must be provided selectively, using sophisticated access control specifications. Using the specification directly to determine if a user has access to a specific XML data item can hence be extremely inefficient. The alternative of fully materializing, for each data item, the users authorized to access it can be space-inefficient. In this paper, we propose a space- and time-efficient solution to the access control problem for XML data. Our solution is based on a novel notion of a compressed accessibility map (CAM), which compactly identifies the XML data items to which a user has access, by exploiting structural locality of accessibility in tree-structured data. We present a CAM lookup algorithm for determining if a user has access to a data item; it takes time proportional to the product of the depth of the item in the XML data and the logarithm of the CAM size. VLDB Experience Report: Exploiting Advanced Database Optimization Features for Large-Scale SAP R/3 Installations. Bernhard Zeller,Alfons Kemper 2002 "The database volumes of enterprise resource planning (ERP) systems like SAP R/3 are growing at a tremendous rate and some of them have already reached a size of several Terabytes. OLTP (Online Transaction Processing) databases of this size are hard to maintain and tend to perform poorly. Therefore most database vendors have implemented new features like horizontal partitioning to optimize such mission critical applications. Horizontal partitioning was already investigated in detail in the context of shared nothing distributed database systems but today's ERP systems mostly use a centralized database with a shared everything architecture. In this work, we therefore investigate how an SAP R/3 system performs when the data in the underlying database is partitioned horizontally. Our results show that especially joins, in parallel executed statements, and administrative tasks benefit greatly from horizontal partitioning while the resulting small increase in the execution times of insertions, deletions and updates is tolerable. These positive results have prompted the SAP cooperation partners to pursue a partitioned data layout in some of their largest installed productive systems." SIGMOD Record Phoenix Project: Fault-Tolerant Applications. Roger S. Barga,David B. Lomet 2002 After a system crash, databases recover to the last committed transaction, but applications usually either crash or cannot continue. The purpose of Phoenix is to enable application state to persist across system crashes, transparent to the application program. This simplifies application programming, reduces operational costs, masks failures from users, and increases application availability, which is critical in many scenarios, e.g., e-commerce. Within the Phoenix project, we have explored how to provide application recovery efficiently and transparently via redo logging.
This paper describes the conceptual framework for the Phoenix project, and the software infrastructure that we are building. SIGMOD Record Book Review Column. Karl Aberer 2002 Book Review Column. SIGMOD Record Book Review Column. Karl Aberer 2002 Book Review Column. SIGMOD Record Book Review Column. Karl Aberer 2002 Book Review Column. SIGMOD Record Book Review Column. Karl Aberer 2002 Book Review Column. SIGMOD Record A Framework for Semantic Gossiping. Karl Aberer,Philippe Cudré-Mauroux,Manfred Hauswirth 2002 Today the problem of semantic interoperability in information search on the Internet is solved mostly by means of centralization, both at a system and at a logical level. This approach has been successful to a certain extent. Peer-to-peer systems as a new brand of system architectures indicate that the principle of decentralization might offer new solutions to many problems that scale well to very large numbers of users. In this paper we outline how the peer-to-peer system architectures can be applied to tackle the problem of semantic interoperability in the large, driven in a bottom-up manner by the participating peers. Such a system can readily be used to study semantic interoperability as a global scale phenomenon taking place in a social network of information sharing peers. SIGMOD Record Data Mining: Concepts and Techniques - Book Review. Fernando Berzal Galiano,Nicolás Marín 2002 Data Mining: Concepts and Techniques - Book Review. SIGMOD Record The ρ Operator: Discovering and Ranking Associations on the Semantic Web. Kemafor Anyanwu,Amit P. Sheth 2002 The ρ Operator: Discovering and Ranking Associations on the Semantic Web. SIGMOD Record Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce - Book Review. Antonio Badia 2002 Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce - Book Review. SIGMOD Record Supply Chain Infrastructures: System Integration and Information Sharing. Michael O. Ball,Meng Ma,Louiqa Raschid,Zhengying Zhao 2002 The need for supply chain integration (SCI) methodologies has been increasing as a consequence of the globalization of production and sales, and the advancement of enabling information technologies. In this paper, we describe our experience with implementing and modeling SCIs. We present the integration architecture and the software components of our prototype implementation. We then discuss a variety of information sharing methodologies. Then, within the framework of a multi-echelon supply chain process model spanning multiple organizations, we summarize research on the benefits of intra-organizational knowledge sharing, and we discuss performance scalability. SIGMOD Record "Report on the EDBT'02 Panel on Scientific Data Integration." Omar Boucelma,Silvana Castano,Carole A. Goble,Vanja Josifovski,Zoé Lacroix,Bertram Ludäscher 2002 "Report on the EDBT'02 Panel on Scientific Data Integration." SIGMOD Record The Role of B2B Engines in B2B Integration Architectures. Christoph Bussler 2002 Semantic B2B Integration architectures must enable enterprises to communicate standards-based B2B events like purchase orders with any potential trading partner. This requires not only back end application integration capabilities to integrate with e.g.
enterprise resource planning (ERP) systems as the company-internal source and destination of B2B events, but also a capability to implement every necessary B2B protocol like Electronic Data Interchange (EDI) and RosettaNet, as well as more generic capabilities like web services (WS). This paper shows the placement and functionality of B2B engines in semantic B2B integration architectures that implement a generic framework for modeling and executing any B2B protocol. A detailed discussion shows how a B2B engine can provide the necessary abstractions to implement any standard-based B2B protocol or any trading partner specific specialization. SIGMOD Record A Conceptual Architecture for Semantic Web Enabled Web Services. Christoph Bussler,Dieter Fensel,Alexander Maedche 2002 Semantic Web Enabled Web Services (SWWS) will transform the web from a static collection of information into a distributed device of computation on the basis of Semantic technology, making content within the World Wide Web machine-processable and machine-interpretable. Semantic Web Enabled Web Services will allow the automatic discovery, selection and execution of inter-organization business logic, making areas like dynamic supply chain composition a reality. In this paper we introduce the vision of Semantic Web Enabled Web Services, describe requirements for building semantics-driven web services and sketch a first draft of a conceptual architecture for implementing semantic web enabled web services. SIGMOD Record Querying Multiple Bioinformatics Data Sources: Can Semantic Web Research Help? David Buttler,Matthew Coleman,Terence Critchlow,Renato Fileto,Wei Han,Ling Liu,Calton Pu,Daniel Rocco,Li Xiong 2002 Querying Multiple Bioinformatics Data Sources: Can Semantic Web Research Help? SIGMOD Record From Ternary Relationship to Relational Tables: A Case Against Common Beliefs. Rafael Camps 2002 From Ternary Relationship to Relational Tables: A Case Against Common Beliefs. SIGMOD Record An Active Functionality Service for E-Business Applications. Mariano Cilia,Alejandro P. Buchmann 2002 Service based architectures are a powerful approach to meet the fast evolution of business rules and the corresponding software. An active functionality service that detects events and invokes the appropriate business rules is a critical component of such a service-based middleware architecture. In this paper we present an active functionality service that is capable of detecting events in heterogeneous environments, uses an integral ontology-based approach for the semantic interpretation of heterogeneous events and data, and provides notifications through a publish/subscribe notification mechanism. The power of this approach is illustrated with the help of an auction application and through the personalization of car and driver portals in Internet-enabled vehicles. SIGMOD Record Supporting Global User Profiles Through Trusted Authorities. Ibrahim Cingil 2002 "Personalization generally refers to making a Web site more responsive to the unique and individual needs of each user. We argue that for personalization to work effectively, detailed and interoperable user profiles should be globally available for authorized sites, and these profiles should dynamically reflect the changes in user interests. Creating user profiles from user click-stream data seems to be an effective way of generating detailed and dynamic user profiles.
However a user profile generated in this way is available only on the computer where the user accesses his browser, and is inaccessible when the same user works on a different computer. On the other hand, the integration of the Internet with telecommunication networks has made it possible for the users to connect to the Web with a variety of mobile devices as well as desktops. This requires that user profiles should be available to any desktop or mobile device on the Internet that users choose to work with. In this paper, we address these problems through the concept of ""Trusted Authority"". A user agent at the client side that captures the user click stream dynamically generates a navigational history 'log' file in Extensible Markup Language (XML). This log file is then used to produce the 'user profiles' in Resource Description Framework (RDF). The user's right to privacy is provided through the Platform for Privacy Preferences (P3P) standard. User profiles are uploaded to the trusted authority and served the next time the user connects to the Web. The trusted authority concept, serving as a namespace qualifier, provides globally unique userid/password identification for users. Furthermore, user profiles dynamically reflect the changes in their interests since the data generated while they are surfing the Web contribute to their profile. Also, since the user profiles are defined in RDF, they are interoperable and available to any type of authorized device on the Internet." SIGMOD Record Advances in Databases and Information Systems (ADBIS). Albertas Caplinskas,Johann Eder,Olegas Vasilecas 2002 Advances in Databases and Information Systems (ADBIS). SIGMOD Record Report on the 24th European Colloquium on Information Retrieval Research. Fabio Crestani,Mark Girolami 2002 Report on the 24th European Colloquium on Information Retrieval Research. SIGMOD Record Research Activities in Database Management and Information Retrieval at the University of Illinois at Chicago. Isabel F. Cruz,Ashfaq A. Khokhar,Bing Liu,A. Prasad Sistla,Ouri Wolfson,Clement T. Yu 2002 Research Activities in Database Management and Information Retrieval at the University of Illinois at Chicago. SIGMOD Record Semantic B2B Integration: Issues in Ontology-based Applications. "Zhan Cui,Dean M. Jones,Paul O'Brien" 2002 "Solving queries to support e-commerce transactions can involve retrieving and integrating information from multiple information resources. Often, users don't care which resources are used to answer their query. In such situations, the ideal solution would be to hide from the user the details of the resources involved in solving a particular query. An example would be providing seamless access to a set of heterogeneous electronic product catalogues. There are many problems that must be addressed before such a solution can be provided. In this paper, we discuss a number of these problems, indicate how we have addressed these and go on to describe the proof-of-concept demonstration system we have developed." SIGMOD Record "Guest Editor's Introduction." Asuman Dogac 2002 "Guest Editor's Introduction." SIGMOD Record Agents, Trust, and Information Access on the Semantic Web. Timothy W. Finin,Anupam Joshi 2002 Agents, Trust, and Information Access on the Semantic Web. SIGMOD Record SQL/XML is Making Good Progress. Andrew Eisenberg,Jim Melton 2002 "Not very long ago, we discussed the creation of a new part of SQL, XML-Related Specifications (SQL/XML), in this column [1].
At the time, we referred to the work that had been done as ""infrastructure"". We are pleased to be able to say that significant progress has been made, and SQL/XML [2] is now going out for the first formal stage of processing, Final Committee Draft (FCD) ballot, in ISO/IEC JTC1. In our previous column, we described the mapping of SQL ⟨identifier⟩s to XML Names, SQL data types to XML Schema data types, and SQL values to XML values. There have been a few small corrections and enhancements in these areas, but for the most part the descriptions in our previous column are still accurate. The new work that we will discuss in this column comes in three parts. The first part provides a mapping from a single table, all tables in a schema, or all tables in a catalog to an XML document. The second of these parts includes the creation of an XML data type in SQL and adds functions that create values of this new type. These functions allow a user to produce XML from existing SQL data. Finally, the ""infrastructure"" work that we described in our previous article included the mapping of SQL's predefined data types to XML Schema data types. This mapping has been extended to include the mapping of domains, distinct types, row types, arrays, and multisets. The FCD ballot that we mentioned began in early April. This will allow the comments contained in the ballot responses to be discussed at the Editing Meeting in September or October of this year. We expect the Editing Meeting to recommend progression to Final Draft International Standard (FDIS) ballot, which suggests that an International Standard will be published by the middle of 2003." SIGMOD Record Report on the Semantic Web Workshop at WWW 2002. Martin R. Frank,Natalya Fridman Noy,Steffen Staab 2002 Report on the Semantic Web Workshop at WWW 2002. SIGMOD Record Letter to SIGMOD Record Editor. Nazih Elderini 2002 Letter to SIGMOD Record Editor. SIGMOD Record A Multi-Agent System Infrastructure for Software Component Market-Place: An Ontological Perspective. Riza Cenk Erdur,Oguz Dikenelli 2002 In this paper, we introduce a multi-agent system architecture and an implemented prototype for a software component market-place. We emphasize the ontological perspective by discussing the ontology modeling for the component market-place, UML extensions for ontology modeling, and the idea of ontology transfer, which enables the multi-agent system to adapt itself to dynamically changing ontologies. SIGMOD Record Combining Fuzzy Information: an Overview. Ronald Fagin 2002 Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. In this overview, we discuss and compare algorithms for determining the top k objects, that is, k objects with the highest overall grades. SIGMOD Record Conceptual Modeling and Specification Generation for B2B Business Process based on ebXML. HyoungDo Kim 2002 In order to support dynamic setup of business processes among independent organizations, a formal standard schema for describing the business processes is basically required.
The ebXML framework provides such a specification schema called BPSS (Business Process Specification Schema) which is available in two stand-alone representations: a UML version, and an XML version. The former, however, is not intended for the direct creation of business process specifications, but for defining specification elements and their relationships required for creating an ebXML-compliant business process specification. For this reason, it is very important to support conceptual modeling that is well organized and directly matched with major modeling concepts. This paper deals with how to represent and manage B2B business processes using UML-compliant diagrams. The major challenge is to organize UML diagrams in a natural way that is well suited with the business process meta-model and then to transform the diagrams into an XML version. This paper demonstrates the usefulness of conceptually modeling business processes by prototyping a business process editor tool called ebDesigner. SIGMOD Record Data Mining: Practical Machine Learning Tools and Techniques - Book Review. James Geller 2002 Data Mining: Practical Machine Learning Tools and Techniques - Book Review. SIGMOD Record Report on the 18th British National Conference on Databases (BNCOD). Carole A. Goble,Brian J. Read 2002 "The annual series of the British National Conference on Databases has been a forum for UK database practitioners and a focus for database research since 1981. In recent years, interest in this conference series has extended well beyond the UK.BNCOD 2001, the 18th conference in the series, was held at the CLRC Rutherford Appleton Laboratory (RAL) from 9th -11th July 2001. RAL hosts national large-scale facilities for advanced scientific research. The Information Technology Department collaborates with the Laboratory's data centres that manage terabytes of data in remote sensing, high-energy physics and astronomy.BNCOD 2001 included scientific papers, invited talks, a panel and a poster session. The BNCOD Programme Committee, chaired by Professor Carole Goble of Manchester University, selected for presentation at the meeting eleven papers, about one third of those papers submitted. Contributors were drawn from the Netherlands, Germany, Sweden, Canada and USA, as well as the UK. The audience of 60 attendees was chiefly drawn from the UK database community. The Proceedings are published by Springer-Verlag in the Lecture Notes in Computer Science series, and are available online at: http://link.springer.de/link/service/series/0558/tocs/t2097.htm." SIGMOD Record The Grid: An Application of the Semantic Web. Carole A. Goble,David De Roure 2002 "The Grid is an emerging platform to support on-demand ""virtual organisations"" for coordinated resource sharing and problem solving on a global scale. The application thrust is large-scale scientific endeavour, and the scale and complexity of scientific data presents challenges for databases. The Grid is beginning to exploit technologies developed for Web Services and to realise its potential it also stands to benefit from Semantic Web technologies; conversely, the Grid and its scientific users provide application pull which will benefit the Semantic Web." SIGMOD Record Data Warehousing and Business Intelligence for E-Commerce - Book Review. Frank G. Goethals 2002 Data Warehousing and Business Intelligence for E-Commerce - Book Review. SIGMOD Record A Study on the Management of Semantic Transaction for Efficient Data Retrieval. Shi-Ming Huang,Irene S. Y. 
Kwan,Chih-He Li 2002 Mobile computing technology is developing rapidly due to the advantages of information access through mobile devices and the need to retrieve information at remote locations. However, many obstacles within the discipline of wireless computing are yet to be resolved. One of the most significant of these issues is the speed of data retrieval, which directly affects the performance of mobile database applications. To remedy this problem, we propose here a revised methodology focusing on the management of mobile transactions. This paper investigates an extended semantic-based transaction management mechanism, and applies a model-based approach for developing a simulation model to evaluate the performance of our approach. SIGMOD Record Contracting in the Days of eBusiness. Wolfgang Hümmer,Wolfgang Lehner,Hartmut Wedekind 2002 Putting electronic business on a sound foundation --- model theoretically as well as technologically --- has to be seen as a central challenge for research as well as for commercial development. This paper concentrates on the discovery and the negotiation phase of concluding an agreement based on a contract. We present a methodology for moving seamlessly from a many-to-many relationship in the discovery phase to a one-to-one relationship in the contract negotiation phase. Making the content of the contracts persistent is achieved by reconstructing contract templates by means of mereologic (logic of the whole-part relation). Possibly nested sub-structures of the contract template are taken as a basis for negotiation in a dialogical way. For the negotiation itself the contract templates are extended by implications (logical) and sequences (topical). SIGMOD Record What Will Be - Book Review. Paul W. P. J. Grefen 2002 What Will Be - Book Review. SIGMOD Record Parameterized Complexity for the Database Theorist. Martin Grohe 2002 Parameterized Complexity for the Database Theorist. SIGMOD Record Emergent Semantics and the Multimedia Semantic Web. William I. Grosky,D. V. Sreenath,Farshad Fotouhi 2002 "It is well known that context plays an important role in the meaning of a work of art. This paper addresses the dynamic context of a collection of linked multimedia documents, of which the web is a perfect example. Contextual document semantics emerge through identification of various users' browsing paths through this multimedia collection. In this paper, we present techniques that use multimedia information as part of this determination. Some implications of our approach are that the author of a webpage cannot completely define that document's semantics and that semantics emerge through use." SIGMOD Record Cluster Validity Methods: Part I. Maria Halkidi,Yannis Batistakis,Michalis Vazirgiannis 2002 Clustering is an unsupervised process since there are no predefined classes and no examples that would indicate grouping properties in the data set. The majority of the clustering algorithms behave differently depending on the features of the data set and the initial assumptions for defining groups. Therefore, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity. Evaluating and assessing the results of a clustering algorithm is the main subject of cluster validity. In this paper we present a review of clustering validity approaches and methods. More specifically, Part I of the paper discusses the cluster validity approaches based on external and internal criteria.
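The cluster-validity survey above (Part I; relative criteria follow in Part II below) groups validity measures into external, internal, and relative criteria. As a rough illustration of one widely used internal-style criterion, and not code from those papers, the sketch below computes a silhouette score: for each point it compares the mean distance to its own cluster with the mean distance to the nearest other cluster, so a well-separated clustering scores close to 1. The function name and the sample points are invented for this sketch.

from math import dist  # Python 3.8+

def silhouette(points, labels):
    # Mean silhouette over all points: (b - a) / max(a, b), where a is the mean
    # distance to the point's own cluster and b is the mean distance to the
    # nearest other cluster.
    clusters = {}                          # label -> list of member indices
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    scores = []
    for i, l in enumerate(labels):
        own = [j for j in clusters[l] if j != i]
        if not own or len(clusters) < 2:   # singleton or single-cluster case: 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in clusters[m]) / len(clusters[m])
                for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated 2-D clusters score near 1; a poor labeling would score much lower.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))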
SIGMOD Record Clustering Validity Checking Methods: Part II. Maria Halkidi,Yannis Batistakis,Michalis Vazirgiannis 2002 Clustering results validation is an important topic in the context of pattern recognition. We review approaches and systems in this context. In the first part of this paper we presented clustering validity checking approaches based on internal and external criteria. In the second, current part, we present a review of clustering validity approaches based on relative criteria. Also we discuss the results of an experimental study based on widely known validity indices. Finally the paper illustrates the issues that are under-addressed by the recent approaches and proposes the research directions in the field. SIGMOD Record Report on the ACM Fourth International Workshop on Data Warehousing and OLAP (DOLAP 2001). Joachim Hammer 2002 Report on the ACM Fourth International Workshop on Data Warehousing and OLAP (DOLAP 2001). SIGMOD Record "Treasurer's Message." Joachim Hammer 2002 "Treasurer's Message." SIGMOD Record Databases and Transaction Processing: An Application-Oriented Approach - Book Review. Alfons Kemper 2002 Databases and Transaction Processing: An Application-Oriented Approach - Book Review. SIGMOD Record Interviewing During a Tight Job Market. Qiong Luo,Zachary G. Ives 2002 Interviewing During a Tight Job Market. SIGMOD Record Toward Autonomic Computing with DB2 Universal Database. Sam Lightstone,Guy M. Lohman,Daniel C. Zilio 2002 "As the cost of both hardware and software falls due to technological advancements and economies of scale, the cost of ownership for database applications is increasingly dominated by the cost of people to manage them. Databases are growing rapidly in scale and complexity, while skilled database administrators (DBAs) are becoming rarer and more expensive. This paper describes the self-managing or autonomic technology in IBM's DB2 Universal Database® for UNIX and Windows to illustrate how self-managing technology can reduce complexity, helping to reduce the total cost of ownership (TCO) of DBMSs and improve system performance." SIGMOD Record Report on the NSF Workshop on Building an Infrastructure for Mobile and Wireless Systems. Birgitta König-Ries,Kia Makki,S. A. M. Makki,Charles E. Perkins,Niki Pissinou,Peter L. Reiher,Peter Scheuermann,Jari Veijalainen,Ouri Wolfson 2002 Report on the NSF Workshop on Building an Infrastructure for Mobile and Wireless Systems. SIGMOD Record "Editor's Notes." Ling Liu 2002 "Editor's Notes." SIGMOD Record MPEG-7 and Multimedia Database Systems. Harald Kosch 2002 The Multimedia Description Standard MPEG-7 is an International Standard since February 2002. It defines a huge set of description classes for multimedia content, for its creation and its communication. This article investigates what MPEG-7 means to Multimedia Database Systems (MMDBSs) and vice versa. We argue that MPEG-7 has to be considered complementary to, rather than competing with, data models employed in MMDBSs. Finally we show by an example scenario how these technologies can reasonably complement one another. SIGMOD Record "Editor's Notes." Ling Liu 2002 "Editor's Notes." SIGMOD Record Report on the Web Dynamics Workshop at WWW 2002. Mark Levene,Alexandra Poulovassilis 2002 Report on the Web Dynamics Workshop at WWW 2002. SIGMOD Record "Editor's Notes." Ling Liu 2002 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2002 "Editor's Notes." SIGMOD Record Introduction to Constraint Databases - Book Review. 
Bart Kuijpers 2002 Introduction to Constraint Databases - Book Review. SIGMOD Record A Brief Survey of Web Data Extraction Tools. Alberto H. F. Laender,Berthier A. Ribeiro-Neto,Altigran Soares da Silva,Juliana S. Teixeira 2002 In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities which make a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data. SIGMOD Record Conceptual Model of Web Service Reputation. E. Michael Maximilien,Munindar P. Singh 2002 Current Web services standards enable publishing service descriptions and finding services on a match based on criteria such as method signatures or service category. However, current approaches provide no basis for selecting a good service or for comparing ratings of services. We describe a conceptual model for reputation using which reputation information can be organized and shared and service selection can be facilitated and automated. SIGMOD Record "Guest Editor's Introduction." Robert Meersman,Amit P. Sheth 2002 "Guest Editor's Introduction." SIGMOD Record An Early Look at XQuery. Jim Melton,Andrew Eisenberg 2002 An Early Look at XQuery. SIGMOD Record SQL/MED - A Status Report. Jim Melton,Jan-Eike Michels,Vanja Josifovski,Krishna G. Kulkarni,Peter M. Schwarz 2002 SQL/MED - A Status Report. SIGMOD Record The Design of a Retrieval Technique for High-Dimensional Data on Tertiary Storage. Ratko Orlandic,Jack Lukaszuk,Craig Swietlik 2002 "In high-energy physics experiments, large particle accelerators produce enormous quantities of data, measured in hundreds of terabytes or petabytes per year, which are deposited onto tertiary storage. The experiments are designed to study the collisions of fundamental particles, called ""events"", each of which is represented as a point in a multi-dimensional universe. In these environments, the best retrieval performance can be achieved only if the data is clustered on the tertiary storage by all searchable attributes of the events. Since the number of these attributes is high, the underlying data-management facility must be able to cope with extremely large volumes and very high dimensionalities of data at the same time. The proposed indexing technique is designed to facilitate both clustering and efficient retrieval of high-dimensional data on tertiary storage. The structure uses an original space-partitioning scheme, which has numerous advantages over other space-partitioning techniques. While the main objective of the design is to support high-energy physics experiments, the proposed solution is appropriate for many other scientific applications." SIGMOD Record Mining the World Wide Web: An Information Search Approach - Book Review. Aris M.
Ouksel 2002 Mining the World Wide Web: An Information Search Approach - Book Review. SIGMOD Record Automata Theory for XML Researchers. Frank Neven 2002 Automata Theory for XML Researchers. SIGMOD Record "Chair's Message." M. Tamer Özsu 2002 "Chair's Message." SIGMOD Record "Chair's Message." M. Tamer Özsu 2002 "Chair's Message." SIGMOD Record "Chair's Message." M. Tamer Özsu 2002 "Chair's Message." SIGMOD Record A Pictorial Query Language for Querying Geographic Databases using Positional and OLAP Operators. Elaheh Pourabbas,Maurizio Rafanelli 2002 The authors propose a declarative Pictorial Query Language (called PQL) that is able to express queries on an Object-Oriented geographic database by drawing the features which form the query. These features refer to the classic ones of a geographic environment (geo-null, geo-points, geo-polyline, and geo-region) and define the alphabet of the above mentioned language. This language, extended with respect to a previous one, considers twelve positional operators and a set of their specifications. Moreover, the possibility to use the mentioned language to query multidimensional databases is discussed. Finally, the characteristics of the language are illustrated by a query example. SIGMOD Record Business Data Management for B2B Electronic Commerce. Christoph Quix,Mareike Schoop,Manfred A. Jeusfeld 2002 Business Data Management for B2B Electronic Commerce. SIGMOD Record Research in Information Management at Dublin City University. Mark Roantree,Alan F. Smeaton 2002 Research in Information Management at Dublin City University. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Alfred V. Aho,Anastassia Ailamaki 2002 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Amr El Abbadi,Theodore Johnson,Richard T. Snodgrass 2002 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Mary F. Fernandez,Kyuseok Shim 2002 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Flip Korn,Renée J. Miller,Kaladhar Voruganti 2002 Reminiscences on Influential Papers. SIGMOD Record Data Modelling versus Ontology Engineering. Peter Spyns,Robert Meersman,Mustafa Jarrar 2002 "Ontologies in current computer science parlance are computer based resources that represent agreed domain semantics. Unlike data models, the fundamental asset of ontologies is their relative independence of particular applications, i.e. an ontology consists of relatively generic knowledge that can be reused by different kinds of applications/tasks. The first part of this paper concerns some aspects that help to understand the differences and similarities between ontologies and data models. In the second part we present an ontology engineering framework that supports and favours the genericity of an ontology. We introduce the DOGMA ontology engineering approach that separates ""atomic"" conceptual relations from ""predicative"" domain rules. A DOGMA ontology consists of an ontology base that holds sets of intuitive context-specific conceptual relations and a layer of ""relatively generic"" ontological commitments that hold the domain rules. This constitutes what we shall call the double articulation of a DOGMA ontology." SIGMOD Record Amicalola Report: Database and Information System Research Challenges and Opportunities in Semantic Web and Enterprises. Amit P.
Sheth,Robert Meersman 2002 Amicalola Report: Database and Information System Research Challenges and Opportunities in Semantic Web and Enterprises. SIGMOD Record Too Much Middleware. Michael Stonebraker 2002 The movement from client-server computing to multi-tier computing has created a potpourri of so-called middleware systems, including application servers, workflow products, EAI systems, ETL systems and federated data systems. In this paper we argue that the explosion in middleware has created a myriad of poorly integrated systems with overlapping functionality. The world would be well served by considerable consolidation, and we present some of the ways this might happen. Some of the points covered in this paper have been previously explored in [BERN96]. SIGMOD Record Report on the Mobile Search Workshop at WWW 2002. Aya Soffer,Yoëlle S. Maarek,Bay-Wei Chang 2002 Report on the Mobile Search Workshop at WWW 2002. SIGMOD Record The n-Tier Hub Technology. Rainer A. Sommer,Thomas R. Gulledge,David Bailey 2002 During 2001, the Enterprise Engineering Laboratory at George Mason University was contracted by the Boeing Company to develop an eHub capability for aerospace suppliers in Taiwan. In a laboratory environment, the core technology was designed, developed, and tested, and now a large first-tier aerospace supplier in Taiwan is commercializing the technology. The project objective was to provide layered network and application services for transporting XML-based business transaction flows across multi-tier, heterogeneous data processing environments. This paper documents the business scenario, the eHub application, and the network transport mechanisms that were used to build the n-tier hub. Contrary to most eHubs, this solution takes the point of view of suppliers, pushing data in accordance with supplier requirements; hence, enhancing the probability of supplier adoption. The unique contribution of this project is the development of an eHub that meets the needs of Small and Medium Enterprises (SMEs) and first-tier suppliers. SIGMOD Record Bringing Order to Query Optimization. Giedrius Slivinskas,Christian S. Jensen,Richard T. Snodgrass 2002 A variety of developments combine to highlight the need for respecting order when manipulating relations. For example, new functionality is being added to SQL to support OLAP-style querying in which order is frequently an important aspect. The set- or multiset-based frameworks for query optimization that are currently being taught to database students are increasingly inadequate.This paper presents a foundation for query optimization that extends existing frameworks to also capture ordering. A list-based relational algebra is provided along with three progressively stronger types of algebraic equivalences, concrete query transformation rules that obey the different equivalences, and a procedure for determining which types of transformation rules are applicable for optimizing a query. The exposition follows the style chosen by many textbooks, making it relatively easy to teach this material in continuation of the material covered in the textbooks, and to integrate this material into the textbooks. SIGMOD Record Why I Like Working in Academia. Richard T. Snodgrass 2002 "When Alex Labrinidis asked me to write this essay, I initially balked. I was loathe to speak for academics worldwide, or even just those in SIGMOD. But I then realized that I could speak from personal experience. 
So these random musings will be of necessity entirely subjective, highly individualistic, and unrepresentative---attributes that a scholar normally attempts to vigorously avoid in his writing. I'm definitely not a ""typical"" academic (I don't know such an animal), but I can speak with some authority as to what motivates me.As another caveat, I make few comparisons with alternatives such as working in a research lab or as a developer. I won't even attempt to speak for them.The final caveat (distrust all commentaries that start with caveats, but perhaps more so those that don't!) is that my assumed audience comprises students who are considering such a profession. Current academics will find some of my observations trite or may disagree loudly, as academics are oft to do (see below).That said, I have been an academic for exactly twenty years, and I deeply love the academic life. While I have consulted for and written papers with those away from the ivory tower, my professional life has been entirely as a professor. I went forthwith from undergraduate school to graduate school, then directly to the University of North Carolina, then to the University of Arizona, where I am happily ensconced.I open with some disadvantages to this seemingly ideal life, then turn to the advantages. With each I start with those that I expected when I was a doctoral student, and then consider those I (naïvely or otherwise) was not aware of from that early vantage point." SIGMOD Record TODS Perceptions and Misconceptions. Richard T. Snodgrass 2002 TODS Perceptions and Misconceptions. SIGMOD Record The XML Typechecking Problem. Dan Suciu 2002 "When an XML document conforms to a given type (e.g. a DTD or an XML Schema type) it is called a valid document. Checking if a given XML document is valid is called the validation problem, and is typically performed by a parser (hence, validating parser), more precisely it is performed right after parsing, by the same program module. In practice however XML documents are often generated dynamically, by some program: checking whether all XML documents generated by the program are valid w.r.t. a given type is called the typechecking problem. While a validation analyzes an XML document, a type checker analyzes a program, and the problem's difficulty is a function of the language in which that program is expressed. The XML typechecking problem has been investigated recently in [MSV00, HP00, HVP00, AMN+01a, AMN+01b] and the XQuery Working Group adopted some of these techniques for typechecking XQuery [FFM+01]). All these techniques, however, have limitations which need to be understood and further explored and investigated. In this paper we define the XML typechecking problem, and present current approaches to typechecking, discussing their limitations." SIGMOD Record Rights of TODS Readers, Authors and Reviewers. Richard T. Snodgrass 2002 Rights of TODS Readers, Authors and Reviewers. SIGMOD Record Methodology for Development and Employment of Ontology Based Knowledge Management Applications. York Sure,Steffen Staab,Rudi Studer 2002 In this article we illustrate a methodology for introducing and maintaining ontology based knowledge management applications into enterprises with a focus on Knowledge Processes and Knowledge Meta Processes. While the former process circles around the usage of ontologies, the latter process guides their initial set up. We illustrate our methodology by an example from a case study on skills management. 
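The XML typechecking overview above distinguishes validating one concrete document from typechecking a program that generates documents. As a minimal sketch of the simpler validation side, assuming a toy DTD-like type in which each element's allowed children are given by a regular expression (the element names and content models below are invented for illustration and are not part of any standard schema language):

import re
import xml.etree.ElementTree as ET

# Toy content models: element tag -> regular expression over its children's tags,
# written over "tag;" tokens. Text content is ignored in this sketch.
CONTENT_MODEL = {
    "bib":    r"(book;)*",
    "book":   r"(title;)(author;)+",
    "title":  r"",
    "author": r"",
}

def valid(elem):
    # An element is valid if its tag has a model, its child sequence matches the
    # model, and every child is recursively valid.
    model = CONTENT_MODEL.get(elem.tag)
    children = "".join(c.tag + ";" for c in elem)
    if model is None or not re.fullmatch(model, children):
        return False
    return all(valid(c) for c in elem)

doc = ET.fromstring("<bib><book><title>TIX</title><author>A</author></book></bib>")
print(valid(doc))    # True; removing the <author> element would make it False

Typechecking, by contrast, would have to establish that every document the generating program can possibly emit passes such a check, which is why its difficulty depends on the language in which that program is written.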
SIGMOD Record Small Worlds: the Dynamics of Networks between Order and Randomness - Book Review. Jie Wu 2002 Small Worlds: the Dynamics of Networks between Order and Randomness - Book Review. SIGMOD Record The Design and Performance Evaluation of Alternative XML Storage Strategies. Feng Tian,David J. DeWitt,Jianjun Chen,Chun Zhang 2002 This paper studies five strategies for storing XML documents including one that leaves documents in the file system, three that use a relational database system, and one that uses an object manager. We implement and evaluate each approach using a number of XQuery queries. A number of interesting insights are gained from these experiments and a summary of the advantages and disadvantages of the approaches is presented. SIGMOD Record Foundations of Statistical Natural Language Processing - Book Review. Gerhard Weikum 2002 Foundations of Statistical Natural Language Processing - Book Review. SIGMOD Record Investigating XQuery for Querying Across Database Object Types. Nancy Wiegand 2002 In addition to facilitating querying over the Web, XML query languages may provide high level constructs for useful facilities in traditional DBMSs that do not currently exist. In particular, current DBMS query languages do not allow querying across database object types to yield heterogeneous results. This paper motivates the usefulness of heterogeneous querying in traditional DBMSs and investigates XQuery, an emerging standard for XML query languages, to express such queries. The usefulness of querying and storing heterogeneous types is also applied to XML data within a Web information system. SIGMOD Record Interview with Avi Silberschatz. Marianne Winslett 2002 Interview with Avi Silberschatz. SIGMOD Record David DeWitt Speaks Out. Marianne Winslett 2002 David DeWitt Speaks Out. SIGMOD Record Hector Garcia-Molina Speaks Out. Marianne Winslett 2002 Hector Garcia-Molina Speaks Out. SIGMOD Record Interview with David Maier. Marianne Winslett 2002 Interview with David Maier. SIGMOD Record Database Research at the University of Illinois at Urbana-Champaign. Marianne Winslett,Kevin Chen-Chuan Chang,AnHai Doan,Jiawei Han,ChengXiang Zhai,Yuanyuan Zhou 2002 Database Research at the University of Illinois at Urbana-Champaign. SIGMOD Record The Cougar Approach to In-Network Query Processing in Sensor Networks. Yong Yao,Johannes Gehrke 2002 The widespread distribution and availability of small-scale sensors, actuators, and embedded processors is transforming the physical world into a computing platform. One such example is a sensor network consisting of a large number of sensor nodes that combine physical sensing capabilities such as temperature, light, or seismic sensors with networking and computation capabilities. Applications range from environmental control, warehouse inventory, and health care to military environments. Existing sensor networks assume that the sensors are preprogrammed and send data to a central frontend where the data is aggregated and stored for offline querying and analysis. This approach has two major drawbacks. First, the user cannot change the behavior of the system on the fly. Second, conservation of battery power is a major design factor, but a central system cannot make use of in-network programming, which trades costly communication for cheap local computation.In this paper, we introduce the Cougar approach to tasking sensor networks through declarative queries. 
Given a user query, a query optimizer generates an efficient query plan for in-network query processing, which can vastly reduce resource usage and thus extend the lifetime of a sensor network. In addition, since queries are asked in a declarative language, the user is shielded from the physical characteristics of the network. We give a short overview of sensor networks, propose a natural architecture for a data management system for sensor networks, and describe open research problems in this area. ICDE Personalized Services for Mobile Route Planning. Wolf-Tilo Balke,Werner Kießling,Christoph Unbehend 2003 Personalized Services for Mobile Route Planning. ICDE User Interaction in the BANKS System. B. Aditya,Soumen Chakrabarti,Rushi Desai,Arvind Hulgeri,Hrishikesh Karambelkar,Rupesh Nasre,Parag,S. Sudarshan 2003 User Interaction in the BANKS System. ICDE QRS: A Robust Numbering Scheme for XML Documents. Toshiyuki Amagasa,Masatoshi Yoshikawa,Shunsuke Uemura 2003 QRS: A Robust Numbering Scheme for XML Documents. ICDE PIX: A System for Phrase Matching in XML Documents. Sihem Amer-Yahia,Mary F. Fernández,Divesh Srivastava,Yu Xu 2003 PIX: A System for Phrase Matching in XML Documents. ICDE XML Publishing: Look at Siblings too! Sihem Amer-Yahia,Yannis Kotidis,Divesh Srivastava 2003 XML Publishing: Look at Siblings too! ICDE Approximate Matching in XML. Sihem Amer-Yahia,Nick Koudas,Divesh Srivastava 2003 Approximate Matching in XML. ICDE Scalable template-based query containment checking for web semantic caches. Khalil Amiri,Sanghyun Park,Renu Tewari,Sriram Padmanabhan 2003 Scalable template-based query containment checking for web semantic caches. ICDE Querying Text Databases for Efficient Information Extraction. Eugene Agichtein,Luis Gravano 2003 Querying Text Databases for Efficient Information Extraction. ICDE DBProxy: A dynamic data cache for Web applications. Khalil Amiri,Sanghyun Park,Renu Tewari,Sriram Padmanabhan 2003 DBProxy: A dynamic data cache for Web applications. ICDE Database Technologies for E- Commerce. Rakesh Agrawal 2003 Database Technologies for E- Commerce. ICDE Automating Layout of Relational Databases. Rakesh Agrawal,Surajit Chaudhuri,Abhinandan Das,Vivek R. Narasayya 2003 Automating Layout of Relational Databases. ICDE Extracting Structured Data from Web Pages. Arvind Arasu,Hector Garcia-Molina 2003 Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases. ICDE Implementing P3P Using Database Technology. Rakesh Agrawal,Jerry Kiernan,Ramakrishnan Srikant,Yirong Xu 2003 Implementing P3P Using Database Technology. ICDE Efficient Computation of Subqueries in Complex OLAP. Michael O. Akinde,Michael H. 
Böhlen 2003 Efficient Computation of Subqueries in Complex OLAP. ICDE XomatiQ: Living With Genomes, Proteomes, Relations and a Little Bit of XML. Sourav S. Bhowmick,Pedro Cruz,Amey V. Laud 2003 XomatiQ: Living With Genomes, Proteomes, Relations and a Little Bit of XML. ICDE Streaming XPath Processing with Forward and Backward Axes. Charles Barton,Philippe Charles,Deepak Goyal,Mukund Raghavachari,Marcus Fontoura,Vanja Josifovski 2003 Streaming XPath Processing with Forward and Backward Axes. ICDE Cyber Infrastructure for Informatics. Chaitan Baru 2003 Cyber Infrastructure for Informatics. ICDE Handling Evolutions in Multidimensional Structures. Mathurin Body,Maryvonne Miquel,Yvan Bédard,Anne Tchounikine 2003 Handling Evolutions in Multidimensional Structures. ICDE Web Information Acquisition with Lixto Suite. Robert Baumgartner,Michal Ceresna,Georg Gottlob,Marcus Herzog,Viktor Zigo 2003 Web Information Acquisition with Lixto Suite. ICDE Bridging the XML Relational Divide with LegoDB. Philip Bohannon,Juliana Freire,Jayant R. Haritsa,Maya Ramanath,Prasan Roy,Jérôme Siméon 2003 Bridging the XML Relational Divide with LegoDB. ICDE Efficient Creation of Statistics over Query Expressions. Nicolas Bruno,Surajit Chaudhuri 2003 Efficient Creation of Statistics over Query Expressions. ICDE Navigation- vs. Index-Based XML Multi-Query Processing. Nicolas Bruno,Luis Gravano,Nick Koudas,Divesh Srivastava 2003 Navigation- vs. Index-Based XML Multi-Query Processing. ICDE Dynamic Access Control Framework Based On Events. Manish Bhide,Sandeep Pandey,Ajay Gupta,Mukesh K. Mohania 2003 Dynamic Access Control Framework Based On Events. ICDE Storage and Retrieval of XML Data using Relational Databases. Surajit Chaudhuri,Kyuseok Shim 2003 Storage and Retrieval of XML Data using Relational Databases. ICDE SWAT: Hierarchical Stream Summarization in Large Networks. Ahmet Bulut,Ambuj K. Singh 2003 SWAT: Hierarchical Stream Summarization in Large Networks. ICDE Scalable Application-Aware Data Freshening. Donald Carney,Sangdon Lee,Stanley B. Zdonik 2003 Scalable Application-Aware Data Freshening. ICDE Design and Implementation of a Temporal Extension of SQL. Cindy Xinmin Chen,Jiejun Kong,Carlo Zaniolo 2003 Design and Implementation of a Temporal Extension of SQL. ICDE Shared Cache - The Future of Parallel Databases. Sashikanth Chandrasekaran,Roger Bamford 2003 Shared Cache - The Future of Parallel Databases. ICDE Querying Imprecise Data in Moving Object Environments. Reynold Cheng,Sunil Prabhakar,Dmitri V. Kalashnikov 2003 Querying Imprecise Data in Moving Object Environments. ICDE Profile-Driven Cache Management. Mitch Cherniack,Eduardo F. Galvez,Michael J. Franklin,Stanley B. Zdonik 2003 Profile-Driven Cache Management. ICDE Skyline with Presorting. Jan Chomicki,Parke Godfrey,Jarek Gryz,Dongming Liang 2003 Skyline with Presorting. ICDE Data Engineering for Mobile and Wireless Access. Panos K. Chrysanthis,Vijay Kumar,Evaggelia Pitoura 2003 Data Engineering for Mobile and Wireless Access. ICDE Language Models for Information Retrieval. W. Bruce Croft 2003 Language Models for Information Retrieval. ICDE Propagating XML Constraints to Relations. Susan B. Davidson,Wenfei Fan,Carmem S. Hara,Jing Qin 2003 Propagating XML Constraints to Relations. ICDE Preference-Driven Query Processing. Pin-Kwang Eng,Beng Chin Ooi,Hua Soon Sim,Kian-Lee Tan 2003 Preference-Driven Query Processing. ICDE A Heuristic for Refresh Policy Selection in Heterogeneous Environments. 
Henrik Engström,Sharma Chakravarthy,Brian Lings 2003 A Heuristic for Refresh Policy Selection in Heterogeneous Environments. ICDE The Constraint Database Framework: Lessons Learned from CQA/CDB. Dina Q. Goldin,Ayferi Kutlu,Mingjun Song,Fuzheng Yang 2003 The Constraint Database Framework: Lessons Learned from CQA/CDB. ICDE enTrans: A Demonstration of Flexible Consistency Maintenance in Provisioning Systems. Shaymsunder Gopale,Shridhar Shukla,R. Kul,R. Jha 2003 enTrans: A Demonstration of Flexible Consistency Maintenance in Provisioning Systems. ICDE XPath Query Evaluation: Improving Time and Space Efficiency. Georg Gottlob,Christoph Koch,Reinhard Pichler 2003 XPath Query Evaluation: Improving Time and Space Efficiency. ICDE Text Joins for Data Cleansing and Integration in an RDBMS. Luis Gravano,Panagiotis G. Ipeirotis,Nick Koudas,Divesh Srivastava 2003 Text Joins for Data Cleansing and Integration in an RDBMS. ICDE Index-Based Approximate XML Joins. Sudipto Guha,Nick Koudas,Divesh Srivastava,Ting Yu 2003 Index-Based Approximate XML Joins. ICDE Towards Bringing Database Management Task in the Realm of IT non-Experts. Ajay Gupta,Manish Bhide,Mukesh K. Mohania 2003 Towards Bringing Database Management Task in the Realm of IT non-Experts. ICDE Schema Mediation in Peer Data Management Systems. Alon Y. Halevy,Zachary G. Ives,Dan Suciu,Igor Tatarinov 2003 Schema Mediation in Peer Data Management Systems. ICDE HD-Eye - Visual Clustering of High dimensional Data. Alexander Hinneburg,Daniel A. Keim,Markus Wawryniuk 2003 Clustering of large data bases is an important research area with a large variety of applications in the data base context. Missing in most of the research efforts are means for guiding the clustering process and understanding the results, which is especially important for high dimensional data. Visualization technology may help to solve this problem since it provides effective support of different clustering paradigms and allows a visual inspection of the results. The HD-Eye (high-dim. eye) system shows that a tight integration of advanced clustering algorithms and state-of-the-art visualization techniques is powerful for a better understanding and effective guidance of the clustering process, and therefore can help to significantly improve the clustering results. ICDE Keyword Proximity Search on XML Graphs. Vagelis Hristidis,Yannis Papakonstantinou,Andrey Balmin 2003 Keyword Proximity Search on XML Graphs. ICDE PXML: A Probabilistic Semistructured Data Model and Algebra. Edward Hung,Lise Getoor,V. S. Subrahmanian 2003 PXML: A Probabilistic Semistructured Data Model and Algebra. ICDE Broadcasting Dependent Data for Ordered Queries without Replication in a Multi-Channel Mobile Environment. Jiun-Long Huang,Ming-Syan Chen,Wen-Chih Peng 2003 Broadcasting Dependent Data for Ordered Queries without Replication in a Multi-Channel Mobile Environment. ICDE The QUIQ Engine: A Hybrid IR DB System. Navin Kabra,Raghu Ramakrishnan,Vuk Ercegovac 2003 The QUIQ Engine: A Hybrid IR DB System. ICDE Fast alignment of large genome databases. Tamer Kahveci,Ambuj K. Singh 2003 Fast alignment of large genome databases. ICDE Evaluating Window Joins over Unbounded Streams. Jaewoo Kang,Jeffrey F. Naughton,Stratis Viglas 2003 Evaluating Window Joins over Unbounded Streams. ICDE Out-of-the Box Data Engineering - Events in Heterogeneous Environments. Ramesh Jain 2003 Out-of-the Box Data Engineering - Events in Heterogeneous Environments. ICDE Spatial Processing using Oracle Table Functions. 
Kothuri Venkata Ravi Kanth,Siva Ravada,W. Xu 2003 Spatial Processing using Oracle Table Functions. ICDE XR-Tree: Indexing XML Data for Efficient Structural Joins. Haifeng Jiang,Hongjun Lu,Wei Wang,Beng Chin Ooi 2003 XR-Tree: Indexing XML Data for Efficient Structural Joins. ICDE An Adaptive and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing. Hui Jin,Beng Chin Ooi,Heng Tao Shen,Cui Yu,Aoying Zhou 2003 An Adaptive and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing. ICDE Managing Data Mappings in the Hyperion Project. Anastasios Kementsietsidis,Marcelo Arenas,Renée J. Miller 2003 Managing Data Mappings in the Hyperion Project. ICDE Databases for Ambient Intelligence. Martin L. Kersten 2003 Databases for Ambient Intelligence. ICDE Joining Massive High-Dimensional Datasets. Tamer Kahveci,Christian A. Lang,Ambuj K. Singh 2003 Joining Massive High-Dimensional Datasets. ICDE Super-Fast XML Wrapper Generation in DB2: A Demonstration. Vanja Josifovski,Sabine Massmann,Felix Naumann 2003 Super-Fast XML Wrapper Generation in DB2: A Demonstration. ICDE Querying XML data sources in DB2: the XML Wrapper. Vanja Josifovski,Peter M. Schwarz 2003 Querying XML data sources in DB2: the XML Wrapper. ICDE Multicasting a Changing Repository. Wang Lam,Hector Garcia-Molina 2003 Multicasting a Changing Repository. ICDE Discovery of High-Dimensional. Andreas Koeller,Elke A. Rundensteiner 2003 Discovery of High-Dimensional. ICDE Capturing Sensor-Generated Time Series with Quality Guarantees. Iosif Lazaridis,Sharad Mehrotra 2003 Capturing Sensor-Generated Time Series with Quality Guarantees. ICDE Index Hint for On-demand Broadcasting. Sangdon Lee,Donald Carney,Stanley B. Zdonik 2003 Index Hint for On-demand Broadcasting. ICDE Low Overhead Optimal Checkpointing for Mobile Distributed Systems. L. Kumar,Muldip Mishra,Ramesh C. Joshi 2003 Low Overhead Optimal Checkpointing for Mobile Distributed Systems. ICDE A Database for Storage and Fast Retrieval of Structure Data. S. Kumar,S. Srinivasa 2003 A Database for Storage and Fast Retrieval of Structure Data. ICDE An Optimized Multicast-based Data Dissemination Middleware. Wei Li,Wenhui Zhang,Vincenzo Liberatore,Vince Penkrot,Jonathan Beaver,Mohamed A. Sharaf,Siddhartha Roychowdhury,Panos K. Chrysanthis,Kirk Pruhs 2003 An Optimized Multicast-based Data Dissemination Middleware. ICDE What Makes the Differences: Benchmarking XML Database Implementations. Hongjun Lu,Jeffrey Xu Yu,Guoren Wang,Shihui Zheng,Haifeng Jiang,Ge Yu,Aoying Zhou 2003 "XML is emerging as a major standard for representing data on the World Wide Web. Recently, many XML storage models have been proposed to manage XML data. In order to assess an XML database's abilities to deal with XML queries, several benchmarks have also been proposed, including XMark and XMach. However, no reported studies using those benchmarks were found that can provide users with insights on the impacts of a variety of storage models on XML query performance. In this article, we report our first set of results on benchmarking a set of XML database implementations using two XML benchmarks. The selected implementations represent a wide range of approaches, including RDBMS-based systems with document-independent and document-dependent XML-relational schema mapping approaches, and XML native engines based on an Object-Oriented Model and the Document Object Model. 
Comprehensive experiments were conducted to study relative performance of different approaches and the important issues that affect XML query performance, such as path expression query processing, effectiveness of various partitioning, label-path, and indexing structures." ICDE A Comparison of Three Methods for Join View Maintenance in Parallel RDBMS. Gang Luo,Jeffrey F. Naughton,Curt J. Ellmann,Michael Watzke 2003 A Comparison of Three Methods for Join View Maintenance in Parallel RDBMS. ICDE Similarity Search in Sets and Categorical Data Using the Signature Tree. Nikos Mamoulis,David W. Cheung,Wang Lian 2003 Similarity Search in Sets and Categorical Data Using the Signature Tree. ICDE Catalog Integration Made Easy. Pedro José Marrón,Georg Lausen,Martin Weber 2003 Catalog Integration Made Easy. ICDE Data Integration by Bi-Directional Schema Transformation Rules. Peter McBrien,Alexandra Poulovassilis 2003 Data Integration by Bi-Directional Schema Transformation Rules. ICDE SG-WRAM Schema Guided Wrapper Maintenance. Xiaofeng Meng,Haiyan Wang,Dongdong Hu,Mingzhe Gu 2003 SG-WRAM Schema Guided Wrapper Maintenance. ICDE Visual Querying and Exploration of Large Answers in XML Databases with X2. Holger Meuss,Klaus U. Schulz,François Bry 2003 Visual Querying and Exploration of Large Answers in XML Databases with X2. ICDE Application Servers and Associated Technologies. C. Mohan 2003 Application Servers and Associated Technologies. ICDE Spectral LPM: An Optimal Locality-Preserving Mapping using the Spectral (not Fractal) Order. Mohamed F. Mokbel,Walid G. Aref,Ananth Grama 2003 Spectral LPM: An Optimal Locality-Preserving Mapping using the Spectral (not Fractal) Order. ICDE Business Process Management Systems. Anil Nori 2003 Business Process Management Systems. ICDE Supporting Ancillary Values from User Defined Functions in Oracle. Ravi Murthy,Ying Hu,Seema Sundara,Timothy Chorma,Nipun Agarwal,Jagannathan Srinivasan 2003 Supporting Ancillary Values from User Defined Functions in Oracle. ICDE An Evaluation of Regular Path Expressions with Qualifiers against XML Streams. Dan Olteanu,Tobias Kiesling,François Bry 2003 An Evaluation of Regular Path Expressions with Qualifiers against XML Streams. ICDE PeerDB: A P2P-based System for Distributed Data Sharing. Wee Siong Ng,Beng Chin Ooi,Kian-Lee Tan,Aoying Zhou 2003 PeerDB: A P2P-based System for Distributed Data Sharing. ICDE Personalized Portals for the Wireless and Mobile User; a Mobile Agent Approach. Christoforos Panayiotou,George Samaras 2003 Personalized Portals for the Wireless and Mobile User; a Mobile Agent Approach. ICDE StegFS: A Steganographic File System. HweeHwa Pang,Kian-Lee Tan,Xuan Zhou 2003 StegFS: A Steganographic File System. ICDE LOCI: Fast Outlier Detection Using the Local Correlation Integral. Spiros Papadimitriou,Hiroyuki Kitagawa,Phillip B. Gibbons,Christos Faloutsos 2003 LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE Streaming XPath Queries in XSQ. Feng Peng,Sudarshan S. Chawathe 2003 Streaming XPath Queries in XSQ. ICDE Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. Roland Pieringer,Klaus Elhardt,Frank Ramsak,Volker Markl,Robert Fenk,Rudolf Bayer,Nikos Karayannidis,Aris Tsois,Timos K. Sellis 2003 Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. ICDE Generalized Closed Itemsets for Association Rule Mining. Vikram Pudi,Jayant R. Haritsa 2003 Generalized Closed Itemsets for Association Rule Mining. 
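The Cougar abstract earlier in this listing argues that in-network processing trades costly communication for cheap local computation. A minimal sketch of that idea for a single aggregate (the node class and topology below are invented for illustration; this is not the Cougar system): rather than shipping every reading to a central front end, each node forwards one small (sum, count) partial aggregate up a routing tree, and only the root computes the final average.

# Hypothetical sensor tree: each node holds a local reading and a list of children.
class Node:
    def __init__(self, reading, children=()):
        self.reading = reading
        self.children = list(children)

    def partial_avg(self):
        # Combine this node's reading with the partial aggregates of its subtree;
        # only one (sum, count) pair crosses each link, regardless of subtree size.
        s, c = self.reading, 1
        for child in self.children:
            cs, cc = child.partial_avg()
            s, c = s + cs, c + cc
        return s, c

# Three sensors under one relay node; the root computes the network-wide average.
root = Node(20.0, [Node(21.5), Node(19.0, [Node(22.5)])])
total, count = root.partial_avg()
print(total / count)   # 20.75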
ICDE Representing Web Graphs. Sriram Raghavan,Hector Garcia-Molina 2003 Representing Web Graphs. ICDE Using State Modules for Adaptive Query Processing. Vijayshankar Raman,Amol Deshpande,Joseph M. Hellerstein 2003 Using State Modules for Adaptive Query Processing. ICDE MEMS-based Disk Buffer for Streaming Media Servers. Raju Rangaswami,Zoran Dimitrijevic,Edward Y. Chang,Klaus E. Schauser 2003 MEMS-based Disk Buffer for Streaming Media Servers. ICDE Flux: An Adaptive Partitioning Operator for Continuous Query Systems. Mehul A. Shah,Joseph M. Hellerstein,Sirish Chandrasekaran,Michael J. Franklin 2003 The long-running nature of continuous queries coupled with their high scalability requirements poses new challenges for dataflow processing. CQ systems execute pipelined dataflows that are shared across multiple queries and whose scalability is limited by their constituent, stateful operators -- e.g. a windowed groupby-aggregate. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide several factors improvement in throughput, and orders of magnitude improvement in average latency over the static case. ICDE Distance Based Indexing for String Proximity Search. Süleyman Cenk Sahinalp,Murat Tasan,Jai Macker,Z. Meral Özsoyoglu 2003 Distance Based Indexing for String Proximity Search. ICDE Sequence Data Mining Techniques and Applications. Sunita Sarawagi 2003 Sequence Data Mining Techniques and Applications. ICDE Scaling up the ALIAS Duplicate Elimination System. Sunita Sarawagi,Alok Kirpal 2003 Scaling up the ALIAS Duplicate Elimination System. ICDE HDoV-tree: The Structure, The Storage, The Speed. Lidan Shou,Zhiyong Huang,Kian-Lee Tan 2003 HDoV-tree: The Structure, The Storage, The Speed. ICDE A Framework for the Selective Dissemination of XML Documents based on Inferred User Profiles. Ioana Stanoi,George A. Mihaila,Sriram Padmanabhan 2003 A Framework for the Selective Dissemination of XML Documents based on Inferred User Profiles. ICDE Coordinated Management of Cascaded Caches for Efficient Content Distribution. Xueyan Tang,Samuel T. Chanson 2003 Coordinated Management of Cascaded Caches for Efficient Content Distribution. ICDE Selectivity Estimation for Predictive Spatio-Temporal Queries. Yufei Tao,Jimeng Sun,Dimitris Papadias 2003 Selectivity Estimation for Predictive Spatio-Temporal Queries. ICDE Ranked Join Indices. Panayiotis Tsaparas,Themistoklis Palpanas,Yannis Kotidis,Nick Koudas,Divesh Srivastava 2003 Ranked Join Indices. ICDE GRAM++ - An Indigenous GIS Software Suite Demonstration. P. Venkatachalam,B. K. Mohan 2003 GRAM++ - An Indigenous GIS Software Suite Demonstration. ICDE X-Diff: An Effective Change Detection Algorithm for XML Documents. Yuan Wang,David J.
DeWitt,Jin-yi Cai 2003 X-Diff: An Effective Change Detection Algorithm for XML Documents. ICDE PBiTree Coding and Efficient Processing of Containment Joins. Wei Wang,Haifeng Jiang,Hongjun Lu,Jeffrey Xu Yu 2003 PBiTree Coding and Efficient Processing of Containment Joins. ICDE Pushing Aggregate Constraints by Divide-and-Approximate. Ke Wang,Yuelong Jiang,Jeffrey Xu Yu,Guozhu Dong,Jiawei Han 2003 Pushing Aggregate Constraints by Divide-and-Approximate. ICDE Indexing Weighted-Sequences in Large Databases. Haixun Wang,Chang-Shing Perng,Wei Fan,Sanghyun Park,Philip S. Yu 2003 Indexing Weighted-Sequences in Large Databases. ICDE Structural Join Order Selection for XML Query Optimization. Yuqing Wu,Jignesh M. Patel,H. V. Jagadish 2003 Structural Join Order Selection for XML Query Optimization. ICDE Mining Customer Value: From Association Rules to Direct Marketing. Ke Wang,Senqiang Zhou,Jack Man Shun Yeung,Qiang Yang 2003 Mining Customer Value: From Association Rules to Direct Marketing. ICDE Energy Efficient Index for Querying Location-Dependent Data in Mobile Broadcast Environments. Jianliang Xu,Baihua Zheng,Wang-Chien Lee,Dik Lun Lee 2003 Energy Efficient Index for Querying Location-Dependent Data in Mobile Broadcast Environments. ICDE Dynamic Clustering of Evolving Streams with a Single Pass. Jiong Yang 2003 Dynamic Clustering of Evolving Streams with a Single Pass. ICDE Designing a Super-Peer Network. Beverly Yang,Hector Garcia-Molina 2003 Designing a Super-Peer Network. ICDE CLUSEQ: Efficient and Effective Sequence Clustering. Jiong Yang,Wei Wang 2003 CLUSEQ: Efficient and Effective Sequence Clustering. ICDE Fast Data Access on Multiple Broadcast Channels. Wai Gen Yee,Shamkant B. Navathe 2003 Fast Data Access on Multiple Broadcast Channels. ICDE Efficient Maintenance of Materialized Top-k Views. Ke Yi,Hai Yu,Jun Yang,Gangqiang Xia,Yuguo Chen 2003 Efficient Maintenance of Materialized Top-k Views. ICDE Medical Video Mining for Efficient Database Indexing, Management and Access. Xingquan Zhu,Walid G. Aref,Jianping Fan,Ann Christine Catlin,Ahmed K. Elmagarmid 2003 Medical Video Mining for Efficient Database Indexing, Management and Access. SIGMOD Conference STREAM: The Stanford Stream Data Manager. Arvind Arasu,Brian Babcock,Shivnath Babu,Mayur Datar,Keith Ito,Itaru Nishizawa,Justin Rosenstein,Jennifer Widom 2003 STREAM: The Stanford Stream Data Manager. SIGMOD Conference Extracting Structured Data from Web Pages. Arvind Arasu,Hector Garcia-Molina 2003 Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases. SIGMOD Conference Chain : Operator Scheduling for Memory Minimization in Data Stream Systems. 
Brian Babcock,Shivnath Babu,Mayur Datar,Rajeev Motwani 2003 Chain : Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD Conference Dynamic Sample Selection for Approximate Query Processing. Brian Babcock,Surajit Chaudhuri,Gautam Das 2003 In decision support applications, the ability to provide fast approximate answers to aggregation queries is desirable. One commonly-used technique for approximate query answering is sampling. For many aggregation queries, appropriately constructed biased (non-uniform) samples can provide more accurate approximations than a uniform sample. The optimal type of bias, however, varies from query to query. In this paper, we describe an approximate query processing technique that dynamically constructs an appropriately biased sample for each query by combining samples selected from a family of non-uniform samples that are constructed during a pre-processing phase. We show that dynamic selection of appropriate portions of previously constructed samples can provide more accurate approximate answers than static, non-adaptive usage of uniform or non-uniform samples. SIGMOD Conference Distributed Top-K Monitoring. Brian Babcock,Chris Olston 2003 "The querying and analysis of data streams has been a topic of much recent interest, motivated by applications from the fields of networking, web usage analysis, sensor instrumentation, telecommunications, and others. Many of these applications involve monitoring answers to continuous queries over data streams produced at physically distributed locations, and most previous approaches require streams to be transmitted to a single location for centralized processing. Unfortunately, the continual transmission of a large number of rapid data streams to a central location can be impractical or expensive. We study a useful class of queries that continuously report the k largest values obtained from distributed data streams (""top-k monitoring queries""), which are of particular interest because they can be used to reduce the overhead incurred while running other types of monitoring queries. We show that transmitting entire data streams is unnecessary to support these queries and present an alternative approach that reduces communication significantly. In our approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a user-specified error tolerance. Distributed communication is only necessary on occasion, when constraints are violated, and we show empirically through extensive simulation on real-world data that our approach reduces overall communication cost by an order of magnitude compared with alternatives that offer the same error guarantees." SIGMOD Conference Aurora: A Data Stream Management System. Daniel J. Abadi,Donald Carney,Ugur Çetintemel,Mitch Cherniack,Christian Convey,C. Erwin,Eduardo F. Galvez,M. Hatoun,Anurag Maskey,Alex Rasin,A. Singer,Michael Stonebraker,Nesime Tatbul,Ying Xing,R. Yan,Stanley B. Zdonik 2003 Aurora: A Data Stream Management System. SIGMOD Conference A Framework for Change Diagnosis of Data Streams. Charu C. Aggarwal 2003 A Framework for Change Diagnosis of Data Streams. SIGMOD Conference Dynamic XML documents with distribution and replication.
Serge Abiteboul,Angela Bonifati,Gregory Cobena,Ioana Manolescu,Tova Milo 2003 "The advent of XML as a universal exchange format, and of Web services as a basis for distributed computing, has fostered the apparition of a new class of documents: dynamic XML documents. These are XML documents where some data is given explicitly while other parts are given only intensionally by means of embedded calls to web services that can be called to generate the required information. By the sole presence of Web services, dynamic documents already include inherently some form of distributed computation. A higher level of distribution that also allows (fragments of) dynamic documents to be distributed and/or replicated over several sites is highly desirable in today's Web architecture, and in fact is also relevant for regular (non dynamic) documents.The goal of this paper is to study new issues raised by the distribution and replication of dynamic XML data. Our study has originated in the context of the Active XML system [1, 3, 22] but the results are applicable to many other systems supporting dynamic XML data. Starting from a data model and a query language, we describe a complete framework for distributed and replicated dynamic XML documents. We provide a comprehensive cost model for query evaluation and show how it applies to user queries and service calls. Finally, we describe an algorithm that, for a given peer, chooses data and services that the peer should replicate to improve the efficiency of maintaining and querying its dynamic data." SIGMOD Conference Information Sharing Across Private Databases. Rakesh Agrawal,Alexandre V. Evfimievski,Ramakrishnan Srikant 2003 Literature on information integration across databases tacitly assumes that the data in each database can be revealed to the other databases. However, there is an increasing need for sharing information across autonomous entities in such a way that no information apart from the answer to the query is revealed. We formalize the notion of minimal information sharing across private databases, and develop protocols for intersection, equijoin, intersection size, and equijoin size. We also show how new applications can be built using the proposed protocols. SIGMOD Conference A System for Watermarking Relational Databases. Rakesh Agrawal,Peter J. Haas,Jerry Kiernan 2003 A System for Watermarking Relational Databases. SIGMOD Conference QXtract: A Building Block for Efficient Information Extraction from Plain-Text Databases. Eugene Agichtein,Luis Gravano 2003 QXtract: A Building Block for Efficient Information Extraction from Plain-Text Databases. SIGMOD Conference Querying Structured Text in an XML Database. Shurug Al-Khalifa,Cong Yu,H. V. Jagadish 2003 "XML databases often contain documents comprising structured text. Therefore, it is important to integrate ""information retrieval style"" query evaluation, which is well-suited for natural language text, with standard ""database style"" query evaluation, which handles structured queries efficiently. Relevance scoring is central to information retrieval. In the case of XML, this operation becomes more complex because the data required for scoring could reside not directly in an element itself but also in its descendant elements.In this paper, we propose a bulk-algebra, TIX, and describe how it can be used as a basis for integrating information retrieval techniques into a standard pipelined database query evaluation engine. 
We develop new evaluation strategies essential to obtaining good performance, including a stack-based TermJoin algorithm for efficiently scoring composite elements. We report results from an extensive experimental evaluation, which show, among other things, that the new TermJoin access method outperforms a direct implementation of the same functionality using standard operators by a large factor." SIGMOD Conference Capturing both Types and Constraints in Data Integration. Michael Benedikt,Chee Yong Chan,Wenfei Fan,Juliana Freire,Rajeev Rastogi 2003 We propose a framework for integrating data from multiple relational sources into an XML document that both conforms to a given DTD and satisfies predefined XML constraints. The framework is based on a specification language, AIG, that extends a DTD by (1) associating element types with semantic attributes (inherited and synthesized, inspired by the corresponding notions from Attribute Grammars), (2) computing these attributes via parameterized SQL queries over multiple data sources, and (3) incorporating XML keys and inclusion constraints. The novelty of AIG consists in semantic attributes and their dependency relations for controlling context-dependent, DTD-directed construction of XML documents, as well as for checking XML constraints in parallel with document-generation. We also present cost-based optimization techniques for efficiently evaluating AIGs, including algorithms for merging queries and for scheduling queries on multiple data sources. This provides a new grammar-based approach for data integration under both syntactic and semantic constraints. SIGMOD Conference PIX: Exact and Approximate Phrase Matching in XML. Sihem Amer-Yahia,Mary F. Fernández,Divesh Srivastava,Yu Xu 2003 PIX: Exact and Approximate Phrase Matching in XML. SIGMOD Conference Statistical Schema Matching across Web Query Interfaces. Bin He,Kevin Chen-Chuan Chang 2003 "Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this ""deep Web,"" we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy." SIGMOD Conference Formal semantics and analysis of object queries. Gavin M. 
Bierman 2003 Modern database systems provide not only powerful data models but also complex query languages supporting powerful features such as the ability to create new database objects and invocation of arbitrary methods (possibly written in a third-party programming language).In this sense query languages have evolved into powerful programming languages. Surprisingly little work exists utilizing techniques from programming language research to specify and analyse these query languages. This paper provides a formal, high-level operational semantics for a complex-value OQL-like query language that can create fresh database objects, and invoke external methods. We define a type system for our query language and prove an important soundness property.We define a simple effect typing discipline to delimit the computational effects within our queries. We prove that this effect system is correct and show how it can be used to detect cases of non-determinism and to define correct query optimizations. SIGMOD Conference ROLEX: Relational On-Line Exchange with XML. Philip Bohannon,Xin Dong,Sumit Ganguly,Henry F. Korth,Chengkai Li,P. P. S. Narayan,Pradeep Shenoy 2003 ROLEX: Relational On-Line Exchange with XML. SIGMOD Conference DBCache: Middle-tier Database Caching for Highly Scalable e-Business Architectures. Christof Bornhövd,Mehmet Altinel,Sailesh Krishnamurthy,C. Mohan,Hamid Pirahesh,Berthold Reinwald 2003 DBCache: Middle-tier Database Caching for Highly Scalable e-Business Architectures. SIGMOD Conference The Future of Web Services - I. Adam Bosworth 2003 The Future of Web Services - I. SIGMOD Conference The Future of Web services - II. Felipe Cabrera 2003 The Future of Web services - II. SIGMOD Conference XQuery: A Query Language for XML. Donald D. Chamberlin 2003 XQuery is the XML query language currently under development in the World Wide Web Consortium (W3C). XQuery specifications have been published in a series of W3C working drafts, and several reference implementations of the language are already available on the Web. If successful, XQuery has the potential to be one of the most important new computer languages to be introduced in several years. This tutorial will provide an overview of the syntax and semantics of XQuery, as well as insight into the principles that guided the design of the language. SIGMOD Conference TelegraphCQ: Continuous Dataflow Processing. Sirish Chandrasekaran,Owen Cooper,Amol Deshpande,Michael J. Franklin,Joseph M. Hellerstein,Wei Hong,Sailesh Krishnamurthy,Samuel Madden,Frederick Reiss,Mehul A. Shah 2003 TelegraphCQ: Continuous Dataflow Processing. SIGMOD Conference Robust and Efficient Fuzzy Match for Online Data Cleaning. Surajit Chaudhuri,Kris Ganjam,Venkatesh Ganti,Rajeev Motwani 2003 To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. 
We demonstrate the effectiveness of our techniques by evaluating them on real datasets. SIGMOD Conference Factorizing Complex Predicates in Queries to Exploit Indexes. Surajit Chaudhuri,Prasanna Ganesan,Sunita Sarawagi 2003 Decision-support applications generate queries with complex predicates. We show how the factorization of complex query expressions exposes significant opportunities for exploiting available indexes. We also present a novel idea of relaxing predicates in a complex condition to create possibilities for factoring. Our algorithms are designed for easy integration with existing query optimizers and support multiple optimization levels, providing different trade-offs between plan complexity and optimization time. SIGMOD Conference On Relational Support for XML Publishing: Beyond Sorting and Tagging. Surajit Chaudhuri,Raghav Kaushik,Jeffrey F. Naughton 2003 In this paper, we study whether the need for efficient XML publishing brings any new requirements for relational query engines, or if sorting query results in the relational engine and tagging them in middleware is sufficient. We observe that the mismatch between the XML data model and the relational model requires relational engines to be enhanced for efficiency. Specifically, they need to support relation valued variables. We discuss how such support can be provided through the addition of an operator, GApply, with minimal extensions to existing relational engines. We discuss how the operator may be exposed in SQL syntax and provide a comprehensive study of optimization rules that govern this operator. We report the results of a preliminary performance evaluation showing the speedup obtained through our approach and the effectiveness of our optimization rules. SIGMOD Conference Evaluating Probabilistic Queries over Imprecise Data. Reynold Cheng,Dmitri V. Kalashnikov,Sunil Prabhakar 2003 Many applications employ sensors for monitoring entities such as temperature and wind speed. A centralized database tracks these entities to enable query processing. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), it is often infeasible to store the exact values at all times. A similar situation exists for moving object environments that track the constantly changing locations of objects. In this environment, it is possible for database queries to produce incorrect or invalid results based upon old data. However, if the degree of error (or uncertainty) between the actual value and the database value is controlled, one can place more confidence in the answers to queries. More generally, query answers can be augmented with probabilistic estimates of the validity of the answers. In this paper we study probabilistic query evaluation based upon uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies. SIGMOD Conference LockX: A System for Efficiently Querying Secure XML. SungRan Cho,Sihem Amer-Yahia,Laks V. S. Lakshmanan,Divesh Srivastava 2003 LockX: A System for Efficiently Querying Secure XML. SIGMOD Conference Spectral Bloom Filters. 
Saar Cohen,Yossi Matias 2003 A Bloom Filter is a space-efficient randomized data structure allowing membership queries over sets with certain allowable errors. It is widely used in many applications which take advantage of its ability to compactly represent a set, and filter out effectively any element that does not belong to the set, with small error probability. This paper introduces the Spectral Bloom Filter (SBF), an extension of the original Bloom Filter to multi-sets, allowing the filtering of elements whose multiplicities are below a threshold given at query time. Using memory only slightly larger than that of the original Bloom Filter, the SBF supports queries on the multiplicities of individual keys with a guaranteed, small error probability. The SBF also supports insertions and deletions over the data set. We present novel methods for reducing the probability and magnitude of errors. We also present an efficient data structure and algorithms to build it incrementally and maintain it over streaming data, as well as over materialized data with arbitrary insertions and deletions. The SBF does not assume any a priori filtering threshold and effectively and efficiently maintains information over the entire data-set, allowing for ad-hoc queries with arbitrary parameters and enabling a range of new applications. SIGMOD Conference Data Management Challenges in CRM. George Colliat 2003 Data Management Challenges in CRM. SIGMOD Conference Gigascope: A Stream Database for Network Applications. Charles D. Cranor,Theodore Johnson,Oliver Spatscheck,Vladislav Shkapenyuk 2003 We have developed Gigascope, a stream database for network applications including traffic analysis, intrusion detection, router configuration analysis, network research, network monitoring, and performance monitoring and debugging. Gigascope is undergoing installation at many sites within the AT&T network, including at OC48 routers, for detailed monitoring. In this paper we describe our motivation for and constraints in developing Gigascope, the Gigascope architecture and query language, and performance issues. We conclude with a discussion of stream database research problems we have found in our application. SIGMOD Conference Contorting High Dimensional Data for Efficient Main Memory Processing. Bin Cui,Beng Chin Ooi,Jianwen Su,Kian-Lee Tan 2003 Contorting High Dimensional Data for Efficient Main Memory Processing. SIGMOD Conference Approximate Join Processing Over Data Streams. Abhinandan Das,Johannes Gehrke,Mirek Riedewald 2003 We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We first discuss alternate architectural models for data stream join processing, and we survey suitable measures for the quality of an approximation of a set-valued query result. We then consider the number of generated result tuples as the quality measure, and we give optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data we show the efficacy of our solutions. For applications with demand for exact results we introduce a new Archive-metric which captures the amount of work needed to complete the join in case the streams are archived for later processing. SIGMOD Conference A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding. David DeHaan,David Toman,Mariano P. 
Consens,M. Tamer Özsu 2003 The W3C XQuery language recommendation, based on a hierarchical and ordered document model, supports a wide variety of constructs and use cases. There is a diversity of approaches and strategies for evaluating XQuery expressions, in many cases only dealing with limited subsets of the language. In this paper we describe an implementation approach that handles XQuery with arbitrarily-nested FLWR expressions, element constructors and built-in functions (including structural comparisons). Our proposal maps an XQuery expression to a single equivalent SQL query using a novel dynamic interval encoding of a collection of XML documents as relations, augmented with information tied to the query evaluation environment. The dynamic interval technique enables (suitably enhanced) relational engines to produce predictably good query plans that do not preclude the use of sort-merge join query operators. The benefits are realized despite the challenges presented by intermediate results that create arbitrary documents and the need to preserve document order as prescribed by semantics of XQuery. Finally, our experimental results demonstrate that (native or relational) XML systems can benefit from the above technique to avoid a quadratic scale up penalty that effectively prevents the evaluation of nested FLWR expressions for large documents. SIGMOD Conference Extended Wavelets for Multiple Measures. Antonios Deligiannakis,Nick Roussopoulos 2003 While work in recent years has demonstrated that wavelets can be efficiently used to compress large quantities of data and provide fast and fairly accurate answers to queries, little emphasis has been placed on using wavelets in approximating datasets containing multiple measures. Existing decomposition approaches will either operate on each measure individually, or treat all measures as a vector of values and process them simultaneously. We show in this paper that the resulting individual or combined storage approaches for the wavelet coefficients of different measures that stem from these existing algorithms may lead to suboptimal storage utilization, which results to reduced accuracy to queries. To alleviate this problem, we introduce in this work the notion of an extended wavelet coefficient as a flexible storage method for the wavelet coefficients, and propose novel algorithms for selecting which extended wavelet coefficients to retain under a given storage constraint. Experimental results with both real and synthetic datasets demonstrate that our approach achieves improved accuracy to queries when compared to existing techniques. SIGMOD Conference Cache-and-Query for Wide Area Sensor Databases. Amol Deshpande,Suman Nath,Phillip B. Gibbons,Srinivasan Seshan 2003 Webcams, microphones, pressure gauges and other sensors provide exciting new opportunities for querying and monitoring the physical world. In this paper we focus on querying wide area sensor databases, containing (XML) data derived from sensors spread over tens to thousands of miles. We present the first scalable system for executing XPATH queries on such databases. The system maintains the logical view of the data as a single XML document, while physically the data is fragmented across any number of host nodes. For scalability, sensor data is stored close to the sensors, but can be cached elsewhere as dictated by the queries. 
Our design enables self starting distributed queries that jump directly to the lowest common ancestor of the query result, dramatically reducing query response times. We present a novel query-evaluate gather technique (using XSLT) for detecting (1) which data in a local database fragment is part of the query result, and (2) how to gather the missing parts. We define partitioning and cache invariants that ensure that even partial matches on cached data are exploited and that correct answers are returned, despite our dynamic query-driven caching. Experimental results demonstrate that our techniques dramatically increase query throughputs and decrease query response times in wide area sensor databases. SIGMOD Conference IrisNet: Internet-scale Resource-Intensive Sensor Services. Amol Deshpande,Suman Nath,Phillip B. Gibbons,Srinivasan Seshan 2003 IrisNet: Internet-scale Resource-Intensive Sensor Services. SIGMOD Conference Temporal Coalescing with Now, Granularity, and Incomplete Information. Curtis E. Dyreson 2003 This paper presents a novel strategy for temporal coalescing. Temporal coalescing merges the temporal extents of value-equivalent tuples. A temporal extent is usually coalesced offline and stored since coalescing is an expensive operation. But the temporal extent of a tuple with now, times at different granularities, or incomplete times cannot be determined until query evaluation. This paper presents a strategy to partially coalesce temporal extents by identifying regions that are potentially covered. The covered regions can be used to evaluate temporal predicates and constructors on the coalesced extent. Our strategy uses standard relational database technology. We quantify the cost using the Oracle DBMS. SIGMOD Conference Efficient similarity search and classification via rank aggregation. Ronald Fagin,Ravi Kumar,D. Sivakumar 2003 "We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent ""voters"" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. 
Also, we extend our approach to deal with the k nearest neighbors.We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed---similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency." SIGMOD Conference Processing Set Expressions over Continuous Update Streams. Sumit Ganguly,Minos N. Garofalakis,Rajeev Rastogi 2003 "There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask ""what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?"". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed ""2-level hash sketch"". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach." SIGMOD Conference The Lowell Report. Jim Gray,Hans-Jörg Schek,Michael Stonebraker,Jeffrey D. Ullman 2003 The Lowell Report. SIGMOD Conference Integration of Electronic Tickets and Personal Guide System for Public Transport using Mobile Terminals. Koichi Goto,Yahiko Kambayashi 2003 "We have been developing a mobile passenger guide system for public transports. Passengers can make their travel plans and purchase necessary electronic tickets using mobile terminals via Internet. During the travel, the mobile terminal, which also works as an electronic ticket, compares the stored travel plan with the passenger's actual activities and offers appropriate guide messages. To execute this task, the mobile terminal collects various kinds of information about the travel fields (routes, fares, area maps, station maps, operation schedule, timetables, facilities of stations and vehicles etc.) using multi-channel data communications. 
The mobile terminal contains a personal database for the passenger by selecting and integrating necessary data according to the user's situation and characteristics." SIGMOD Conference XRANK: Ranked Keyword Search over XML Documents. Lin Guo,Feng Shao,Chavdar Botev,Jayavel Shanmugasundaram 2003 We consider the problem of efficiently producing ranked results for keyword search queries over hyperlinked XML documents. Evaluating keyword search queries over hierarchical XML documents, as opposed to (conceptually) flat HTML documents, introduces many new challenges. First, XML keyword search queries do not always return entire documents, but can return deeply nested XML elements that contain the desired keywords. Second, the nested structure of XML implies that the notion of ranking is no longer at the granularity of a document, but at the granularity of an XML element. Finally, the notion of keyword proximity is more complex in the hierarchical XML data model. In this paper, we present the XRANK system that is designed to handle these novel features of XML keyword search. Our experimental results show that XRANK offers both space and performance benefits when compared with existing approaches. An interesting feature of XRANK is that it naturally generalizes a hyperlink based HTML search engine such as Google. XRANK can thus be used to query a mix of HTML and XML documents. SIGMOD Conference BIRN-M: A Semantic Mediator for Solving Real-World Neuroscience Problems. Amarnath Gupta,Bertram Ludäscher,Maryann E. Martone 2003 BIRN-M: A Semantic Mediator for Solving Real-World Neuroscience Problems. SIGMOD Conference Stream Processing of XPath Queries with Predicates. Ashish Kumar Gupta,Dan Suciu 2003 We consider the problem of evaluating large numbers of XPath filters, each with many predicates, on a stream of XML documents. The solution we propose is to lazily construct a single deterministic pushdown automata, called the XPush Machine from the given XPath filters. We describe a number of optimization techniques to make the lazy XPush machine more efficient, both in terms of space and time. The combination of these optimizations results in high, sustained throughput. For example, if the total number of atomic predicates in the filters is up to 200000, then the throughput is at least 0.5 MB/sec: it increases to 4.5 MB/sec when each filter contains a single predicate. SIGMOD Conference Estimating Compilation Time of a Query Optimizer. Ihab F. Ilyas,Jun Rao,Guy M. Lohman,Dengfeng Gao,Eileen Tien Lin 2003 "A query optimizer compares alternative plans in its search space to find the best plan for a given query. Depending on the search space and the enumeration algorithm, optimizers vary in their compilation time and the quality of the execution plan they can generate. This paper describes a compilation time estimator that provides a quantified estimate of the optimizer compilation time for a given query. Such an estimator is useful for automatically choosing the right level of optimization in commercial database systems.
In addition, compilation time estimates can be quite helpful for mid-query reoptimization, for monitoring the progress of workload analysis tools where a large number of queries need to be compiled (but not executed), and for judicious design and tuning of an optimizer. Previous attempts to estimate optimizer compilation complexity used the number of possible binary joins as the metric and overlooked the fact that each join often translates into a different number of join plans because of the presence of ""physical"" properties. We use the number of plans (instead of joins) to estimate query compilation time, and employ two novel ideas: (1) reusing an optimizer's join enumerator to obtain actual number of joins, but bypassing plan generation to save estimation overhead; (2) maintaining a small number of ""interesting"" properties to facilitate plan counting. We prototyped our approach in a commercial database system and our experimental results show that we can achieve good compilation time estimates (less than 30% error, on average) for complex real queries, using a small fraction (within 3%) of the actual compilation time." SIGMOD Conference Data Grid Management Systems. Arun Jagatheesan,Arcot Rajasekar 2003 Data Grids are being built across the world as the next generation data handling systems to manage peta-bytes of inter-organizational data and storage space. A data grid (datagrid) is a logical name space consisting of storage resources and digital entities that is created by the cooperation of autonomous organizations and its users based on the coordination of local and global policies. Data Grid Management Systems (DGMSs) provide services for the confluence of organizations and management of inter-organizational data and resources in the datagrid. The objective of the tutorial is to provide an introduction to the opportunities and challenges of this emerging technology. Novices and experts would benefit from this tutorial. The tutorial would cover introduction, use cases, design philosophies, architecture, research issues, existing technologies and demonstrations. Hands-on sessions for the participants to use and feel the existing technologies could be provided based on the availability of internet connections. SIGMOD Conference Data Quality and Data Cleaning: An Overview. Theodore Johnson,Tamraparni Dasu 2003 "Data quality is a serious concern in any data-driven enterprise, often creating misleading findings during data mining, and causing process disruptions in operational databases. The manifestations of data quality problems can be very expensive: ""losing"" customers, ""misplacing"" billions of dollars worth of equipment, misallocated resources due to glitched forecasts, and so on. Solving data quality problems typically requires a very large investment of time and energy -- often 80% to 90% of a data analysis project is spent in making the data reliable enough that the results can be trusted. In this tutorial, we present a multidisciplinary approach to data quality problems. We start by discussing the meaning of data quality and the sources of data quality problems. We show how these problems can be addressed by a multidisciplinary approach, combining techniques from management science, statistics, database research, and metadata management. Next, we present an updated definition of data quality metrics, and illustrate their application with a case study.
We conclude with a survey of recent database research that is relevant to data quality problems, and suggest directions for future research." SIGMOD Conference On Schema Matching with Opaque Column Names and Data Values. Jaewoo Kang,Jeffrey F. Naughton 2003 "Most previous solutions to the schema matching problem rely in some fashion upon identifying ""similar"" column names in the schemas to be matched, or by recognizing common domains in the data stored in the schemas. While each of these approaches is valuable in many cases, they are not infallible, and there exist instances of the schema matching problem for which they do not even apply. Such problem instances typically arise when the column names in the schemas and the data in the columns are ""opaque"" or very difficult to interpret. In this paper we propose a two-step technique that works even in the presence of opaque column names and data values. In the first step, we measure the pair-wise attribute correlations in the tables to be matched and construct a dependency graph using mutual information as a measure of the dependency between attributes. In the second stage, we find matching node pairs in the dependency graphs by running a graph matching algorithm. We validate our approach with an experimental study, the results of which suggest that such an approach can be a useful addition to a set of (semi) automatic schema matching techniques." SIGMOD Conference Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues. Anastasios Kementsietsidis,Marcelo Arenas,Renée J. Miller 2003 We consider the problem of mapping data in peer-to-peer data-sharing systems. Such systems often rely on the use of mapping tables listing pairs of corresponding values to search for data residing in different peers. In this paper, we address semantic and algorithmic issues related to the use of mapping tables. We begin by arguing why mapping tables are appropriate for data mapping in a peer-to-peer environment. We discuss alternative semantics for these tables and we present a language that allows the user to specify mapping tables under different semantics. Then, we show that by treating mapping tables as constraints (called mapping constraints) on the exchange of information between peers it is possible to reason about them. We motivate why reasoning capabilities are needed to manage mapping tables and show the importance of inferring new mapping tables from existing ones. We study the complexity of this problem and we propose an efficient algorithm for its solution. Finally, we present an implementation along with experimental results that show that mapping tables may be managed efficiently in practice. SIGMOD Conference Qcluster: Relevance Feedback Using Adaptive Clustering for Content-Based Image Retrieval. Deok-Hwan Kim,Chin-Wan Chung 2003 "The learning-enhanced relevance feedback has been one of the most active research areas in content-based image retrieval in recent years. However, few methods using the relevance feedback are currently available to process relatively complex queries on large image databases. In the case of complex image queries, the feature space and the distance function of the user's perception are usually different from those of the system. This difference leads to the representation of a query with multiple clusters (i.e., regions) in the feature space. 
Therefore, it is necessary to handle disjunctive queries in the feature space.In this paper, we propose a new content-based image retrieval method using adaptive classification and cluster-merging to find multiple clusters of a complex image query. When the measures of a retrieval method are invariant under linear transformations, the method can achieve the same retrieval quality regardless of the shapes of clusters of a query. Our method achieves the same high retrieval quality regardless of the shapes of clusters of a query since it uses such measures. Extensive experiments show that the result of our method converges to the user's true information need fast, and the retrieval quality of our method is about 22% in recall and 20% in precision better than that of the query expansion approach, and about 34% in recall and about 33% in precision better than that of the query point movement approach, in MARS." SIGMOD Conference IPSOFACTO: A Visual Correlation Tool for Aggregate Network Traffic Data. Flip Korn,S. Muthukrishnan,Yunyue Zhu 2003 IP network operators collect aggregate traffic statistics on network interfaces via the Simple Network Management Protocol (SNMP). This is part of routine network operations for most ISPs; it involves a large infrastructure with multiple network management stations polling information from all the network elements and collating a real time data feed. This demo will present a tool that manages the live SNMP data feed on a fully operational large ISP at industry scale. The tool primarily serves to study correlations in the network traffic, by providing a rich mix of ad-hoc querying based on a user-friendly correlation interface and as well as canned queries, based on the expertise of the network operators with field experience. The tool is called IPSOFACTO for IP Stream-Oriented FAst Correlation TOol. SIGMOD Conference Panel: Querying Networked Databases. Nick Koudas,Divesh Srivastava 2003 Panel: Querying Networked Databases. SIGMOD Conference Using Sets of Feature Vectors for Similarity Search on Voxelized CAD Objects. Hans-Peter Kriegel,Stefan Brecheisen,Peer Kröger,Martin Pfeifle,Matthias Schubert 2003 In modern application domains such as multimedia, molecular biology and medical imaging, similarity search in database systems is becoming an increasingly important task. Especially for CAD applications, suitable similarity models can help to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. Most of the existing similarity models are based on feature vectors. In this paper, we shortly review three models which pursue this paradigm. Based on the most promising of these three models, we explain how sets of feature vectors can be used for more effective and still efficient similarity search. We first introduce an intuitive distance measure on sets of feature vectors together with an algorithm for its efficient computation. Furthermore, we present a method for accelerating the processing of similarity queries on vector set data. The experimental evaluation is based on two real world test data sets and points out that our new similarity approach yields more meaningful results in comparatively short time. SIGMOD Conference Oracle XML DB Repository. Viswanathan Krishnamurthy 2003 Oracle XML DB Repository. SIGMOD Conference QC-Trees: An Efficient Summary Structure for Semantic OLAP. Laks V. S. 
Lakshmanan,Jian Pei,Yan Zhao 2003 Recently, a technique called quotient cube was proposed as a summary structure for a data cube that preserves its semantics, with applications for online exploration and visualization. The authors showed that a quotient cube can be constructed very efficiently and it leads to a significant reduction in the cube size. While it is an interesting proposal, that paper leaves many issues unaddressed. Firstly, a direct representation of a quotient cube is not as compact as possible and thus still wastes space. Secondly, while a quotient cube can in principle be used for answering queries, no specific algorithms were given in the paper. Thirdly, maintaining any summary structure incrementally against updates is an important task, a topic not addressed there. In this paper, we propose an efficient data structure called QC-tree and an efficient algorithm for directly constructing it from a base table, solving the first problem. We give efficient algorithms that address the remaining questions. We report results from an extensive performance study that illustrate the space and time savings achieved by our algorithms over previous ones (wherever they exist). SIGMOD Conference SOCQET: Semantic OLAP with Compressed Cube and Summarization. Laks V. S. Lakshmanan,Jian Pei,Yan Zhao 2003 SOCQET: Semantic OLAP with Compressed Cube and Summarization. SIGMOD Conference Transparent Mid-Tier Database Caching in SQL Server. Per-Åke Larson,Jonathan Goldstein,Jingren Zhou 2003 Transparent Mid-Tier Database Caching in SQL Server. SIGMOD Conference Composing XSL Transformations with XML Publishing Views. Chengkai Li,Philip Bohannon,Henry F. Korth,P. P. S. Narayan 2003 Composing XSL Transformations with XML Publishing Views. SIGMOD Conference A Theory of Redo Recovery. David B. Lomet,Mark R. Tuttle 2003 Our goal is to understand redo recovery. We define an installation graph of operations in an execution, an ordering significantly weaker than conflict ordering from concurrency control. The installation graph explains recoverable system state in terms of which operations are considered installed. This explanation and the set of operations replayed during recovery form an invariant that is the contract between normal operation and recovery. It prescribes how to coordinate changes to system components such as the state, the log, and the cache. We also describe how widely used recovery techniques are modeled in our theory, and why they succeed in providing redo recovery. SIGMOD Conference GridDB: A Database Interface to the Grid. David T. Liu,Michael J. Franklin,Devesh Parekh 2003 GridDB: A Database Interface to the Grid. SIGMOD Conference The Design of an Acquisitional Query Processor For Sensor Networks. Samuel Madden,Michael J. Franklin,Joseph M. Hellerstein,Wei Hong 2003 We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. 
We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices. SIGMOD Conference Efficient Processing of Joins on Set-valued Attributes. Nikos Mamoulis 2003 Object-oriented and object-relational DBMS support set valued attributes, which are a natural and concise way to model complex information. However, there has been limited research to-date on the evaluation of query operators that apply on sets. In this paper we study the join of two relations on their set-valued attributes. Various join types are considered, namely the set containment, set equality, and set overlap joins. We show that the inverted file, a powerful index for selection queries, can also facilitate the efficient evaluation of most join predicates. We propose join algorithms that utilize inverted files and compare them with signature-based methods for several set-comparison predicates. SIGMOD Conference Rondo: A Programming Platform for Generic Model Management. Sergey Melnik,Erhard Rahm,Philip A. Bernstein 2003 Model management aims at reducing the amount of programming needed for the development of metadata-intensive applications. We present a first complete prototype of a generic model management system, in which high-level operators are used to manipulate models and mappings between models. We define the key conceptual structures: models, morphisms, and selectors, and describe their use and implementation. We specify the semantics of the known model-management operators applied to these structures, suggest new ones, and develop new algorithms for implementing the individual operators. We examine the solutions for two model-management tasks that involve manipulations of relational schemas, XML schemas, and SQL views. SIGMOD Conference Exchanging Intensional XML Data. Tova Milo,Serge Abiteboul,Bernd Amann,Omar Benjelloun,Frederic Dang Ngoc 2003 XML is becoming the universal format for data exchange between applications. Recently, the emergence of Web services as standard means of publishing and accessing data on the Web introduced a new class of XML documents, which we call intensional documents. These are XML documents where some of the data is given explicitly while other parts are defined only intensionally by means of embedded calls to Web services.When such documents are exchanged between applications, one has the choice of whether or not to materialize the intensional data (i.e., to invoke the embedded calls) before the document is sent. This choice may be influenced by various parameters, such as performance and security considerations. This article addresses the problem of guiding this materialization process.We argue that---like for regular XML data---schemas (à la DTD and XML Schema) can be used to control the exchange of intensional data and, in particular, to determine which data should be materialized before sending a document, and which should not. We formalize the problem and provide algorithms to solve it. We also present an implementation that complies with real-life standards for XML data, schemas, and Web services, and is used in the Active XML system. We illustrate the usefulness of this approach through a real-life application for peer-to-peer news exchange. SIGMOD Conference XPRESS: A Queriable Compression for XML Data. Jun-Ki Min,Myung-Jae Park,Chin-Wan Chung 2003 Like HTML, many XML documents are resident on native file systems. 
Since XML data is irregular and verbose, the disk space and the network bandwidth are wasted. To overcome the verbosity problem, the research on compressors for XML data has been conducted. However, some XML compressors do not support querying compressed data, while other XML compressors which support querying compressed data blindly encode tags and data values using predefined encoding methods. Thus, the query performance on compressed XML data is degraded.In this paper, we propose XPRESS, an XML compressor which supports direct and efficient evaluations of queries on compressed XML data. XPRESS adopts a novel encoding method, called reverse arithmetic encoding, which is intended for encoding label paths of XML data, and applies diverse encoding methods depending on the types of data values. Experimental results with real life data sets show that XPRESS achieves significant improvements on query performance for compressed XML data and reasonable compression ratios. On the average, the query performance of XPRESS is 2.83 times better than that of an existing XML compressor and the compression ratio of XPRESS is 73%. SIGMOD Conference Adaptive Filters for Continuous Queries over Distributed Data Streams. Chris Olston,Jing Jiang,Jennifer Widom 2003 We consider an environment where distributed data sources continuously stream updates to a centralized processor that monitors continuous queries over the distributed data. Significant communication overhead is incurred in the presence of rapid update streams, and we propose a new technique for reducing the overhead. Users register continuous queries with precision requirements at the central stream processor, which installs filters at remote data sources. The filters adapt to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times. Our approach enables applications to trade precision for communication overhead at a fine granularity by individually adjusting the precision constraints of continuous queries over streams in a multi-query workload. Through experiments performed on synthetic data simulations and a real network monitoring implementation, we demonstrate the effectiveness of our approach in achieving low communication overhead compared with alternate approaches. SIGMOD Conference PeerDB: Peering into Personal Databases. Beng Chin Ooi,Kian-Lee Tan,Aoying Zhou,Chin Hong Goh,Yingguang Li,Chu Yee Liau,Bo Ling,Wee Siong Ng,Yanfeng Shu,Xiaoyu Wang,Ming Zhang 2003 PeerDB: Peering into Personal Databases. SIGMOD Conference Multi-Dimensional Clustering: A New Data Layout Scheme in DB2. Sriram Padmanabhan,Bishwaranjan Bhattacharjee,Timothy Malkemus,Leslie Cranston,Matthew Huras 2003 We describe the design and implementation of a new data layout scheme, called multi-dimensional clustering, in DB2 Universal Database Version 8. Many applications, e.g., OLAP and data warehousing, process a table or tables in a database using a multi-dimensional access paradigm. Currently, most database systems can only support organization of a table using a primary clustering index. Secondary indexes are created to access the tables when the primary key index is not applicable. Unfortunately, secondary indexes perform many random I/O accesses against the table for a simple operation such as a range query. Our work in multi-dimensional clustering addresses this important deficiency in database systems. 
Multi-Dimensional Clustering is based on the definition of one or more orthogonal clustering attributes (or expressions) of a table. The table is organized physically by associating records with similar values for the dimension attributes in a cluster. We describe novel techniques for maintaining this physical layout efficiently and methods of processing database operations that provide significant performance improvements. We show results from experiments using a star-schema database to validate our claims of performance with minimal overhead. SIGMOD Conference An Optimal and Progressive Algorithm for Skyline Queries. Dimitris Papadias,Yufei Tao,Greg Fu,Bernhard Seeger 2003 The skyline of a set of d-dimensional points contains the points that are not dominated by any other point on all dimensions. Skyline computation has recently received considerable attention in the database community, especially for progressive (or online) algorithms that can quickly return the first skyline points without having to read the entire data file. Currently, the most efficient algorithm is NN (nearest neighbors), which applies the divide-and-conquer framework on datasets indexed by R-trees. Although NN has some desirable features (such as high speed for returning the initial skyline points, applicability to arbitrary data distributions and dimensions), it also presents several inherent disadvantages (need for duplicate elimination if d>2, multiple accesses of the same node, large space overhead). In this paper we develop BBS (branch-and-bound skyline), a progressive algorithm also based on nearest neighbor search, which is IO optimal, i.e., it performs a single access only to those R-tree nodes that may contain skyline points. Furthermore, it does not retrieve duplicates and its space overhead is significantly smaller than that of NN. Finally, BBS is simple to implement and can be efficiently applied to a variety of alternative skyline queries. An analytical and experimental comparison shows that BBS outperforms NN (usually by orders of magnitude) under all problem instances. SIGMOD Conference TIMBER: A Native System for Querying XML. Stelios Paparizos,Shurug Al-Khalifa,Adriane Chapman,H. V. Jagadish,Laks V. S. Lakshmanan,Andrew Nierman,Jignesh M. Patel,Divesh Srivastava,Nuwee Wiwatwattana,Yuqing Wu,Cong Yu 2003 XML has become ubiquitous, and XML data has to be managed in databases. The current industry standard is to map XML data into relational tables and store this information in a relational database. Such mappings create both expressive power problems and performance problems. In the TIMBER [7] project we are exploring the issues involved in storing XML in native format. We believe that the key intellectual contribution of this system is a comprehensive set-at-a-time query processing ability in a native XML store, with all the standard components of relational query processing, including algebraic rewriting and a cost-based optimizer. SIGMOD Conference D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data. Chen Qun,Andrew Lim,Kian Win Ong 2003 D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data. SIGMOD Conference Extracting and Exploiting Structure in Text Search. Prabhakar Raghavan 2003 Extracting and Exploiting Structure in Text Search. SIGMOD Conference XPath Queries on Streaming Data. Feng Peng,Sudarshan S. Chawathe 2003 We present the design and implementation of the XSQ system for querying streaming XML data using XPath 1.0.
Using a clean design based on a hierarchical arrangement of pushdown transducers augmented with buffers, XSQ supports features such as multiple predicates, closures, and aggregation. XSQ not only provides high throughput, but is also memory efficient: It buffers only data that must be buffered by any streaming XPath processor. We also present an empirical study of the performance characteristics of XPath features, as embodied by XSQ and several other systems. SIGMOD Conference Oracle RAC: Architecture and Performance. Angelo Pruscino 2003 Oracle RAC: Architecture and Performance. SIGMOD Conference A Characterization of the Sensitivity of Query Optimization to Storage Access Cost Parameters. Frederick Reiss,Tapas Kanungo 2003 Most relational query optimizers make use of information about the costs of accessing tuples and data structures on various storage devices. This information can at times be off by several orders of magnitude due to human error in configuration setup, sudden changes in load, or hardware failure. In this paper, we attempt to answer the following questions: • Are inaccurate access cost estimates likely to cause a typical query optimizer to choose a suboptimal query plan? • If an optimizer chooses a suboptimal plan as a result of inaccurate access cost estimates, how far from optimal is this plan likely to be? To address these issues, we provide a theoretical, vector-based framework for analyzing the costs of query plans under various storage parameter costs. We then use this geometric framework to characterize experimentally a commercial query optimizer. We develop algorithms for extracting detailed information about query plans through narrow optimizer interfaces, and we perform the characterization using database statistics from a published run of the TPC-H benchmark and a wide range of storage parameters. We show that, when data structures such as tables, indexes, and sorted runs reside on different storage devices, the optimizer can derive significant benefits from having accurate and timely information regarding the cost of accessing storage devices. SIGMOD Conference Winnowing: Local Algorithms for Document Fingerprinting. Saul Schleimer,Daniel Shawcross Wilkerson,Alexander Aiken 2003 "Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service." SIGMOD Conference PLASTIC: Reducing Query Optimization Overheads through Plan Recycling. Vibhuti S. Sengar,Jayant R. Haritsa 2003 PLASTIC: Reducing Query Optimization Overheads through Plan Recycling. SIGMOD Conference Building Notification Services with Microsoft SQLServer. Praveen Seshadri 2003 Building Notification Services with Microsoft SQLServer. SIGMOD Conference CMVF: A Novel Dimension Reduction Scheme for Efficient Indexing in A Large Image Database. Jialie Shen,Anne H. H. Ngu,John Shepherd,Du Q. Huynh,Quan Z.
Sheng 2003 CMVF: A Novel Dimension Reduction Scheme for Efficient Indexing in A Large Image Database. SIGMOD Conference Google in a Box - Building the Google Search Appliance. Narayanan Shivakumar 2003 Google in a Box - Building the Google Search Appliance. SIGMOD Conference Rights Protection for Relational Data. Radu Sion,Mikhail J. Atallah,Sunil Prabhakar 2003 In this paper, we introduce a solution for relational database content rights protection through watermarking. Rights protection for relational data is of ever-increasing interest, especially considering areas where sensitive, valuable content is to be outsourced. A good example is a data mining application, where data is sold in pieces to parties specialized in mining it. Different avenues are available, each with its own advantages and drawbacks. Enforcement by legal means is usually ineffective in preventing theft of copyrighted works, unless augmented by a digital counterpart, for example, watermarking. While being able to handle higher level semantic constraints, such as classification preservation, our solution also addresses important attacks, such as subset selection and random and linear data changes. We introduce wmdb.*, a proof-of-concept implementation and its application to real-life data, namely, in watermarking the outsourced Wal-Mart sales data that we have available at our institute. SIGMOD Conference Scientific Data Repositories: Designing for a Moving Target. Etzard Stolte,Christoph von Praun,Gustavo Alonso,Thomas R. Gross 2003 Managing scientific data warehouses requires constant adaptations to cope with changes in processing algorithms, computing environments, database schemas, and usage patterns. We have faced this challenge in the RHESSI Experimental Data Center (HEDC), a datacenter for the RHESSI NASA spacecraft. In this paper we describe our experience in developing HEDC and discuss in detail the design choices made. To successfully accommodate typical adaptations encountered in scientific data management systems, HEDC (i) clearly separates generic from domain specific code in all tiers, (ii) uses a file system for the actual data in combination with a DBMS to manage the corresponding meta data, and (iii) revolves around a middle tier designed to scale if more browsing or processing power is required. These design choices are valuable contributions as they address common concerns in a wide range of scientific data management systems. SIGMOD Conference Visionary: A Next Generation Visualization System for Databases. Michael Stonebraker 2003 Visionary: A Next Generation Visualization System for Databases. SIGMOD Conference Hardware Acceleration for Spatial Selections and Joins. Chengyu Sun,Divyakant Agrawal,Amr El Abbadi 2003 Spatial database operations are typically performed in two steps. In the filtering step, indexes and the minimum bounding rectangles (MBRs) of the objects are used to quickly determine a set of candidate objects, and in the refinement step, the actual geometries of the objects are retrieved and compared to the query geometry or each other. Because of the complexity of the computational geometry algorithms involved, the CPU cost of the refinement step is usually the dominant cost of the operation for complex geometries such as polygons. In this paper, we propose a novel approach to address this problem using efficient rendering and searching capabilities of modern graphics hardware. 
This approach does not require expensive pre-processing of the data or changes to existing storage and index structures, and it applies to both intersection and distance predicates. Our experiments with real world datasets show that by combining hardware and software methods, the overall computational cost can be reduced substantially for both spatial selections and joins. SIGMOD Conference Improving the Efficiency of Database-System Teaching. Jeffrey D. Ullman 2003 "The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas carry over to other programming-oriented courses, and many of them apply to any academic subject whatsoever. After proposing a number of things that could be done, I concentrate here on a system under development, called OTC (On-line Testing Center), and on its methodology of "root questions." These questions encourage students to do homework of the long-answer type, yet we can have their work checked and graded automatically by a simple multiple-choice-question grader. OTC also offers some improvement in the way we handle SQL homework, and could be used with other languages as well." SIGMOD Conference Containment Join Size Estimation: Models and Methods. Wei Wang,Haifeng Jiang,Hongjun Lu,Jeffrey Xu Yu 2003 Recent years have witnessed an increasing interest in research on XML, partly due to the fact that XML has now become the de facto standard for data interchange over the internet. A large amount of work has been reported on XML storage models and query processing techniques. However, few works have addressed issues of XML query optimization. In this paper, we report our study on one of the challenges in XML query optimization: containment join size estimation. Containment join is well accepted as an important operation in XML query processing. Estimating the size of its results is no doubt essential to generate efficient XML query processing plans. We propose two models, the interval model and the position model, and a set of estimation methods based on these two models. Comprehensive performance studies were conducted. The results not only demonstrate the advantages of our new algorithms over existing algorithms, but also provide valuable insights into the tradeoff among various parameters. SIGMOD Conference ViST: A Dynamic Index Method for Querying XML Data by Tree Structures. Haixun Wang,Sanghyun Park,Wei Fan,Philip S. Yu 2003 With the growing importance of XML in data exchange, much research has been done in providing flexible query facilities to extract data from structured XML documents. In this paper, we propose ViST, a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, we show that querying XML data is equivalent to finding subsequence matches. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure.
ViST supports dynamic index update, and it relies solely on B+ Trees without using any specialized data structures that are not well supported by DBMSs. Our experiments show that ViST is effective, scalable, and efficient in supporting structural queries. SIGMOD Conference Spreadsheets in RDBMS for OLAP. Andrew Witkowski,Srikanth Bellamkonda,Tolga Bozkaya,Gregory Dorman,Nathan Folkert,Abhinav Gupta,Lei Sheng,Sankar Subramanian 2003 One of the critical deficiencies of SQL is lack of support for n-dimensional array-based computations which are frequent in OLAP environments. Relational OLAP (ROLAP) applications have to emulate them using joins, recently introduced SQL Window Functions [18] and complex and inefficient CASE expressions. The designated place in SQL for specifying calculations is the SELECT clause, which is extremely limiting and forces the user to generate queries using nested views, subqueries and complex joins. Furthermore, the SQL query optimizer is preoccupied with determining efficient join orders and choosing optimal access methods, and largely disregards optimization of complex numerical formulas. Execution methods have concentrated on efficient computation of a cube [11], [16] rather than on random access structures for inter-row calculations. This has created a gap that has been filled by spreadsheets and specialized MOLAP engines, which are good at formulas for mathematical modeling but lack the formalism of the relational model, are difficult to manage, and exhibit scalability problems. This paper presents SQL extensions involving array based calculations for complex modeling. In addition, we present optimizations, access structures and execution models for processing them efficiently. SIGMOD Conference NonStop SQL/MX Publish/Subscribe: Continuous Data Streams in Transaction Processing. Hansjörg Zeller 2003 NonStop SQL/MX Publish/Subscribe: Continuous Data Streams in Transaction Processing. SIGMOD Conference Rainbow: Multi-XQuery Optimization Using Materialized XML Views. Xin Zhang,Katica Dimitrova,Ling Wang,Maged El-Sayed,Brian Murphy,Bradford Pielech,Mukesh Mulchandani,Luping Ding,Elke A. Rundensteiner 2003 Rainbow: Multi-XQuery Optimization Using Materialized XML Views. SIGMOD Conference Location-based Spatial Queries. Jun Zhang,Manli Zhu,Dimitris Papadias,Yufei Tao,Dik Lun Lee 2003 "In this paper we propose an approach that enables mobile clients to determine the validity of previous queries based on their current locations. In order to make this possible, the server returns, in addition to the query result, a validity region around the client's location within which the result remains the same. We focus on two of the most common spatial query types, namely nearest neighbor and window queries, define the validity region in each case and propose the corresponding query processing algorithms. In addition, we provide analytical models for estimating the expected size of the validity region. Our techniques can significantly reduce the number of queries issued to the server, while introducing minimal computational and network overhead compared to traditional spatial queries." SIGMOD Conference TREX: DTD-Conforming XML to XML Transformations. Aoying Zhou,Qing Wang,Zhimao Guo,Xueqing Gong,Shihui Zheng,Hongwei Wu,Jianchang Xiao,Kun Yue,Wenfei Fan 2003 TREX: DTD-Conforming XML to XML Transformations. SIGMOD Conference Warping Indexes with Envelope Transforms for Query by Humming.
Yunyue Zhu,Dennis Shasha 2003 A Query by Humming system allows the user to find a song by humming part of the tune. No musical training is needed. Previous query by humming systems have not provided satisfactory results for various reasons. Some systems have low retrieval precision because they rely on melodic contour information from the hum tune, which in turn relies on the error-prone note segmentation process. Some systems yield better precision when matching the melody directly from audio, but they are slow because of their extensive use of Dynamic Time Warping (DTW). Our approach improves both the retrieval precision and speed compared to previous approaches. We treat music as a time series and exploit and improve well-developed techniques from time series databases to index the music for fast similarity queries. We improve on existing DTW indexing techniques by introducing the concept of envelope transforms, which gives a general guideline for extending existing dimensionality reduction methods to DTW indexes. The net result is high scalability. We confirm our claims through extensive experiments. SIGMOD Conference Query by Humming - in Action with its Technology Revealed. Yunyue Zhu,Dennis Shasha,Xiaojian Zhao 2003 Query by Humming - in Action with its Technology Revealed. SIGMOD Conference WinMagic: Subquery Elimination Using Window Aggregation. Calisto Zuzarte,Hamid Pirahesh,Wenbin Ma,Qi Cheng,Linqi Liu,Kwai Wong 2003 "Database queries often take the form of correlated SQL queries. Correlation refers to the use of values from the outer query block to compute the inner subquery. This is a convenient paradigm for SQL programmers and closely mimics a function invocation paradigm in a typical computer programming language. Queries with correlated subqueries are also often created by SQL generators that translate queries from application domain-specific languages into SQL. Another significant class of queries that use this correlated subquery form is that involving temporal databases using SQL. Performance of these queries is an important consideration particularly in large databases. Several proposals to improve the performance of SQL queries containing correlated subqueries can be found in database literature. One of the main ideas in many of these proposals is to suitably decorrelate the subquery internally to avoid a tuple-at-a-time invocation of the subquery. Magic decorrelation is one method that has been successfully used. Another proposal is to cache the portion of the subquery that is invariant with the changing values of the outer query block. What we propose here is a new technique to handle some typical correlated queries. We go a step further than to simply decorrelate the subquery. By making use of extended window aggregation capabilities, we eliminate redundant access to common tables referenced in the outer query block and the subquery. This technique can be exploited even for non-correlated subqueries. It is possible to get a huge boost in performance for queries that can exploit this technique, which we call WinMagic. This technique was implemented in IBM® DB2® Universal Database Version 7 and Version 8. In addition to improving DB2 customer queries that contain aggregation subqueries, it has provided significant improvements in a number of TPC-H benchmarks that IBM has published since late 2001." SIGMOD Conference Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003 Alon Y.
Halevy,Zachary G. Ives,AnHai Doan 2003 Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003 VLDB Improving Performance with Bulk-Inserts in Oracle R-Trees. Ning An,Kothuri Venkata Ravi Kanth,Siva Ravada 2003 Improving Performance with Bulk-Inserts in Oracle R-Trees. VLDB Memory Requirements for Query Execution in Highly Constrained Devices. Nicolas Anciaux,Luc Bouganim,Philippe Pucheral 2003 "Pervasive computing introduces data management requirements that must be tackled in a growing variety of lightweight computing devices. Personal folders on chip, networks of sensors and data hosted by autonomous mobile computers are different illustrations of the need for evaluating queries confined in hardware constrained computing devices. RAM is the most limiting factor in this context. This paper gives a thorough analysis of the RAM consumption problem and makes the following contributions. First, it proposes a query execution model that reaches a lower bound in terms of RAM consumption. Second, it devises a new form of optimization, called iteration filter, that drastically reduces the prohibitive cost incurred by the preceding model, without hurting the RAM lower bound. Third, it analyses how the preceding techniques can benefit from an incremental growth of RAM. This work paves the way for setting up co-design rules helping to calibrate the RAM resource of a hardware platform according to a given application's requirements as well as to adapt an application to an existing hardware platform. To the best of our knowledge, this work is the first attempt to devise co-design rules for data centric embedded applications. We illustrate the effectiveness of our techniques through a performance evaluation." VLDB Xquec: Pushing Queries to Compressed XML Data. "Andrei Arion,Angela Bonifati,Gianni Costa,Sandra D'Aguanno,Ioana Manolescu,Andrea Pugliese" 2003 Xquec: Pushing Queries to Compressed XML Data. VLDB Schema-driven Customization of Web Services. Serge Abiteboul,Bernd Amann,Jérôme Baumgarten,Omar Benjelloun,Frederic Dang Ngoc,Tova Milo 2003 Schema-driven Customization of Web Services. VLDB Managing Distributed Workspaces with Active XML. Serge Abiteboul,Jérôme Baumgarten,Angela Bonifati,Gregory Cobena,Cosmin Cremarenco,Florin Dragan,Ioana Manolescu,Tova Milo,Nicoleta Preda 2003 Managing Distributed Workspaces with Active XML. VLDB A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal,Jiawei Han,Jianyong Wang,Philip S. Yu 2003 The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view.
For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component which periodically stores detailed summary statistics and an offline component which uses only these summary statistics. The offline component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach. VLDB Cache Tables: Paving the Way for an Adaptive Database Cache. Mehmet Altinel,Christof Bornhövd,Sailesh Krishnamurthy,C. Mohan,Hamid Pirahesh,Berthold Reinwald 2003 "We introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit the characteristics of typical transactional web applications with a high volume of short transactions, simple equality predicates, and 3-4 way joins. Based on federated query processing capabilities, we developed a set of new technologies for database caching: cache tables, "Janus" (two-headed) query execution plans, cache constraints, and asynchronous cache population methods. Our solution supports transparent caching both at the edge of content-delivery networks and in the middle-tier of an enterprise application infrastructure, improving the response time, throughput and scalability of transactional web applications." VLDB Phrase Matching in XML. Sihem Amer-Yahia,Mary F. Fernández,Divesh Srivastava,Yu Xu 2003 "Phrase matching is a common IR technique to search text and identify relevant documents in a document collection. Phrase matching in XML presents new challenges as text may be interleaved with arbitrary markup, thwarting search techniques that require strict contiguity or close proximity of keywords. We present a technique for phrase matching in XML that permits dynamic specification of both the phrase to be matched and the markup to be ignored. We develop an effective algorithm for our technique that utilizes inverted indices on phrase words and XML tags. We describe experimental results comparing our algorithm to an indexed-nested loop algorithm that illustrate our algorithm's efficiency." VLDB A System for Keyword Proximity Search on XML Databases.
Andrey Balmin,Vagelis Hristidis,Nick Koudas,Yannis Papakonstantinou,Divesh Srivastava,Tianqiu Wang 2003 A System for Keyword Proximity Search on XML Databases. VLDB Large-Scale, Standards-Based Earth Observation Imagery and Web Mapping Services. Peter Baumann 2003 "Earth observation (EO) and simulation data share some core characteristics: they resemble raster data of some spatio-temporal dimensionality; the complete objects are extremely large, well into Tera- and Petabyte volumes; data generation and retrieval follow very different access patterns. EO time series additionally share that acquisition/generation happens in time slices. The central standardization body for geo service interfaces is the Open GIS Consortium (OGC). Earlier OGC has issued the Web Map Service (WMS) Interface Specification which addresses 2-D (raster and vector) maps. This year, the Web Coverage Service (WCS) Specification has been added with specific focus on 2-D and 3-D rasters ("coverages"). In this paper we present operational applications offering WMS/WCS services: a 2-D ortho photo maintained by the Bavarian Mapping Agency and a 3-D satellite time series deployed by the German Aerospace Association. All are based on the rasdaman array middleware which extends relational DBMSs with storage and retrieval capabilities for extremely large multidimensional arrays." VLDB Privacy-Preserving Indexing of Documents on the Network. Mayank Bawa,Roberto J. Bayardo Jr.,Rakesh Agrawal 2003 We address the problem of providing privacy-preserving search over distributed access-controlled content. Indexed documents can be easily reconstructed from conventional (inverted) indexes used in search. The need to avoid breaches of access-control through the index requires the index hosting site to be fully secured and trusted by all participating content providers. This level of trust is impractical in the increasingly common case where multiple competing organizations or individuals wish to selectively share content. We propose a solution that eliminates the need of such a trusted authority. The solution builds a centralized privacy-preserving index in conjunction with a distributed access-control enforcing search protocol. The new index provides strong and quantifiable privacy guarantees that hold even if the entire index is made public. Experiments on a real-life dataset validate performance of the scheme. The appeal of our solution is two-fold: (a) Content providers maintain complete control in defining access groups and ensuring its compliance, and (b) System implementors retain tunable knobs to balance privacy and efficiency concerns for their particular domains. VLDB Efficient Query Processing for Multi-Dimensionally Clustered Tables in DB2. Bishwaranjan Bhattacharjee,Sriram Padmanabhan,Timothy Malkemus,Tony Lai,Leslie Cranston,Matthew Huras 2003 We have introduced a Multi-Dimensional Clustering (MDC) physical layout scheme in DB2 version 8.0 for relational tables. Multi-Dimensional Clustering is based on the definition of one or more orthogonal clustering attributes (or expressions) of a table. The table is organized physically by associating records with similar values for the dimension attributes in a cluster. Each clustering key is allocated one or more blocks of physical storage with the aim of storing the multiple records belonging to the cluster in almost contiguous fashion. Block oriented indexes are created to access these blocks.
In this paper, we describe novel techniques for query processing operations that provide significant performance improvements for MDC tables. Current database systems employ a repertoire of access methods including table scans, index scans, index ANDing, and index ORing. We have extended these access methods for efficiently processing the block based MDC tables. One important concept at the core of processing MDC tables is the block oriented access technique. In addition, since MDC tables can include regular record oriented indexes, we employ novel techniques to combine block and record indexes. Block oriented processing is extended to nested loop joins and star joins as well. We show results from experiments using a star-schema database to validate our claims of performance with minimal overhead. VLDB Xcerpt and visXcerpt: From Pattern-Based to Visual Querying of XML and Semistructured Data. Sacha Berger,François Bry,Sebastian Schaffert,Christoph Wieser 2003 Xcerpt and visXcerpt: From Pattern-Based to Visual Querying of XML and Semistructured Data. VLDB Web Services (Industrial Session). Felipe Cabrera 2003 Web Services (Industrial Session). VLDB Chip-Secured Data Access: Reconciling Access Rights with Data Encryption. Luc Bouganim,François Dang Ngoc,Philippe Pucheral,Lilan Wu 2003 Chip-Secured Data Access: Reconciling Access Rights with Data Encryption. VLDB Operator Scheduling in a Data Stream Manager. Donald Carney,Ugur Çetintemel,Alex Rasin,Stanley B. Zdonik,Mitch Cherniack,Michael Stonebraker 2003 Many stream-based applications have sophisticated data processing requirements and real-time performance expectations that need to be met under high-volume, time-varying data streams. In order to address these challenges, we propose novel operator scheduling approaches that specify (1) which operators to schedule (2) in which order to schedule the operators, and (3) how many tuples to process at each execution step. We study our approaches in the context of the Aurora data stream manager. We argue that a fine-grained scheduling approach in combination with various scheduling techniques (such as batching of operators and tuples) can significantly improve system efficiency by reducing various system overheads. We also discuss application-aware extensions that make scheduling decisions according to per-application Quality of Service (QoS) specifications. Finally, we present prototype-based experimental results that characterize the efficiency and effectiveness of our approaches under various stream workloads and processing scenarios. VLDB Illuminating the Dark Side of Web Services. Michael L. Brodie 2003 Illuminating the Dark Side of Web Services. VLDB Constructing and integrating data-centric Web Applications: Methods, Tools, and Techniques. Stefano Ceri,Ioana Manolescu 2003 This tutorial deals with the construction of data-centric Web applications, focusing on the modelling of processes and on the integration with Web services. The tutorial describes the standards, methods, and tools that are commonly used for building these applications. VLDB Primitives for Workload Summarization and Implications for SQL. Surajit Chaudhuri,Prasanna Ganesan,Vivek R. Narasayya 2003 "Workload information has proved to be a crucial component for database-administration tasks as well as for analysis of query logs to understand user behavior and system usage. These tasks require the ability to summarize large SQL workloads. 
In this paper, we identify primitives that are important to enable many important workload-summarization tasks. These primitives also appear to be useful in a variety of practical scenarios besides workload summarization. Today's SQL is inadequate to express these primitives conveniently. We discuss possible extensions to SQL and the relational engine to efficiently support such summarization primitives." VLDB Who needs XML Databases? Sophie Cluet 2003 Who needs XML Databases? VLDB Privacy-Enhanced Data Management for Next-Generation e-Commerce. Chris Clifton,Irini Fundulaki,Richard Hull,Bharat Kumar,Daniel F. Lieuwen,Arnaud Sahuguet 2003 Privacy-Enhanced Data Management for Next-Generation e-Commerce. VLDB XSEarch: A Semantic Search Engine for XML. Sara Cohen,Jonathan Mamou,Yaron Kanza,Yehoshua Sagiv 2003 "XSEarch, a semantic search engine for XML, is presented. XSEarch has a simple query language, suitable for a naive user. It returns semantically related document fragments that satisfy the user's query. Query answers are ranked using extended information-retrieval techniques and are generated in an order similar to the ranking. Advanced indexing techniques were developed to facilitate efficient implementation of XSEarch. The performance of the different techniques as well as the recall and the precision were measured experimentally. These experiments indicate that XSEarch is efficient, scalable and ranks quality results highly." VLDB RRXF: Redundancy reducing XML storage in relations. Yi Chen,Susan B. Davidson,Carmem S. Hara,Yifeng Zheng 2003 Current techniques for storing XML using relational technology consider the structure of an XML document but ignore its semantics as expressed by keys or functional dependencies. However, when the semantics of a document are considered redundancy may be reduced, node identifiers removed where value-based keys are available, and semantic constraints validated using relational primary key technology. In this paper, we propose a novel constraint definition called XFDs that capture structural as well as semantic information. We present a set of rewriting rules for XFDs, and use them to design a polynomial time algorithm which, given an input set of XFDs, computes a reduced set of XFDs. Based on this algorithm, we present a redundancy removing storage mapping from XML to relations called RRXS. The effectiveness of the mapping is demonstrated by experiments on three data sets. VLDB From Tree Patterns to Generalized Tree Patterns: On Efficient Evaluation of XQuery. Zhimin Chen,H. V. Jagadish,Laks V. S. Lakshmanan,Stelios Paparizos 2003 XQuery is the de facto standard XML query language, and it is important to have efficient query evaluation techniques available for it. A core operation in the evaluation of XQuery is the finding of matches for specified tree patterns, and there has been much work towards algorithms for finding such matches efficiently. Multiple XPath expressions can be evaluated by computing one or more tree pattern matches. However, relatively little has been done on efficient evaluation of XQuery queries as a whole. In this paper, we argue that there is much more to XQuery evaluation than a tree pattern match. We propose a structure called generalized tree pattern (GTP) for concise representation of a whole XQuery expression. Evaluating the query reduces to finding matches for its GTP. Using this idea we develop efficient evaluation plans for XQuery expressions, possibly involving join, quantifiers, grouping, aggregation, and nesting. 
XML data often conforms to a schema. We show that using relevant constraints from the schema, one can optimize queries significantly, and give algorithms for automatically inferring GTP simplifications given a schema. Finally, we show, through a detailed set of experiments using the TIMBER XML database system, that plans via GTPs (with or without schema knowledge) significantly outperform plans based on navigation and straightforward plans obtained directly from the query. VLDB VIPAS: Virtual Link Powered Authority Search in the Web. Chi-Chun Lin,Ming-Syan Chen 2003 With the exponential growth of the World Wide Web, looking for pages with high quality and relevance in the Web has become an important research field. There have been many keyword-based search engines built for this purpose. However, these search engines usually suffer from the problem that a relevant Web page may not contain the keyword in its page text. Algorithms exploiting the link structure of Web documents, such as HITS, have also been proposed to overcome the problems of traditional search engines. Though these algorithms perform better than keyword-based search engines, they still have some defects. Among others, one major problem is that links in Web pages are only able to reflect the view of the page authors on the topic of those pages but not that of the page readers. In this paper, we propose a new algorithm with the idea of using virtual links which are created according to how the user behaves in browsing the output list of the query result. These virtual links are then employed to identify authoritative resources in the Web. Specifically, the algorithm, referred to as algorithm VIPAS (standing for virtual link powered authority search), is divided into three phases. The first phase performs basic link analysis. The second phase collects statistics by observing the user behavior in browsing pages listed in the query result, and virtual links are then created according to what is observed. In the third phase, these virtual links as well as real ones are taken together to produce an updated list of authoritative pages that will be presented to the user when a query with similar keywords is encountered next time. A Web warehouse is built and the algorithm is integrated into the system. By conducting experiments on the system, we have shown that VIPAS is not only very effective but also very adaptive in providing much more valuable information to users. VLDB A Nanotechnology-based Approach to Data Storage. Evangelos Eleftheriou,Peter Bächtold,Giovanni Cherubini,Ajay Dholakia,Christoph Hagleitner,T. Loeliger,Aggeliki Pantazi,Haralampos Pozidis,T. R. Albrecht,Gerd Karl Binnig,Michel Despont,Ute Drechsler,Urs Dürig,Bernd Gotsmann,Daniel Jubin,Walter Häberle,Mark A. Lantz,Hugo E. Rothuizen,Richard Stutz,Peter Vettiger,Dorothea Wiesmann 2003 "Ultrahigh storage densities of up to 1 Tb/in² or more can be achieved by using local-probe techniques to write, read back, and erase data in very thin polymer films. The thermomechanical scanning-probe-based data-storage concept, internally dubbed "millipede", combines ultrahigh density, small form factor, and high data rates. High data rates are achieved by parallel operation of large 2D arrays with thousands of micro/nanomechanical cantilevers/tips that can be batch-fabricated by silicon surface-micromachining techniques.
The inherent parallelism, the ultrahigh areal densities and the small form factor may open up new perspectives and opportunities for application in areas beyond those envisaged today." VLDB Finding Hierarchical Heavy Hitters in Data Streams. Graham Cormode,Flip Korn,S. Muthukrishnan,Divesh Srivastava 2003 "Aggregation along hierarchies is a critical summary technique in a large variety of on-line applications including decision support and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little attention has been paid to summarizing hierarchical structure in stream data. The problem we study in this paper is that of finding Hierarchical Heavy Hitters (HHH): given a hierarchy and a fraction φ, we want to find all HHH nodes that have a total number of descendants in the data stream no smaller than φ of the total number of elements in the data stream, after discounting the descendant nodes that are HHH nodes. The resulting summary gives a topological "cartogram" of the hierarchical data. We present deterministic and randomized algorithms for finding HHHs, which build upon existing techniques by incorporating the hierarchy into the algorithms. Our experiments demonstrate several factors of improvement in accuracy over the straightforward approach, which is due to making algorithms hierarchy-aware." VLDB MARS: A System for Publishing XML from Mixed and Redundant Storage. Alin Deutsch,Val Tannen 2003 We present a system for publishing, as XML, data from mixed (relational+XML) proprietary storage, while supporting redundancy in storage for tuning purposes. The correspondence between public and proprietary schemas is given by a combination of LAV- and GAV-style views expressed in XQuery. XML and relational integrity constraints are also taken into consideration. Starting with client XQueries formulated against the public schema, the system achieves the combined effect of rewriting-with-views, composition-with-views and query minimization under integrity constraints to obtain optimal reformulations against the proprietary schema. The paper focuses on the engineering and the experimental evaluation of the MARS system. VLDB Query Processing for High-Volume XML Message Brokering. Yanlei Diao,Michael J. Franklin 2003 XML filtering solutions developed to date have focused on the matching of documents to large numbers of queries but have not addressed the customization of output needed for emerging distributed information infrastructures. Support for such customization can significantly increase the complexity of the filtering process. In this paper, we show how to leverage an efficient, shared path matching engine to extract the specific XML elements needed to generate customized output in an XML Message Broker. We compare three different approaches that differ in the degree to which they exploit the shared path matching engine. We also present techniques to optimize the post-processing of the path matching engine output, and to enable the sharing of such processing across queries. We evaluate these techniques with a detailed performance study of our implementation. VLDB The Semantic Web: Semantics for Data on the Web. Stefan Decker,Vipul Kashyap 2003 In our tutorial on Semantic Web (SW) technology, we explain the why, the various technology thrusts and the relationship to database technology.
The motivation behind presenting this tutorial is discussed, and the framework of the tutorial along with the various component technologies and research areas related to the Semantic Web is presented. VLDB Implementing Xquery 1.0: The Galax Experience. Mary F. Fernández,Jérôme Siméon,Byron Choi,Amélie Marian,Gargi Sur 2003 Galax is a light-weight, portable, open-source implementation of XQuery 1.0. Started in December 2000 as a small prototype designed to test the XQuery static type system, Galax has now become a solid implementation, aiming at full conformance with the family of XQuery 1.0 specifications. Because of its completeness and open architecture, Galax also turns out to be a very convenient platform for researchers interested in experimenting with XQuery optimization. We demonstrate the Galax system as well as its most advanced features, including support for XPath 2.0, XML Schema and static type-checking. We also present some of our first experiments with optimization. Notably, we demonstrate query rewriting capabilities in the Galax compiler, and the ability to run queries on documents up to a Gigabyte without the need for preindexing. Although early versions of Galax have been shown in industrial conferences over the last two years, this is the first time it is demonstrated in the database community. VLDB On the minimization of Xpath queries. Sergio Flesca,Filippo Furfaro,Elio Masciari 2003 XML queries are usually expressed by means of XPath expressions identifying portions of the selected documents. An XPath expression defines a way of navigating an XML tree and returns the set of nodes which are reachable from one or more starting nodes through the paths specified by the expression. The problem of efficiently answering XPath queries is very interesting and has recently received increasing attention from the research community. In particular, an increasing effort has been devoted to defining effective optimization techniques for XPath queries. One of the main issues related to the optimization of XPath queries is their minimization. The minimization of XPath queries has been studied for limited fragments of XPath, containing only the descendant, the child and the branch operators. In this work, we address the problem of minimizing XPath queries for a more general fragment, containing also the wildcard operator. We characterize the complexity of the minimization of XPath queries, stating that it is NP-hard, and propose an algorithm for computing minimum XPath queries. Moreover, we identify an interesting tractable case and propose an ad hoc algorithm handling the minimization of this kind of query in polynomial time. VLDB The BEA/XQRL Streaming XQuery Processor. Daniela Florescu,Chris Hillery,Donald Kossmann,Paul Lucas,Fabio Riccardi,Till Westmann,Michael J. Carey,Arvind Sundararajan,Geetika Agrawal 2003 "In this paper, we describe the design, implementation, and performance characteristics of a complete, industrial-strength XQuery engine, the BEA streaming XQuery processor. The engine was designed to provide very high performance for message processing applications, i.e., for transforming XML data streams, and it is a central component of the 8.1 release of BEA's WebLogic Integration (WLI) product. This XQuery engine is fully compliant with the August 2002 draft of the W3C XML Query Language specification. A goal of this paper is to describe how an efficient, fully compliant XQuery engine can be built from a few relatively simple components and well-understood technologies."
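As an illustrative aside on the streaming XML processors listed above (XSQ and the BEA/XQRL engine both evaluate path expressions over incoming parse events rather than over a materialized document tree), the sketch below shows the basic idea of event-driven matching for a child-axis-only XPath. It is a minimal, hypothetical Python example, not code from either system; the names PathMatcher and match_path are invented for illustration.

```python
# Minimal sketch of streaming, event-driven XPath matching (child axis only),
# in the general spirit of the stream processors above. Hypothetical code.
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Reports the text of elements reached via a fixed path such as /bib/book/title."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps          # e.g. ["bib", "book", "title"]
        self.stack = []             # names of currently open elements
        self.buffer = None          # collects text while inside a matching element

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.steps:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.stack == self.steps and self.buffer is not None:
            print("match:", "".join(self.buffer).strip())
            self.buffer = None
        self.stack.pop()


def match_path(xml_file, xpath):
    steps = [s for s in xpath.split("/") if s]
    xml.sax.parse(xml_file, PathMatcher(steps))


# Example (hypothetical input file): match_path("bib.xml", "/bib/book/title")
```

The point of the sketch is only that a stack of open element names is enough state to evaluate simple child-axis paths in one pass; the systems above handle far richer features (predicates, closures, aggregation, full XQuery) with correspondingly richer automata and buffering schemes.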
VLDB Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. Lukasz Golab,M. Tamer Özsu 2003 Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB Locating Data Sources in Large Distributed Systems. Leonidas Galanis,Yuan Wang,Shawn R. Jeffery,David J. DeWitt 2003 Querying large numbers of data sources is gaining importance due to increasing numbers of independent data providers. One of the key challenges is executing queries on all relevant information sources in a scalable fashion and retrieving fresh results. The key to scalability is to send queries only to the relevant servers and avoid wasting resources on data sources which will not provide any results. Thus, a catalog service, which would determine the relevant data sources given a query, is an essential component in efficiently processing queries in a distributed environment. This paper proposes a catalog framework which is distributed across the data sources themselves and does not require any central infrastructure. As new data sources become available, they automatically become part of the catalog service infrastructure, which allows scalability to large numbers of nodes. Furthermore, we propose techniques for workload adaptability. Using simulation and real-world data we show that our approach is valid and can scale to thousands of data sources. VLDB Statistics on Views. César A. Galindo-Legaria,Milind Joshi,Florian Waas,Ming-Chuan Wu 2003 The quality of execution plans generated by a query optimizer is tied to the accuracy of its cardinality estimation. Errors in estimation lead to poor performance, erratic behavior, and user frustration. Traditionally, the optimizer is restricted to use only statistics on base table columns and derive estimates bottom-up. This approach has shortcomings with dealing with complex queries, and with rich languages such as SQL: Errors grow as estimation is done on top of estimation, and some constructs are simply not handled. In this paper we describe the creation and utilization of statistics on views in SQL Server, which provides the optimizer with statistical information on the result of scalar or relational expressions. It opens a new dimension on the data available for cardinality estimation and enables arbitrary correction. We describe the implementation of this feature in the optimizer architecture, and show its impact on the quality of plans generated through a number of examples. VLDB Temporal Slicing in the Evaluation of XML Queries. Dengfeng Gao,Richard T. Snodgrass 2003 As with relational data, XML data changes over time with the creation, modification, and deletion of XML documents. Expressing queries on time-varying (relational or XML) data is more difficult than writing queries on nontemporal data. In this paper, we present a temporal XML query language, τXQuery, in which we add valid time support to XQuery by minimally extending the syntax and semantics of XQuery. We adopt a stratum approach which maps a τXQuery query to a conventional XQuery. The paper focuses on how to perform this mapping, in particular, on mapping sequenced queries, which are by far the most challenging. The critical issue of supporting sequenced queries (in any query language) is time-slicing the input data while retaining period timestamping. Timestamps are distributed throughout an XML document, rather than uniformly in tuples, complicating the temporal slicing while also providing opportunities for optimization. 
We propose four optimizations of our initial maximally-fragmented time-slicing approach: selected node slicing, copy-based per-expression slicing, in-place per-expression slicing, and idiomatic slicing, each of which reduces the number of constant periods over which the query is evaluated. While performance tradeoffs clearly depend on the underlying XQuery engine, we argue that there are queries that favor each of the five approaches. VLDB Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps. Torsten Grust,Maurice van Keulen,Jens Teubner 2003 Relational query processors derive much of their effectiveness from the awareness of specific table properties like sort order, size, or absence of duplicate tuples. This text applies (and adapts) this successful principle to database-supported XML and XPath processing: the relational system is made tree aware, i.e., tree properties like subtree size, intersection of paths, inclusion or disjointness of subtrees are made explicit. We propose a local change to the database kernel, the staircase join, which encapsulates the necessary tree knowledge needed to improve XPath performance. Staircase join operates on an XML encoding which makes this knowledge available at the cost of simple integer operations (e.g., +, ≤). We finally report on quite promising experiments with a staircase join enhanced main-memory database kernel. VLDB BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. Paul Brown,Peter J. Haas 2003 "We present the BHUNT scheme for automatically discovering algebraic constraints between pairs of columns in relational data. The constraints may be "fuzzy" in that they hold for most, but not all, of the records, and the columns may be in the same table or different tables. Such constraints are of interest in the context of both data mining and query optimization, and the BHUNT methodology can potentially be adapted to discover fuzzy functional dependencies and other useful relationships. BHUNT first identifies candidate sets of column value pairs that are likely to satisfy an algebraic constraint. This discovery process exploits both system catalog information and data samples, and employs pruning heuristics to control processing costs. For each candidate, BHUNT constructs algebraic constraints by applying statistical histogramming, segmentation, or clustering techniques to samples of column values. Using results from the theory of tolerance intervals, the sample sizes can be chosen to control the number of "exception" records that fail to satisfy the discovered constraints. In query-optimization mode, BHUNT can automatically partition the data into normal and exception records. During subsequent query processing, queries can be modified to incorporate the constraints; the optimizer uses the constraints to identify new, more efficient access paths. The results are then combined with the results of executing the original query against the (small) set of exception records. Experiments on a very large database using a prototype implementation of BHUNT show reductions in table accesses of up to two orders of magnitude, leading to speedups in query processing by up to a factor of 6.8." VLDB Integrated Data Management for Mobile Services in the Real World. Christian Hage,Christian S. Jensen,Torben Bach Pedersen,Laurynas Speicys,Igor Timko 2003 Market research companies predict a huge market for services to be delivered to mobile users.
Services include route guidance, point-of-interest search, metering services such as road pricing and parking payment, traffic monitoring, etc. We believe that no single such service will be the killer service, but that suites of integrated services are called for. Such integrated services reuse integrated content obtained from multiple content providers. This paper describes concepts and techniques underlying the data management system deployed by a Danish mobile content integrator. While georeferencing of content is important, it is even more important to relate content to the transportation infrastructure. The data management system thus relies on several sophisticated, integrated representations of the infrastructure, each of which supports its own kind of use. The paper covers data modeling, querying, and update, as well as the applications using the system. VLDB Mixed Mode XML Query Processing. Alan Halverson,Josef Burger,Leonidas Galanis,Ameet Kini,Rajasekar Krishnamurthy,Ajith Nagaraja Rao,Feng Tian,Stratis Viglas,Yuan Wang,Jeffrey F. Naughton,David J. DeWitt 2003 Querying XML documents typically involves both tree-based navigation and pattern matching similar to that used in structured information retrieval domains. In this paper, we show that for good performance, a native XML query processing system should support query plans that mix these two processing paradigms. We describe our prototype native XML system, and report on experiments demonstrating that even for simple queries, there are a number of options for how to combine tree-based navigation and structural joins based on information retrieval-style inverted lists, and that these options can have widely varying performance. We present ways of transparently using both techniques in a single system, and provide a cost model for identifying efficient combinations of the techniques. Our preliminary experimental results prove the viability of our approach. VLDB Scheduling for shared window joins over data streams. Moustafa A. Hammad,Michael J. Franklin,Walid G. Aref,Ahmed K. Elmagarmid 2003 Continuous Query (CQ) systems typically exploit commonality among query expressions to achieve improved efficiency through shared processing. Recently proposed CQ systems have introduced window specifications in order to support unbounded data streams. There has been, however, little investigation of sharing for windowed query operators. In this paper, we address the shared execution of windowed joins, a core operator for CQ systems. We show that the strategy used in systems to date has a previously unreported performance flaw that can negatively impact queries with relatively small windows. We then propose two new execution strategies for shared joins. We evaluate the alternatives using both analytical models and implementation in a DBMS. The results show that one strategy, called MQT, provides the best performance over a range of workload settings. VLDB Data Morphing: An Adaptive, Cache-Conscious Storage Technique. Richard A. Hankins,Jignesh M. Patel 2003 The number of processor cache misses has a critical impact on the performance of DBMSs running on servers with large main-memory configurations. In turn, the cache utilization of database systems is highly dependent on the physical organization of the records in main-memory. A recently proposed storage model, called PAX, was shown to greatly improve the performance of sequential file-scan operations when compared to the commonly implemented N-ary storage model. 
However, the PAX storage model can also demonstrate poor cache utilization for other common operations, such as index scans. Under a workload of heterogeneous database operations, neither the PAX storage model nor the N-ary storage model is optimal. In this paper, we propose a flexible data storage technique called Data Morphing. Using Data Morphing, a cache-efficient attribute layout, called a partition, is first determined through an analysis of the query workload. This partition is then used as a template for storing data in a cache-efficient way. We present two algorithms for computing partitions, and also present a versatile storage model that accommodates the dynamic reorganization of the attributes in a file. Finally, we experimentally demonstrate that the Data Morphing technique provides a significant performance improvement over both the traditional N-ary storage model and the PAX model. VLDB XISS/R: XML Indexing and Storage System using RDBMS. Philip J. Harding,Quanzhong Li,Bongki Moon 2003 We demonstrate the XISS/R system, an implementation of the XML Indexing and Storage System (XISS) on top of a relational database. The system is based on the XISS extended preorder numbering scheme, which captures the nesting structure of XML data and provides the opportunity for storage and query processing independent of the particular structure of the data. The system includes a web-based user interface, which enables stored documents to be queried via XPath. The user interface utilizes the XPath Query Engine, which automatically translates XPath queries into efficient SQL statements. VLDB WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce. Hai He,Weiyi Meng,Clement T. Yu,Zonghuan Wu 2003 More and more databases are becoming Web accessible through form-based search interfaces, and many of these sources are E-commerce sites. Providing unified access to multiple E-commerce search engines selling similar products is of great importance in allowing users to search and compare products from multiple sites with ease. One key task for providing such a capability is to integrate the Web interfaces of these E-commerce search engines so that user queries can be submitted against the integrated interface. Currently, integrating such search interfaces is carried out either manually or semi-automatically, which is inefficient and difficult to maintain. In this paper, we present WISE-Integrator - a tool that performs automatic integration of Web Interfaces of Search Engines. WISE-Integrator employs sophisticated techniques to identify matching attributes from different search interfaces for integration. It also resolves domain differences of matching attributes. Our experimental results based on 20 and 50 interfaces in two different domains indicate that WISE-Integrator can achieve high attribute matching accuracy and can produce high-quality integrated search interfaces without human interactions. VLDB Estimating the Output Cardinality of Partial Preaggregation with a Measure of Clusteredness. Sven Helmer,Thomas Neumann,Guido Moerkotte 2003 We introduce a new parameter, the clusteredness of data, and show how it can be used for estimating the output cardinality of a partial preaggregation operator. This provides the query optimizer with an important piece of information for deciding whether the application of partial preaggregation is beneficial.
Experimental results are very promising, due to the high accuracy of the cardinality estimation based on our measure of clusteredness. VLDB COMBI-Operator: Database Support for Data Mining Applications. Alexander Hinneburg,Wolfgang Lehner,Dirk Habich 2003 COMBI-Operator: Database Support for Data Mining Applications. VLDB Efficient IR-Style Keyword Search over Relational Databases. Vagelis Hristidis,Luis Gravano,Yannis Papakonstantinou 2003 "Applications in which plain text coexists with structured data are pervasive. Commercial relational database management systems (RDBMSs) generally provide querying capabilities for text attributes that incorporate state-of-the-art information retrieval (IR) relevance ranking strategies, but this search functionality requires that queries specify the exact column or columns against which a given list of keywords is to be matched. This requirement can be cumbersome and inflexible from a user perspective: good answers to a keyword query might need to be ""assembled"" -in perhaps unforeseen ways- by joining tuples from multiple relations. This observation has motivated recent research on free-form keyword search over RDBMSs. In this paper, we adapt IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs. Our query model can handle queries with both AND and OR semantics, and exploits the sophisticated single-column text-search functionality often available in commercial RDBMSs. We develop query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches -according to some definition of ""relevance""- are generally of interest. Consequently, rather than computing all matches for a keyword query, which leads to inefficient executions, our techniques focus on the top-k matches for the query, for moderate values of k. A thorough experimental evaluation over real data shows the performance advantages of our approach." VLDB Querying the Internet with PIER. Ryan Huebsch,Joseph M. Hellerstein,Nick Lanham,Boon Thau Loo,Scott Shenker,Ion Stoica 2003 The database research community prides itself on scalable technologies. Yet database systems traditionally do not excel on one important scalability dimension: the degree of distribution. This limitation has hampered the impact of database technologies on massively distributed systems like the Internet. In this paper, we present the initial design of PIER, a massively distributed query engine based on overlay networks, which is intended to bring database query processing facilities to new, widely distributed environments. We motivate the need for massively distributed queries, and argue for a relaxation of certain traditional database research goals in the pursuit of scalability and widespread adoption. We present simulation results showing PIER gracefully running relational queries across thousands of machines, and show results from the same software base in actual deployment on a large experimental cluster. VLDB AniPQO: Almost Non-intrusive Parametric Query Optimization for Nonlinear Cost Functions. Arvind Hulgeri,S. Sudarshan 2003 The cost of a query plan depends on many parameters, such as predicate selectivities and available memory, whose values may not be known at optimization time. Parametric query optimization (PQO) optimizes a query into a number of candidate plans, each optimal for some region of the parameter space. 
We propose a heuristic solution for the PQO problem for the case when the cost functions may be nonlinear in the given parameters. This solution is minimally intrusive in the sense that an existing query optimizer can be used with minor modifications. We have implemented the heuristic and the results of the tests on the TPCD benchmark indicate that the heuristic is very effective. The minimal intrusiveness, generality in terms of cost functions and number of parameters and good performance (up to 4 parameters) indicate that our solution is of significant practical importance. VLDB Supporting Top-k Join Queries in Relational Databases. Ihab F. Ilyas,Walid G. Aref,Ahmed K. Elmagarmid 2003 Ranking queries produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance. VLDB The History of Histograms (abridged). Yannis E. Ioannidis 2003 "The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their ""future history"" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small!" VLDB Continuous K-Nearest Neighbor Queries for Continuously Moving Points with Updates. Glenn S. Iwerks,Hanan Samet,Kenneth P. Smith 2003 In recent years there has been an increasing interest in databases of moving objects where the motion and extent of objects are represented as a function of time. The focus of this paper is on the maintenance of continuous K- nearest neighbor (k-NN) queries on moving points when updates are allowed. Updates change the functions describing the motion of the points, causing pending events to change. Events are processed to keep the query result consistent as points move. 
It is shown that the cost of maintaining a continuous k-NN query result for moving points represented in this way can be significantly reduced with a modest increase in the number of events processed in the presence of updates. This is achieved by introducing a continuous within query to filter the number of objects that must be taken into account when maintaining a continuous k-NN query. This new approach is presented and compared with other recent work. Experimental results are presented showing the utility of this approach. VLDB Grid Data Management Systems & Services. Arun Jagatheesan,Reagan Moore,Norman W. Paton,Paul Watson 2003 The Grid is an emerging infrastructure for providing coordinated and consistent access to distributed, heterogeneous computational and information storage resources amongst autonomous organizations. Data grids are being built across the world as the next generation data handling systems for sharing access to data and storage systems within multiple administrative domains. A data grid provides logical name spaces for digital entities and storage resources to create global identifiers that are location independent. Data grid systems provide services on the logical name space for the manipulation, management, and organization of digital entities. Databases are increasingly being used within Grid applications for data and metadata management, and several groups are now developing services for the access and integration of structured data on the Grid. The service-based approach to making data available on the Grid is being encouraged by the adoption of the Open Grid Services Architecture (OGSA), which is bringing about the integration of the Grid with Web Service technologies. The tutorial will introduce the Grid, and examine the requirements, issues and possible solutions for integrating data into the Grid. It will take examples from current systems, in particular the SDSC Storage Resource Broker and the OGSA-Database Access and Integration project. VLDB Robust Estimation With Sampling and Approximate Pre-Aggregation. Chris Jermaine 2003 "The majority of data reduction techniques for approximate query processing (such as wavelets, histograms, kernels, and so on) are not usually applicable to categorical data. There has been something of a disconnect between research in this area and the reality of data-base data; much recent research has focused on approximate query processing over ordered or numerical attributes, but arguably the majority of database attributes are categorical: country, state, job_title, color, sex, department, and so on. This paper considers the problem of approximation of aggregate functions over categorical data, or mixed categorical/numerical data. We propose a method based upon random sampling, called Approximate Pre-Aggregation (APA). The biggest drawback of sampling for aggregate function estimating is the sensitivity of sampling to attribute value skew, and APA uses several techniques to overcome this sensitivity. The increase in accuracy using APA compared to ""plain vanilla"" sampling is dramatic. For SUM and AVG queries, the relative error for random sampling alone is more than 700% greater than for sampling with APA. Even if stratified sampling techniques are used, the error is still between 28% and 175% greater than for APA." VLDB Holistic Twig Joins on Indexed XML Documents. 
Haifeng Jiang,Wei Wang,Hongjun Lu,Jeffrey Xu Yu 2003 Finding all the occurrences of a twig pattern specified by a selection predicate on multiple elements in an XML document is a core operation for efficient evaluation of XML queries. Holistic twig join algorithms were proposed recently as an optimal solution when the twig pattern only involves ancestor-descendant relationships. In this paper, we address the problem of efficient processing of holistic twig joins on all/partly indexed XML documents. In particular, we propose an algorithm that utilizes available indices on element sets. While it can be shown analytically that the proposed algorithm is as efficient as the existing state-of-the-art algorithms in terms of worst case I/O and CPU cost, experimental results on various datasets indicate that the proposed index-based algorithm performs significantly better than the existing ones, especially when binary structural joins in the twig pattern have varying join selectivities. VLDB A Database Striptease or How to Manage Your Personal Databases. Martin L. Kersten,Gerhard Weikum,Michael J. Franklin,Daniel A. Keim,Alejandro P. Buchmann,Surajit Chaudhuri 2003 A Database Striptease or How to Manage Your Personal Databases. VLDB Efficient Processing of Expressive Node-Selecting Queries on XML Data in Secondary Storage: A Tree Automata-based Approach. Christoph Koch 2003 We propose a new, highly scalable and efficient technique for evaluating node-selecting queries on XML trees which is based on recent advances in the theory of tree automata. Our query processing techniques require only two linear passes over the XML data on disk, and their main memory requirements are in principle independent of the size of the data. The overall running time is O(m + n), where m only depends on the query and n is the size of the data. The query language supported is very expressive and captures exactly all node-selecting queries answerable with only a bounded amount of memory (thus, all queries that can be answered by any form of finite-state system on XML trees). Visiting each tree node only twice is optimal, and current automata-based approaches to answering path queries on XML streams, which work using one linear scan of the stream, are considerably less expressive. These technical results - which give rise to expressive query engines that deal more efficiently with large amounts of data in secondary storage - are complemented with an experimental evaluation of our work. VLDB Path Queries on Compressed XML. Peter Buneman,Martin Grohe,Christoph Koch 2003 Central to any XML query language is a path language such as XPath which operates on the tree structure of the XML document. We demonstrate in this paper that the tree structure can be effectively compressed and manipulated using techniques derived from symbolic model checking. Specifically, we show first that succinct representations of document tree structures based on sharing subtrees are highly effective. Second, we show that compressed structures can be queried directly and efficiently through a process of manipulating selections of nodes and partial decompression. We study both the theoretical and experimental properties of this technique and provide algorithms for querying our compressed instances using node-selecting path query languages such as XPath. 
We believe the ability to store and manipulate large portions of the structure of very large XML documents in main memory is crucial to the development of efficient, scalable native XML databases and query engines. VLDB Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases. Flip Korn,S. Muthukrishnan,Yunyue Zhu 2003 Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is one of data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.) and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad-hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach - both in principle and in practice - to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality.In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, which is the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems. VLDB Efficient Approximation Of Optimization Queries Under Parametric Aggregation Constraints. Sudipto Guha,Dimitrios Gunopulos,Nick Koudas,Divesh Srivastava,Michail Vlachos 2003 We introduce and study a new class of queries that we refer to as OPAC (optimization under parametric aggregation constraints) queries. Such queries aim to identify sets of database tuples that constitute solutions of a large class of optimization problems involving the database tuples. The constraints and the objective function are specified in terms of aggregate functions of relational attributes, and the parameter values identify the constants used in the aggregation constraints. We develop algorithms that preprocess relations and construct indices to efficiently provide answers to OPAC queries. The answers returned by our indices are approximate, not exact, and provide guarantees for their accuracy. Moreover, the indices can be tuned easily to meet desired accuracy levels, providing a graceful tradeoff between answer accuracy and index space. We present the results of a thorough experimental evaluation analyzing the impact of several parameters on the accuracy and performance of our techniques. Our results indicate that our methodology is effective and can be deployed easily, utilizing index structures such as R-trees. VLDB Data Stream Query Processing: A Tutorial. 
Nick Koudas,Divesh Srivastava 2003 Data Stream Query Processing: A Tutorial. VLDB Coarse-Grained Optimization: Techniques for Rewriting SQL Statement Sequences. Tobias Kraft,Holger Schwarz,Ralf Rantzau,Bernhard Mitschang 2003 "Relational OLAP tools and other database applications generate sequences of SQL statements that are sent to the database server as result of a single information request provided by a user. Unfortunately, these sequences cannot be processed efficiently by current database systems because they typically optimize and process each statement in isolation. We propose a practical approach for this optimization problem, called ""coarse-grained optimization,"" complementing the conventional query optimization phase. This new approach exploits the fact that statements of a sequence are correlated since they belong to the same information request. A lightweight heuristic optimizer modifies a given statement sequence using a small set of rewrite rules. Since the optimizer is part of a separate system layer, it is independent of but can be tuned to a specific underlying database system. We discuss implementation details and demonstrate that our approach leads to significant performance improvements." VLDB The Zero-Delay Data Warehouse: Mobilizing Heterogeneous Databases. eva Kühn 2003 """Now is the time... for the real-time enterprise"": In spite of this assertion from Gartner Group the heterogeneity of today's IT environments and the increasing demands from mobile users are major obstacles for the creation of this vision. Yet its technical foundation is available: software architectures based on innovative middleware components that offer a level of abstraction superior to conventional middleware solutions, including distributed transactions and the seamless integration of mobile devices using open standards, crossing the borders between heterogeneous platforms and systems. Space based computing is a new middleware paradigm meeting these demands. As an example we present the real time build-up of data warehouses." VLDB On the Costs of Multilingualism in Database Systems. A. Kumaran,Jayant R. Haritsa 2003 "Database engines are well-designed for storing and processing text data based on Latin scripts. But in today's global village, databases should ideally support multilingual text data equally efficiently. While current database systems do support management of multilingual data, we are not aware of any prior studies that compare and quantify their performance in this regard. In this paper, we first compare the multilingual functionality provided by a suite of popular database systems. We find that while the systems support most SQL-defined multilingual functionality, some needed features are not yet implemented. We then profile their performance in handling text data in IS0:8859, the standard database character set, and in Unicode, the multilingual character set. Our experimental results indicate significant performance degradation while handling multilingual data in these database systems. Worse, we find that the query optimizer's accuracy is different between standard and multilingual data types. As a first step towards alleviating the above problems, we propose Cuniform, a compressed format that is trivially convertible to Unicode. Our initial experimental results with Cuniform indicate that it largely eliminates the performance degradation for multilingual scripts with small repertoires. 
Further, the Cuniform format can elegantly support extensions to SQL for multilexical text processing." VLDB Balancing Performance and Data Freshness in Web Database Servers. Alexandros Labrinidis,Nick Roussopoulos 2003 Personalization, advertising, and the sheer volume of online data generate a staggering amount of dynamic web content. In addition to web caching, View Materialization has been shown to accelerate the generation of dynamic web content. View materialization is an attractive solution as it decouples the serving of access requests from the handling of updates. In the context of the Web, selecting which views to materialize must be decided online and needs to consider both performance and data freshness, which we refer to as the Online View Selection problem. In this paper, we define data freshness metrics, provide an adaptive algorithm for the online view selection problem, and present experimental results. VLDB Efficacious Data Cube Exploration by Semantic Summarization and Compression. Laks V. S. Lakshmanan,Jian Pei,Yan Zhao 2003 Data cube is the core operator in data warehousing and OLAP. Its efficient computation, maintenance, and utilization for query answering and advanced analysis have been the subjects of numerous studies. However, for many applications, the huge size of the data cube limits its applicability as a means for semantic exploration by the user. Recently, we have developed a systematic approach to achieve efficacious data cube construction and exploration by semantic summarization and compression. Our approach is pivoted on a notion of quotient cube that groups together structurally related data cube cells with common (aggregate) measure values into equivalence classes. The equivalence relation used to partition the cube lattice preserves the roll-up/drill-down semantics of the data cube, in that the same kind of explorations can be conducted in the quotient cube as in the original cube, between classes instead of between cells. We have also developed compact data structures for representing a quotient cube and efficient algorithms for answering queries using a quotient cube for its incremental maintenance against updates. We have implemented SOCQET, a prototype data warehousing system making use of our results on quotient cube. In this demo, we will demonstrate (1) the critical techniques of building a quotient cube; (2) use of a quotient cube to answer various queries and to support advanced OLAP; (3) an empirical study on the effectiveness and efficiency of quotient cube-based data warehouses and OLAP; (4) a user interface for visual and interactive OLAP; and (5) SOCQET, a research prototype data warehousing system integrating all the techniques. The demo reflects our latest research results and may stimulate some interesting future studies. VLDB Supporting Frequent Updates in R-Trees: A Bottom-Up Approach. Mong-Li Lee,Wynne Hsu,Christian S. Jensen,Bin Cui,Keng Lik Teo 2003 Advances in hardware-related technologies promise to enable new data management applications that monitor continuous processes. In these applications, enormous amounts of state samples are obtained via sensors and are streamed to a database. Further, updates are very frequent and may exhibit locality. While the R-tree is the index of choice for multi-dimensional data with low dimensionality, and is thus relevant to these applications, R-tree updates are also relatively inefficient. 
We present a bottom-up update strategy for R-trees that generalizes existing update techniques and aims to improve update performance. It has different levels of reorganization--ranging from global to local--during updates, avoiding expensive top-down updates. A compact main-memory summary structure that allows direct access to the R-tree index nodes is used together with efficient bottom-up algorithms. Empirical studies indicate that the bottom-up strategy outperforms the traditional top-down technique, leads to indices with better query performance, achieves higher throughput, and is scalable. VLDB AQuery: Query Language for Ordered Data, Optimization Techniques, and Experiments. Alberto Lerner,Dennis Shasha 2003 "An order-dependent query is one whose result (interpreted as a multiset) changes if the order of the input records is changed. In a stock-quotes database, for instance, retrieving all quotes concerning a given stock for a given day does not depend on order, because the collection of quotes does not depend on order. By contrast, finding a stock's five-price moving-average in a trades table gives a result that depends on the order of the table. Query languages based on the relational data model can handle order-dependent queries only through add-ons. SQL:1999, for instance, has a new ""window"" mechanism which can sort data in limited parts of a query. Add-ons make order-dependent queries difficult to write and to optimize. In this paper we show that order can be a natural property of the underlying data model and algebra. We introduce a new query language and algebra, called AQuery, that supports order from-the-ground-up. New order-related query transformations arise in this setting. We show by experiment that this framework - language plus optimization techniques - brings orders-of-magnitude improvement over SQL:1999 systems on many natural order-dependent queries." VLDB Grid and Applications (Industrial Session). Frank Leymann 2003 Grid and Applications (Industrial Session). VLDB CachePortal II: Acceleration of Very Large Scale Data Center-Hosted Database-driven Web Applications. Wen-Syan Li,Oliver Po,Wang-Pin Hsiung,K. Selçuk Candan,Divyakant Agrawal,Yusuf Akca,Kunihiro Taniguchi 2003 CachePortal II: Acceleration of Very Large Scale Data Center-Hosted Database-driven Web Applications. VLDB SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads. Lipyeow Lim,Min Wang,Jeffrey Scott Vitter 2003 Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries. These selectivities are typically used for cost-based query optimization. While the problem of building an accurate histogram for a given attribute or attribute set has been well-studied, little attention has been given to the problem of building and tuning a set of histograms collectively for multidimensional queries in a self-managed manner based only on query feedback. In this paper, we present SASH, a Self-Adaptive Set of Histograms that addresses the problem of building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically build and maintain itself using query feedback information only. In the online tuning phase, the current set of histograms is tuned in response to the estimation error of each query in an online manner. In the restructuring phase, a new and more accurate set of histograms replaces the current set of histograms. 
The new set of histograms (attribute sets and memory distribution) is found using information from a batch of query feedback. We present experimental results that show the effectiveness and accuracy of our approach. VLDB Multiscale Histograms: Summarizing Topological Relations in Large Spatial Datasets. Xuemin Lin,Qing Liu,Yidong Yuan,Xiaofang Zhou 2003 Summarizing topological relations is fundamental to many spatial applications including spatial query optimization. In this paper, we present several novel techniques to effectively construct cell density based spatial histograms for range (window) summarizations restricted to the four most important topological relations: contains, contained, overlap, and disjoint. We first present a novel framework to construct a multiscale histogram composed of multiple Euler histograms with the guarantee of the exact summarization results for aligned windows in constant time. Then we present an approximate algorithm, with the approximate ratio 19/12, to minimize the storage spaces of such multiscale Euler histograms, although the problem is generally NP-hard. To conform to a limited storage space where only k Euler histograms are allowed, an effective algorithm is presented to construct multiscale histograms to achieve high accuracy. Finally, we present a new approximate algorithm to query an Euler histogram that cannot guarantee the exact answers; it runs in constant time. Our extensive experiments against both synthetic and real world datasets demonstrated that the approximate multiscale histogram techniques may improve the accuracy of the existing techniques by several orders of magnitude while retaining the cost efficiency, and the exact multiscale histogram technique requires only a storage space linearly proportional to the number of cells for the real datasets. VLDB Capturing Global Transactions from Multiple Recovery Log Files in a Partitioned Database System. Chengfei Liu,Bruce G. Lindsay,Serge Bourbonnais,Elizabeth Hamel,Tuong C. Truong,Jens Stankiewitz 2003 "DB2 DataPropagator is one of the IBM's solutions for asynchronous replication of relational data by two separate programs Capture and Apply. The Capture program captures changes made to source data from recovery log files into staging tables, while the Apply program applies the changes from the staging tables to target data. Currently the Capture program only supports capturing changes made by local transactions in a single database log file. With the increasing deployment of partitioned database systems in OLTP environments there is a need to replicate the operational data from the partitioned systems. This paper introduces a system called CaptureEEE which extends the Capture program to capture global transactions executed on partitioned databases supported by DB2 Enterprise-Extended Edition. The architecture and the components of CaptureEEE are presented. The algorithm for merging log entries from multiple recovery log files is discussed in detail." VLDB Optimized Query Execution in Large Search Engines with Global Page Ordering. Xiaohui Long,Torsten Suel 2003 Large web search engines have to answer thousands of queries per second with interactive response times. A major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. 
To address this issue, IR and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. Over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. We focus on the question of how such techniques can be efficiently integrated into query processing. In particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by Pagerank or any other method, in addition to the standard term-based approach. We describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with million web pages. Our results show that there is significant potential benefit in such techniques. VLDB Systematic Development of Data Mining-Based Data Quality Tools. Dominik Lübbers,Udo Grimmer,Matthias Jarke 2003 Data quality problems have been a persistent concern especially for large historically grown databases. If maintained over long periods, interpretation and usage of their schemas often shifts. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques in order to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is a-priori unknown, the design of data auditing environments requires special methods for the calibration of error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at Daimler-Chrysler shows the usefulness of the approach as a complement to standard data scrubbing. VLDB Locking Protocols for Materialized Aggregate Join Views. Gang Luo,Jeffrey F. Naughton,Curt J. Ellmann,Michael Watzke 2003 The maintenance of materialized aggregate join views is a well-studied problem. However, to date the published literature has largely ignored the issue of concurrency control. Clearly immediate materialized view maintenance with transactional consistency, if enforced by generic concurrency control mechanisms, can result in low levels of concurrency and high rates of deadlock. While this problem is superficially amenable to well-known techniques such as fine-granularity locking and special lock modes for updates that are associative and commutative, we show that these previous techniques do not fully solve the problem. We extend previous high concurrency locking techniques to apply to materialized view maintenance, and show how this extension can be implemented even in the presence of indices on the materialized view. VLDB Composing Mappings Among Data Sources. Jayant Madhavan,Alon Y. Halevy 2003 Semantic mappings between data sources play a key role in several data sharing architectures. 
Mappings provide the relationships between data stored in different sources, and therefore enable answering queries that require data from other nodes in a data sharing network. Composing mappings is one of the core problems that lies at the heart of several optimization methods in data sharing networks, such as caching frequently traversed paths and redundancy analysis. This paper investigates the theoretical underpinnings of mapping composition. We study the problem for a rich mapping language, GLAV, that combines the advantages of the known mapping formalisms global-as-view and local-as-view. We first show that even when composing two simple GLAV mappings, the full composition may be an infinite set of GLAV formulas. Second, we show that if we restrict the set of queries to be in CQk (a common restriction in practice), then we can always encode the infinite set of GLAV formulas using a finite representation. Furthermore, we describe an algorithm that, given a query and a finite encoding of an infinite set of GLAV formulas, finds all the certain answers to the query. Consequently, we show that for a commonly occurring class of queries it is possible to pre-compose mappings, thereby potentially offering significant savings in query processing. VLDB Projecting XML Documents. Amélie Marian,Jérôme Siméon 2003 XQuery is not only useful to query XML in databases, but also to applications that must process XML documents as files or streams. These applications suffer from the limitations of current main-memory XQuery processors which break for rather small documents. In this paper we propose techniques, based on a notion of projection for XML, which can be used to drastically reduce memory requirements in XQuery processors. The main contribution of the paper is a static analysis technique that can identify at compile time which parts of the input document are needed to answer an arbitrary XQuery. We present a loading algorithm that takes the resulting information to build a projected document, which is smaller than the original document, and on which the query yields the same result. We implemented projection in the Galax XQuery processor. Our experiments show that projection reduces memory requirements by a factor of 20 on average, and is effective for a wide variety of queries. In addition, projection results in some speedup during query evaluation. VLDB Integrating Information for On Demand Computing. Nelson Mendonça Mattos 2003 Information integration provides a competitive advantage to businesses and is fundamental to on demand computing. It is a strategic area of investment by software companies today whose goal is to provide a unified view of the data regardless of differences in data format, data location and access interfaces, dynamically manage data placement to match availability, currency and performance requirements, and provide autonomic features that reduce the burden on IT staffs for managing complex data architectures. This paper describes the motivation for integrating information for on demand computing, explains its requirements, and illustrates its value through usage scenarios. As shown in the paper, there is still a tremendous amount of research, engineering, and development work needed to make the full information integration vision a reality and it is expected that software companies will continue to heavily invest in aggressively pursuing the information integration vision. VLDB OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. 
Colin Meek,Jignesh M. Patel,Shruti Kasetty 2003 A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable. The alternative to BLAST is to use an accurate algorithm, such as the Smith-Waterman (S-W) algorithm. However, these accurate algorithms are computationally very expensive, which limits their use in practice. This paper takes on the challenge of designing an accurate and efficient algorithm for evaluating local-alignment searches. To meet this goal, we propose a novel search algorithm, called OASIS. This algorithm employs a dynamic programming A*-search driven by a suffix-tree index that is built on the input data set. We experimentally evaluate OASIS and demonstrate that for an important class of searches, in which the query sequence lengths are small, OASIS is more than an order of magnitude faster than S-W. In addition, the speed of OASIS is comparable to BLAST. Furthermore, OASIS returns results in decreasing order of the matching score, making it possible to use OASIS in an online setting. Consequently, we believe that it may now be practically feasible to query large biological sequence data sets using an accurate local-alignment search algorithm. VLDB OrientStore: A Schema Based Native XML Storage System. Xiaofeng Meng,Daofeng Luo,Mong-Li Lee,Jing An 2003 OrientStore: A Schema Based Native XML Storage System. VLDB Controlling Access to Published Data Using Cryptography. Gerome Miklau,Dan Suciu 2003 "We propose a framework for enforcing access control policies on published XML documents using cryptography. In this framework the owner publishes a single data instance, which is partially encrypted, and which enforces all access control policies. Our contributions include a declarative language for access policies, and the resolution of these policies into a logical ""protection model"" which protects an XML tree with keys. The data owner enforces an access control policy by granting keys to users. The model is quite powerful, allowing the data owner to describe complex access scenarios, and is also quite elegant, allowing logical optimizations to be described as rewriting rules. Finally, we describe cryptographic techniques for enforcing the protection model on published data, and provide a performance analysis using real datasets." VLDB XML Schemas in Oracle XML DB. Ravi Murthy,Sandeepan Banerjee 2003 The W3C XML Schema language is becoming increasingly popular for expressing the data model for XML documents. It is a powerful language that incorporates both structural and datatype modeling features. There are many benefits to storing XML Schema compliant data in a database system, including better queryability, optimized updates and stronger validation. However, the fidelity of the XML document cannot be sacrificed. Thus, the fundamental problem facing database implementers is: how can XML Schemas be mapped to relational (and object-relational) databases without losing schema semantics or data fidelity? In this paper, we present the Oracle XML DB solution for a flexible mapping of XML Schemas to object-relational databases. 
It preserves document fidelity, including ordering, namespaces, comments, processing instructions, etc., and handles all the XML Schema semantics including cyclic definitions, derivations (extension and restriction), and wildcards. We also discuss various query and update optimizations that involve rewriting XPath operations to directly operate on the underlying relational data. VLDB IrisNet: An Architecture for Internet-scale Sensing Services. Suman Nath,Amol Deshpande,Yan Ke,Phillip B. Gibbons,Brad Karp,Srinivasan Seshan 2003 "We demonstrate the design and an early prototype of IrisNet (Internet-scale Resource-Intensive Sensor Network services), a common, scalable networked infrastructure for deploying wide area sensing services. IrisNet is a potentially global network of smart sensing nodes, with webcams or other monitoring devices, and organizing nodes that provide the means to query recent and historical sensor-based data. IrisNet exploits the fact that high-volume sensor feeds are typically attached to devices with significant computing power and storage, and running a standard operating system. It uses aggressive filtering, smart query routing, and semantic caching to dramatically reduce network bandwidth utilization and improve query response times, as we demonstrate. Our demo will present two services built on IrisNet, from two very different application domains. The first one, a parking space finder, utilizes webcams that monitor parking spaces to answer queries such as the availability of parking spaces near a user's destination. The second one, a distributed infrastructure monitor, uses measurement tools installed in individual nodes of a large distributed infrastructure to answer queries such as average network bandwidth usage of a set of nodes." VLDB NexusScout: An Advanced Location-Based Application on a Distributed, Open Mediation Platform. Daniela Nicklas,Matthias Großmann,Thomas Schwarz 2003 This demo shows several advanced use cases of location-based services and demonstrates how these use cases are facilitated by a mediation middleware for spatial information, the Nexus Platform. The scenario shows how a mobile user can access location-based information via so called Virtual Information Towers, register spatial events, send and receive geographical messages or find her friends by displaying other mobile users. The platform facilitates these functions by transparently combining spatial data from a dynamically changing set of data providers, tracking mobile objects and observing registered spatial events. VLDB BibFinder/StatMiner: Effectively Mining and Using Coverage and Overlap Statistics in Data Integration. Zaiqing Nie,Subbarao Kambhampati,Thomas Hernandez 2003 Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition there are no effective approaches for learning the needed statistics. In this paper we present StatMiner, a system for estimating the coverage and overlap statistics while keeping the needed statistics tightly under control. StatMiner uses a hierarchical classification of the queries, and threshold based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We will demonstrate the major functionalities of StatMiner and the effectiveness of the learned statistics in BibFinder, a publicly available computer science bibliography mediator we developed. 
The sources that BibFinder integrates are autonomous and can have uncontrolled coverage and overlap. An important focus in BibFinder was thus to mine coverage and overlap statistics about these sources and to exploit them to improve query processing. VLDB The TPR*-Tree: An Optimized Spatio-Temporal Access Method for Predictive Queries. Yufei Tao,Dimitris Papadias,Jimeng Sun 2003 A predictive spatio-temporal query retrieves the set of moving objects that will intersect a query window during a future time interval. Currently, the only access method for processing such queries in practice is the TPR-tree. In this paper we first perform an analysis to determine the factors that affect the performance of predictive queries and show that several of these factors are not considered by the TPR-tree, which uses the insertion/deletion algorithms of the R*-tree designed for static data. Motivated by this, we propose a new index structure called the TPR*- tree, which takes into account the unique features of dynamic objects through a set of improved construction algorithms. In addition, we provide cost models that determine the optimal performance achievable by any data-partition spatio-temporal access method. Using experimental comparison, we illustrate that the TPR*-tree is nearly-optimal and significantly outperforms the TPR-tree under all conditions. VLDB Query Processing in Spatial Network Databases. Dimitris Papadias,Jun Zhang,Nikos Mamoulis,Yufei Tao 2003 Despite the importance of spatial networks in real-life applications, most of the spatial database literature focuses on Euclidean spaces. In this paper we propose an architecture that integrates network and Euclidean information, capturing pragmatic constraints. Based on this architecture, we develop a Euclidean restriction and a network expansion framework that take advantage of location and connectivity to efficiently prune the search space. These frameworks are successfully applied to the most popular spatial queries, namely nearest neighbors, range search, closest pairs and e-distance joins, in the context of spatial network databases. VLDB Adaptive, Hands-Off Stream Mining. Spiros Papadimitriou,Anthony Brockwell,Christos Faloutsos 2003 "Sensor devices and embedded processors are becoming ubiquitous. Their limited resources (CPU, memory and/or communication bandwidth and power) pose some interesting challenges. We need both powerful and concise ""languages"" to represent the important features of the data, which can (a) adapt and handle arbitrary periodic components, including bursts, and (b) require little memory and a single pass over the data. We propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors in remote or hostile environments to efficiently and effectively discover interesting patterns and trends. This can be done automatically, i.e., with no user intervention and expert tuning before or during data gathering. Our algorithms require limited resources and can thus be incorporated in sensors, possibly alongside a distributed query processing engine [9, 5, 22]. Updates are performed in constant time, using logarithmic space. Existing, state of the art forecasting methods (SARIMA, GARCH, etc) fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first method that has all the above characteristics. Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. 
Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform. In fact, AWSOM outperforms manually set up auto-regressive models, both in terms of long-term pattern detection and modeling, as well as by at least 10× in resource consumption." VLDB S-ToPSS: Semantic Toronto Publish/Subscribe System. Milenko Petrovic,Ioana Burcea,Hans-Arno Jacobsen 2003 S-ToPSS: Semantic Toronto Publish/Subscribe System. VLDB Data Compression in Oracle. Meikel Pöss,Dmitry Potapov 2003 "The Oracle RDBMS recently introduced an innovative compression technique for reducing the size of relational tables. By using a compression algorithm specifically designed for relational data, Oracle is able to compress data much more effectively than standard compression techniques. More significantly, unlike other compression techniques, Oracle incurs virtually no performance penalty for SQL queries accessing compressed tables. In fact, Oracle's compression may provide performance gains for queries accessing large amounts of data, as well as for certain data management operations like backup and recovery. Oracle's compression algorithm is particularly well-suited for data warehouses: environments, which contains large volumes of historical data, with heavy query workloads. Compression can enable a data warehouse to store several times more raw data without increasing the total disk storage or impacting query performance." VLDB Merging Models Based on Given Correspondences. Rachel Pottinger,Philip A. Bernstein 2003 A model is a formal description of a complex application artifact, such as a database schema, an application interface, a UML model, an ontology, or a message format. The problem of merging such models lies at the core of many meta data applications, such as view integration, mediated schema creation for data integration, and ontology merging. This paper examines the problem of merging two models given correspondences between them. It presents requirements for conducting a merge and a specific algorithm that subsumes previous work. VLDB The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces. Gang Qian,Qiang Zhu,Qiang Xue,Sakti Pramanik 2003 Similarity searches in multidimensional Nonordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as genome sequence databases. Existing indexing methods developed for multidimensional (ordered) Continuous Data Spaces (CDS) such as R-tree cannot be directly applied to an NDDS. This is because some essential geometric concepts/properties such as the minimum bounding region and the area of a region in a CDS are no longer valid in an NDDS. On the other hand, indexing methods based on metric spaces such as M-tree are too general to effectively utilize the data distribution characteristics in an NDDS. Therefore, their retrieval performance is not optimized. To support efficient similarity searches in an NDDS, we propose a new dynamic indexing technique, called the ND-tree. The key idea is to extend the relevant geometric concepts as well as some indexing strategies used in CDSs to NDDSs. Efficient algorithms for ND-tree construction are presented. Our experimental results on synthetic and genomic sequence data demonstrate that the performance of the ND-tree is significantly better than that of the linear scan and M-tree in high dimensional NDDSs. VLDB Complex Queries over Web Repositories. 
Sriram Raghavan,Hector Garcia-Molina 2003 "Web repositories, such as the Stanford WebBase repository, manage large heterogeneous collections of Web pages and associated indexes. For effective analysis and mining, these repositories must provide a declarative query interface that supports complex expressive Web queries. Such queries have two key characteristics: (i) They view a Web repository simultaneously as a collection of text documents, as a navigable directed graph, and as a set of relational tables storing properties of Web pages (length, URL, title, etc.). (ii) The queries employ application-specific ranking and ordering relationships over pages and links to filter out and retrieve only the ""best"" query results. In this paper, we model a Web repository in terms of ""Web relations"" and describe an algebra for expressing complex Web queries. Our algebra extends traditional relational operators as well as graph navigation operators to uniformly handle plain, ranked, and ordered Web relations. In addition, we present an overview of the cost-based optimizer and execution engine that we have developed, to efficiently execute Web queries over large repositories." VLDB Covering Indexes for XML Queries: Bisimulation - Simulation = Negation. Prakash Ramanan 2003 Tree Pattern Queries (TPQ), Branching Path Queries (BPQ), and Core XPath (CXPath) are subclasses of the XML query language XPath, TPQ ⊂ BPQ ⊂ CXPath ⊂ XPath. Let TPQ = TPQ+ ⊂ BPQ+ ⊂ CXPath+ ⊂ XPath+ denote the corresponding subclasses, consisting of queries that do not involve the boolean negation operator not in their predicates. Simulation and bisimulation are two different binary relations on graph vertices that have previously been studied in connection with some of these classes. For instance, TPQ queries can be minimized using simulation. Most relevantly, for an XML document, its bisimulation quotient is the smallest index that covers (i.e., can be used to answer) all BPQ queries. Our results are as follows: • A CXPath+ query can be evaluated on an XML document by computing the simulation of the query tree by the document graph. • For an XML document, its simulation quotient is the smallest covering index for BPQ+. This, together with the previously-known result stated above, leads to the following: For BPQ covering indexes of XML documents, Bisimulation - Simulation = Negation. • For an XML document, its simulation quotient, with the idref edges ignored throughout, is the smallest covering index for TPQ. For any XML document, its simulation quotient is never larger than its bisimulation quotient; in some instances, it is exponentially smaller. Our last two results show that disallowing negation in the queries could substantially reduce the size of the smallest covering index. VLDB QUIET: Continuous Query-driven Index Tuning. Kai-Uwe Sattler,Ingolf Geist,Eike Schallehn 2003 "Index tuning as part of database tuning is the task of selecting and creating indexes with the goal of reducing query processing times. However, in dynamic environments with various ad-hoc queries it is difficult to identify potentially useful indexes in advance. In this demonstration, we present our tool QUIET addressing this problem. This tool ""intercepts"" queries and - based on a cost model as well as runtime statistics about profits of index configurations - decides about index creation automatically at runtime. In this way, index tuning is driven by queries without explicit actions of the database users." 
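The QUIET entry above describes index tuning that is driven entirely by incoming queries: the tool intercepts each query and, using a cost model plus runtime statistics on the profit of index configurations, decides at runtime whether to create an index. The following Python sketch illustrates only that general idea under simple assumptions (a fixed creation cost and a per-query estimate of what a candidate index would have saved); the class and method names are hypothetical and are not the QUIET interface.

# Minimal sketch of query-driven index tuning in the spirit of the QUIET entry above.
# Illustrative only: names and the cost model are assumptions, not the QUIET API.
class IndexAdvisor:
    def __init__(self, creation_cost):
        self.creation_cost = creation_cost  # estimated cost of building one index
        self.benefit = {}                   # accumulated benefit per (table, column) candidate
        self.created = set()                # candidates already materialized as indexes

    def observe_query(self, scanned_columns):
        # scanned_columns maps a (table, column) candidate to the cost saving
        # an index on it would have yielded for this query.
        for candidate, saving in scanned_columns.items():
            self.benefit[candidate] = self.benefit.get(candidate, 0.0) + saving
            # Create the index once its accumulated benefit amortizes the build cost.
            if candidate not in self.created and self.benefit[candidate] >= self.creation_cost:
                self.create_index(candidate)

    def create_index(self, candidate):
        table, column = candidate
        print(f"CREATE INDEX idx_{table}_{column} ON {table}({column})")
        self.created.add(candidate)

# Example: three ad-hoc queries repeatedly filter on orders.customer_id.
advisor = IndexAdvisor(creation_cost=100.0)
for _ in range(3):
    advisor.observe_query({("orders", "customer_id"): 40.0})

In this toy model, the index on orders(customer_id) is created as soon as three observed queries have each reported an estimated saving of 40 cost units against a creation cost of 100; QUIET itself uses a richer cost model and real runtime statistics.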
VLDB Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics. Jiri Schindler,Anastassia Ailamaki,Gregory R. Ganger 2003 "Database systems work hard to tune I/O performance, but do not always achieve the full performance potential of modern disk systems. Their abstracted view of storage components hides useful device-specific characteristics, such as disk track boundaries and advanced built-in firmware algorithms. This paper presents a new storage manager architecture, called Lachesis, that exploits and adapts to observable device-specific characteristics in order to achieve and sustain high performance. For DSS queries, Lachesis achieves I/O efficiency nearly equivalent to sequential streaming even in the presence of competing random I/O traffic. In addition, Lachesis simplifies manual configuration and restores the optimizer's assumptions about the relative costs of different access patterns expressed in query plans. Experiments using IBM DB2 I/O traces as well as a prototype implementation show that Lachesis improves standalone DSS performance by 10% on average. More importantly, when running concurrently with an on-line transaction processing (OLTP) workload, Lachesis improves DSS performance by up to 3×, while OLTP also exhibits a 7% speedup." VLDB Commercial Use of Database Technology. Harald Schöning 2003 "This session provides insight into two different European products for the E-commerce/database market. First, Martin Meijsen, Software AG, gives an overview on Tamino, Software AG's XML DBMS while the second presentation by Eva Kühn of TECCO AG, Austria discusses technical details of products that provides ""zero-delay access to data warehouses""," VLDB An Efficient and Resilient Approach to Filtering and Disseminating Streaming Data. Shetal Shah,Shyamshankar Dharmarajan,Krithi Ramamritham 2003 Many web users monitor dynamic data such as stock prices, real-time sensor data and traffic data for making on-line decisions. Instances of such data can be viewed as data streams. In this paper, we consider techniques for creating a resilient and efficient content distribution network for such dynamically changing streaming data. We address the problem of maintaining the coherency of dynamic data items in a network of repositories: data disseminated to one repository is filtered by that repository and disseminated to repositories dependent on it. Our method is resilient to link failures and repository failures. This resiliency implies that data fidelity is not lost even when the repository from which (or a communication path through which) a user obtains data experiences failures. Experimental evaluation, using real world traces of streaming data, demonstrates that (i) the (computational and communication) cost of adding this redundancy is low, and (ii) surprisingly, in many cases, adding resiliency enhancing features actually improves the fidelity provided by the system even in cases when there are no failures. To further enhance fidelity, we also propose efficient techniques for filtering data arriving at one repository and for scheduling the dissemination of filtered data to another repository. Our results show that the combination of resiliency enhancing and efficiency improving techniques in fact help derive the potential that push based systems are said to have in delivering 100% fidelity. Without them, computational and communication delays inherent in dissemination networks can lead to a large fidelity loss even in push based dissemination. 
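The dissemination scheme in the preceding entry filters data at each repository: an update to a dynamic data item is forwarded to a dependent repository only if withholding it would violate that dependent's coherency requirement. A minimal Python sketch of that per-dependent filtering step follows, assuming a simple numeric coherency tolerance per dependent; the class names and the two-level topology are illustrative and are not the system described in the paper.

# Minimal sketch of coherency-based filtering in a dissemination network,
# in the spirit of the entry above. Illustrative assumptions throughout.
class Repository:
    def __init__(self, name):
        self.name = name
        self.dependents = []   # (child repository, coherency tolerance) pairs
        self.last_pushed = {}  # per child: last value forwarded for each data item

    def add_dependent(self, child, tolerance):
        self.dependents.append((child, tolerance))

    def receive(self, item, value):
        # Forward the new value to a dependent only if the change since the last
        # forwarded value exceeds that dependent's coherency tolerance.
        for child, tolerance in self.dependents:
            last = self.last_pushed.setdefault(child.name, {}).get(item)
            if last is None or abs(value - last) > tolerance:
                self.last_pushed[child.name][item] = value
                child.receive(item, value)

class Client(Repository):
    def receive(self, item, value):
        print(f"{self.name} sees {item} = {value}")

source = Repository("source")
proxy = Repository("proxy")
user = Client("user")
source.add_dependent(proxy, tolerance=0.5)  # proxy needs changes larger than 0.5
proxy.add_dependent(user, tolerance=1.0)    # user tolerates staleness up to 1.0
for price in (100.0, 100.2, 100.8, 102.1):
    source.receive("stock_xyz", price)

With these tolerances, the small change to 100.2 is suppressed at the source, 100.8 reaches the proxy but not the user, and the user sees only 100.0 and 102.1; this is the kind of per-repository filtering the paper combines with resiliency mechanisms to save bandwidth while preserving fidelity.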
VLDB The Data-Centric Revolution in Networking. Scott Shenker 2003 Historically, there has been little overlap between the database and networking research communities; they operate on very different levels and focus on very different issues. While this strict separation of concerns has lasted for many years, in this talk I will argue that the gap has recently narrowed to the point where the two fields now have much to say to each other. Networking research has traditionally focused on enabling communication between network hosts. This research program has produced a myriad of specific algorithms and protocols to solve such problems as error recovery, congestion control, routing, multicast and quality-of-service. It has also led to a set of general architectural principles, such as fate sharing and the end-to-end principle, that provide widely applicable guidelines for allocating functionality among network entities. VLDB A Shrinking-Based Approach for Multi-Dimensional Data Analysis. Yong Shi,Yuqing Song,Aidong Zhang 2003 "Existing data analysis techniques have difficulty in handling multi-dimensional data. In this paper, we first present a novel data preprocessing technique called shrinking which optimizes the inner structure of data inspired by the Newton's Universal Law of Gravitation[22] in the real world. This data reorganization concept can be applied in many fields such as pattern recognition, data clustering and signal processing. Then, as an important application of the data shrinking preprocessing, we propose a shrinking-based approach for multi-dimensional data analysis which consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. The process of data shrinking moves data points along the direction of the density gradient, thus generating condensed, widely-separated clusters. Following data shrinking, clusters are detected by finding the connected components of dense cells. The data-shrinking and cluster-detection steps are conducted on a sequence of grids with different cell sizes. The clusters detected at these scales are compared by a cluster-wise evaluation measurement, and the best clusters are selected as the final result. The experimental results show that this approach can effectively and efficiently detect clusters in both low- and high-dimensional spaces." VLDB From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation. Sergej Sizov,Jens Graupmann,Martin Theobald 2003 From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation. VLDB A Platform Based on the Multi-dimensional Data Model for Analysis of Bio-Molecular Structures. Srinath Srinivasa,Sujit Kumar 2003 "A platform called AnMol for supporting analytical applications over structural data of large biomolecules is described. The term ""biomolecular structure"" has various connotations and different representations. AnMol reduces these representations into graph structures. Each of these graphs are then stored as one or more vectors in a database. Vectors encapsulate structural features of these graphs. Structural queries like similarity and substructure are transformed into spatial constructs like distance and containment within regions. Query results are based on inexact matches. A refinement mechanism is supported for increasing accuracy of the results. 
Design and implementation issues of AnMol including schema structure and performance results are discussed in this paper." VLDB Load Shedding in a Data Stream Manager. Nesime Tatbul,Ugur Çetintemel,Stanley B. Zdonik,Mitch Cherniack,Michael Stonebraker 2003 A Data Stream Manager accepts push-based inputs from a set of data sources, processes these inputs with respect to a set of standing queries, and produces outputs based on Quality-of-Service (QoS) specifications. When input rates exceed system capacity, the system will become overloaded and latency will deteriorate. Under these conditions, the system will shed load, thus degrading the answer, in order to improve the observed latency of the results. This paper examines a technique for dynamically inserting and removing drop operators into query plans as required by the current load. We examine two types of drops: the first drops a fraction of the tuples in a randomized fashion, and the second drops tuples based on the importance of their content. We address the problems of determining when load shedding is needed, where in the query plan to insert drops, and how much of the load should be shed at that point in the plan. We describe efficient solutions and present experimental evidence that they can bring the system back into the useful operating range with minimal degradation in answer quality. VLDB A Regression-Based Temporal Pattern Mining Scheme for Data Streams. Wei-Guang Teng,Ming-Syan Chen,Philip S. Yu 2003 We devise in this paper a regression-based algorithm, called algorithm FTP-DS (Frequent Temporal Patterns of Data Streams), to mine frequent temporal patterns for data streams. While providing a general framework of pattern frequency counting, algorithm FTP-DS has two major features, namely one data scan for online statistics collection and regression-based compact pattern representation. To attain the feature of one data scan, the data segmentation and the pattern growth scenarios are explored for the frequency counting purpose. Algorithm FTP-DS scans online transaction flows and generates candidate frequent patterns in real time. The second important feature of algorithm FTP-DS is on the regression-based compact pattern representation. Specifically, to meet the space constraint, we devise for pattern representation a compact ATF (standing for Accumulated Time and Frequency) form to aggregately comprise all the information required for regression analysis. In addition, we develop the techniques of the segmentation tuning and segment relaxation to enhance the functions of FTP-DS. With these features, algorithm FTP-DS is able to not only conduct mining with variable time intervals but also perform trend detection effectively. Synthetic data and a real dataset which contains network alarm logs from a major telecommunication company are utilized to verify the feasibility of algorithm FTP-DS. VLDB Tuple Routing Strategies for Distributed Eddies. Feng Tian,David J. DeWitt 2003 Many applications that consist of streams of data are inherently distributed. Since input stream rates and other system parameters such as the amount of available computing resources can fluctuate significantly, a stream query plan must be able to adapt to these changes. Routing tuples between operators of a distributed stream query plan is used in several data stream management systems as an adaptive query optimization technique. The routing policy used can have a significant impact on system performance.
In this paper, we use a queuing network to model a distributed stream query plan and define performance metrics for response time and system throughput. We also propose and evaluate several practical routing policies for a distributed stream management system. The performance results of these policies are compared using a discrete event simulator. Finally, we study the impact of the routing policy on system throughput and resource allocation when computing resources can be shared between operators. VLDB Chameleon: an Extensible and Customizable Tool for Web Data Translation. Riccardo Torlone,Paolo Atzeni 2003 "Chameleon is a tool for the management of Web data according to different formats and models and for the automatic transformation of schemas and instances from one model to another. It handles semistructured data, schema languages for XML, and traditional database models. The system is based on a ""metamodel"" approach, in the sense that it knows a set of metaconstructs,and allows the definition of models by means of the involved metaconstructs. The system also has a library of basic translations, referring to the known metaconstructs, and builds actual translations by means of suitable combinations of the basic ones. The main functions offered to the user are: (i) definition of a model; (ii)definition and validation of a schema with respect to a given model; (iii)schema translation (from a model to another)." VLDB The Generalized Pre-Grouping Transformation: Aggregate-Query Optimization in the Presence of Dependencies. Aris Tsois,Timos K. Sellis 2003 One of the recently proposed techniques for the efficient evaluation of OLAP aggregate queries is the usage of clustering access methods. These methods store the fact table of a data warehouse clustered according to the dimension hierarchies using special attributes called hierarchical surrogate keys. In the presence of these access methods new processing and optimization techniques have been recently proposed. One important such optimization technique, called Hierarchical Pre-Grouping, uses the hierarchical surrogate keys in order to aggregate the fact table tuples as early as possible and to avoid redundant joins. In this paper, we study the Pre-Grouping transformation, attempting to generalize its applicability and identify its relationship to other similar transformations. Our results include a general algebraic definition of the Pre-Grouping transformation along with the formal definition of sufficient conditions for applying the transformation. Using a provided theorem we show that Pre-Grouping can be applied in the presence of functional and inclusion dependencies without the explicit usage of hierarchical surrogate keys. An additional result of our study is the definition of the Surrogate-Join transformation that can modify a join condition using a number of dependencies. To our knowledge, Surrogate-Join does not belong to any of the Semantic Query Transformation types discussed in the past. VLDB Mapping Adaptation under Evolving Schemas. Yannis Velegrakis,Renée J. Miller,Lucian Popa 2003 To achieve interoperability, modern information systems and e-commerce applications use mappings to translate data from one representation to another. In dynamic environments like the Web, data sources may change not only their data but also their schemas, their semantics, and their query capabilities. Such changes must be reflected in the mappings. Mappings left inconsistent by a schema change have to be detected and updated. 
As large, complicated schemas become more prevalent, and as data is reused in more applications, manually maintaining mappings (even simple mappings like view definitions) is becoming impractical. We present a novel framework and a tool (ToMAS) for automatically adapting mappings as schemas evolve. Our approach considers not only local changes to a schema, but also changes that may affect and transform many components of a schema. We consider a comprehensive class of mappings for relational and XML schemas with choice types and (nested) constraints. Our algorithm detects mappings affected by a structural or constraint change and generates all the rewritings that are consistent with the semantics of the mapped schemas. Our approach explicitly models mapping choices made by a user and maintains these choices, whenever possible, as the schemas and mappings evolve. We describe an implementation of a mapping management and adaptation tool based on these ideas and compare it with a mapping generation tool. VLDB A Dependability Benchmark for OLTP Application Environments. Marco Vieira,Henrique Madeira 2003 The ascendance of networked information in our economy and daily lives has increased the awareness of the importance of dependability features. OLTP (On-Line Transaction Processing) systems constitute the kernel of the information systems used today to support the daily operations of most of the business. Although these systems comprise the best examples of complex business-critical systems, no practical way has been proposed so far to characterize the impact of faults in such systems or to compare alternative solutions concerning dependability features. This paper proposes a dependability benchmark for OLTP systems. This dependability benchmark uses the workload of the TPC-C performance benchmark and specifies the measures and all the steps required to evaluate both the performance and key dependability features of OLTP systems, with emphasis on availability. This dependability benchmark is presented through a concrete example of benchmarking the performance and dependability of several different transactional systems configurations. The effort required to run the dependability benchmark is also discussed in detail. VLDB Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources. Stratis Viglas,Jeffrey F. Naughton,Josef Burger 2003 "Recently there has been a growing interest in join query evaluation for scenarios in which inputs arrive at highly variable and unpredictable rates. In such scenarios, the focus shifts from completing the computation as soon as possible to producing a prefix of the output as soon as possible. To handle this shift in focus, most solutions to date rely upon some combination of streaming binary operators and ""on-the-fly"" execution plan reorganization. In contrast, we consider the alternative of extending existing symmetric binary join operators to handle more than two inputs. Toward this end, we have completed a prototype implementation of a multi-way join operator, which we term the ""MJoin"" operator, and explored its performance. Our results show that in many instances the MJoin produces outputs sooner than any tree of binary operators. Additionally, since MJoins are completely symmetric with respect to their inputs, they can reduce the need for expensive runtime plan reorganization. 
This suggests that supporting multiway joins in a single, symmetric, streaming operator may be a useful addition to systems that support queries over input streams from remote sites." VLDB Avoiding Ordering and Grouping In Query Processing. Xiaoyu Wang,Mitch Cherniack 2003 Avoiding Ordering and Grouping In Query Processing. VLDB An Interpolated Volume Data Model. Tianqiu Wang,Simone Santini,Amarnath Gupta 2003 An Interpolated Volume Data Model. VLDB ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams. Haixun Wang,Carlo Zaniolo,Chang Luo 2003 ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams. VLDB "WebService Composition with O'GRAPE and OSIRIS." Roger Weber,Christoph Schuler,Patrick Neukomm,Heiko Schuldt,Hans-Jörg Schek 2003 "WebService Composition with O'GRAPE and OSIRIS." VLDB Business Modelling Using SQL Spreadsheets. Andrew Witkowski,Srikanth Bellamkonda,Tolga Bozkaya,Nathan Folkert,Abhinav Gupta,Lei Sheng,Sankar Subramanian 2003 One of the critical deficiencies of SQL is the lack of support for array and spreadsheet like calculations which are frequent in OLAP and Business Modeling applications. Applications relying on SQL have to emulate these calculations using joins, UNION operations, Window Functions and complex CASE expressions. The designated place in SQL for algebraic calculations is the SELECT clause, which is extremely limiting and forces applications to generate queries with nested views, subqueries and complex joins. This distributes Business Modeling computations across many query blocks, making applications coded in SQL hard to develop. The limitations of RDBMS have been filled by spreadsheets and specialized MOLAP engines which are good at formulas for mathematical modeling but lack the formalism of the relational model, are difficult to manage, and exhibit scalability problems. This demo presents a scalable, mathematically rigorous, and performant SQL extensions for Relational Business Modeling, called the SQL Spreadsheet. We present examples of typical Business Modeling computations with SQL spreadsheet and compare them with the ones using standard SQL showing performance advantages and ease of programming for the former. We will show a scalability example where data is processed in parallel and will present a new class of query optimizations applicable to SQL spreadsheet. VLDB Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. Dong Xin,Jiawei Han,Xiaolei Li,Benjamin W. Wah 2003 "Data cube computation is one of the most essential but expensive operations in data warehousing. Previous studies have developed two major approaches, top-down vs. bottom-up. The former, represented by the Multi-Way Array Cube (called MultiWay) algorithm [25], aggregates simultaneously on multiple dimensions; however, it cannot take advantage of Apriori pruning [2] when computing iceberg cubes (cubes that contain only aggregate cells whose measure value satisfies a threshold, called iceberg condition). The latter, represented by two algorithms: BUC [6] and H-Cubing[11], computes the iceberg cube bottom-up and facilitates Apriori pruning. BUC explores fast sorting and partitioning techniques; whereas H-Cubing explores a data structure, H-Tree, for shared computation. However, none of them fully explores multi-dimensional simultaneous aggregation. 
In this paper, we present a new method, Star-Cubing, that integrates the strengths of the previous three algorithms and performs aggregations on multiple dimensions simultaneously. It utilizes a star-tree structure, extends the simultaneous aggregation methods, and enables the pruning of the group-by's that do not satisfy the iceberg condition. Our performance study shows that Star-Cubing is highly efficient and outperforms all the previous methods in almost all kinds of data distributions." VLDB Efficient Mining of XML Query Patterns for Caching. Liang Huai Yang,Mong-Li Lee,Wynne Hsu 2003 As XML becomes ubiquitous, the efficient retrieval of XML data becomes critical. Research to improve query response time has been largely concentrated on indexing paths, and optimizing XML queries. An orthogonal approach is to discover frequent XML query patterns and cache their results to improve the performance of XML management systems. In this paper, we present an efficient algorithm called FastXMiner, to discover frequent XML query patterns. We develop theorems to prove that only a small subset of the generated candidate patterns needs to undergo expensive tree containment tests. In addition, we demonstrate how the frequent query patterns can be used to improve caching performance. Experiments results show that FastXMiner is efficient and scalable, and caching the results of frequent patterns significantly improves the query response time. VLDB Tabular Placement of Relational Data on MEMS-based Storage Devices. Hailing Yu,Divyakant Agrawal,Amr El Abbadi 2003 Due to the advances in semiconductor manufacturing, the gap between main memory and secondary storage is constantly increasing. This becomes a significant performance bottleneck for Database Management Systems, which rely on secondary storage heavily to store large datasets. Recent advances in nanotechnology have led to the invention of alternative means for persistent storage. In particular, MicroElectroMechanical Systems (MEMS) based storage technology has emerged as the leading candidate for next generation storage systems. In order to integrate MEMS-based storage into conventional computing platform, new techniques are needed for I/O scheduling and data placement. In the context of relational data, it has been observed that access to relations needs to be enabled in both row-wise as well as in columnwise fashions. In this paper, we exploit the physical characteristics of MEMS-based storage devices to develop a data placement scheme for relational data that enables retrieval in both row-wise and column-wise manner. We demonstrate that this data layout not only improves I/O utilization, but results in better cache performance. VLDB Buffering Accesses to Memory-Resident Index Structures. Jingren Zhou,Kenneth A. Ross 2003 "Recent studies have shown that cache-conscious indexes outperform conventional main memory indexes. Cache-conscious indexes focus on better utilization of each cache line for improving search performance of a single lookup. None has exploited cache spatial and temporal locality between consecutive lookups. We show that conventional indexes, even ""cache-conscious"" ones, suffer from significant cache thrashing between accesses. Such thrashing can impact the performance of applications such as stream processing and query operations such as index-nested-loops join. We propose techniques to buffer accesses to memory-resident tree-structured indexes to avoid cache thrashing. 
We study several alternative designs of the buffering technique, including whether to use fixed-size or variable-sized buffers, whether to buffer at each tree level or only at some of the levels, how to support bulk access while there are concurrent updates happening to the index, and how to preserve the order of the incoming lookups in the output results. Our methods improve cache performance for both cache-conscious and conventional index structures. Our experiments show that buffering techniques enable a probe throughput that is two to three times higher than traditional methods." VLDB Data Bubbles for Non-Vector Data: Speeding-up Hierarchical Clustering in Arbitrary Metric Spaces. Jianjun Zhou,Jörg Sander 2003 To speed-up clustering algorithms, data summarization methods have been proposed, which first summarize the data set by computing suitable representative objects. Then, a clustering algorithm is applied to these representatives only, and a clustering structure for the whole data set is derived, based on the result for the representatives. Most previous methods are, however, limited in their application domain. They are in general based on sufficient statistics such as the linear sum of a set of points, which assumes that the data is from a vector space. On the other hand, in many important applications, the data is from a metric non-vector space, and only distances between objects can be exploited to construct effective data summarizations. In this paper, we develop a new data summarization method based only on distance information that can be applied directly to non-vector data. An extensive performance evaluation shows that our method is very effective in finding the hierarchical clustering structure of non-vector data using only a very small number of data summarizations, thus resulting in a large reduction of runtime while trading only very little clustering quality. VLDB Distributed Top-N Query Processing with Possibly Uncooperative Local Systems. Clement T. Yu,George Philip,Weiyi Meng 2003 We consider the problem of processing top-N queries in a distributed environment with possibly uncooperative local database systems. For a given top-N query, the problem is to find the N tuples that satisfy the query the best but not necessarily completely in an efficient manner. Top-N queries are gaining popularity in relational databases and are expected to be very useful for e-commerce applications. Many companies provide the same type of goods and services to the public on the Web, and relational databases may be employed to manage the data. It is not feasible for a user to query a large number of databases. It is therefore desirable to provide a facility where a user query is accepted at some site, suitable tuples from appropriate sites are retrieved and the results are merged and then presented to the user. In this paper, we present a method for constructing the desired facility. Our method consists of two steps. The first step determines which databases are likely to contain the desired tuples for a given query so that the databases can be ranked based on their desirability with respect to the query. Four different techniques are introduced for this step with one requiring no cooperation from local systems. The second step determines how the ranked databases should be searched and what tuples from the searched databases should be returned. A new algorithm is proposed for this purpose. 
Experimental results are presented to compare different methods and very promising results are obtained using the method that requires no cooperation from local databases. SIGMOD Record Peer-to-peer research at Stanford. Mayank Bawa,Brian F. Cooper,Arturo Crespo,Neil Daswani,Prasanna Ganesan,Hector Garcia-Molina,Sepandar D. Kamvar,Sergio Marti,Mario T. Schlosser,Qi Sun,Patrick Vinograd,Beverly Yang 2003 Peer-to-peer research at Stanford. SIGMOD Record Call for Book Reviews. Karl Aberer 2003 Call for Book Reviews. SIGMOD Record Report on the First International Conference on Ontologies, Databases and Applications of Semantics. Karl Aberer 2003 Report on the First International Conference on Ontologies, Databases and Applications of Semantics. SIGMOD Record Book review column. Karl Aberer 2003 Book review column. SIGMOD Record "Guest editor's introduction." Karl Aberer 2003 "Guest editor's introduction." SIGMOD Record Book review column. Karl Aberer 2003 Book review column. SIGMOD Record P-Grid: a self-organizing structured P2P system. Karl Aberer,Philippe Cudré-Mauroux,Anwitaman Datta,Zoran Despotovic,Manfred Hauswirth,Magdalena Punceva,Roman Schmidt 2003 P-Grid: a self-organizing structured P2P system. SIGMOD Record An environmental sensor network to determine drinking water quality and security. Anastassia Ailamaki,Christos Faloutsos,Paul S. Fischbeck,Mitchell J. Small,Jeanne M. VanBriesen 2003 Finding patterns in large, real, spatio/temporal data continues to attract high interest (e.g., sales of products over space and time, patterns in mobile phone users; sensor networks collecting operational data from automobiles, or even from humans with wearable computers). In this paper, we describe an interdisciplinary research effort to couple knowledge discovery in large environmental databases with biological and chemical sensor networks, in order to revolutionize drinking water quality and security decision making. We describe a distribution and operation protocol for the placement and utilization of in situ environmental sensors by combining (1) new algorithms for spatialtemporal data mining, (2) new methods to model water quality and security dynamics, and (3) a sophisticated decision-analysis framework. The project was recently funded by NSF and represents application of these research areas to the critical current issue of ensuring safe and secure drinking water to the population of the United States. SIGMOD Record A Database Approach to Quality of Service Specification in Video Databases. Elisa Bertino,Ahmed K. Elmagarmid,Mohand-Said Hacid 2003 Quality of Service (QoS) is defined as a set of perceivable attributes expressed in a user-friendly language with parameters that may be subjective or objective. Objective parameters are those related to a particular service and are measurable and verifiable. Subjective parameters are those based on the opinions of the end-users. We believe that quality of service should become an integral part of multimedia database systems and users should be able to query by requiring a quality of service from the system. The specification and enforcement of QoS presents an interesting challenge in multimedia systems development. A deal of effort has been done on QoS specification and control at the system and the network levels, but less work has been done at the application/user level. In this paper, we propose a language, in the style of constraint database languages, for formal specification of QoS constraints. 
The satisfaction by the system of the user quality requirements can be viewed as a constraint satisfaction problem. We believe this paper represents a first step towards the development of a database framework for quality of service management in video databases. The contribution of this paper lies in providing a logical framework for specifying and enforcing quality of service in video databases. To our knowledge, this work is the first from a database perspective on quality of service management. SIGMOD Record Time management for new faculty. Anastassia Ailamaki,Johannes Gehrke 2003 In this article, we describe techniques for time management for new faculty members, covering a wide range of topics ranging from advice on scheduling meetings, email, to writing grant proposals and teaching. SIGMOD Record Exposing undergraduate students to database system internals. Anastassia Ailamaki,Joseph M. Hellerstein 2003 In Spring 2003, Joe Hellerstein at Berkeley and Natassa Ailamaki at CMU collaborated in designing and running parallel editions of an undergraduate database course that exposed students to developing code in the core of a ful-function database system. As part of this exercise, our course teams developed new programming projects based on the PostgreSQL open-source DBMS. This report describes our experience with this effort. SIGMOD Record Review of The data warehouse toolkit: the complete guide to dimensional modeling (2nd edition) by Ralph Kimball, Margy Ross. John Wiley & Sons, Inc. 2002. Alexander A. Anisimov 2003 Review of The data warehouse toolkit: the complete guide to dimensional modeling (2nd edition) by Ralph Kimball, Margy Ross. John Wiley & Sons, Inc. 2002. SIGMOD Record The hyperion project: from data integration to data coordination. Marcelo Arenas,Vasiliki Kantere,Anastasios Kementsietsidis,Iluju Kiringa,Renée J. Miller,John Mylopoulos 2003 We present an architecture and a set of challenges for peer database management systems. These systems team up to build a network of nodes (peers) that coordinate at run time most of the typical DBMS tasks such as the querying, updating, and sharing of data. Such a network works in a way similar to conventional multidatabases. Conventional multidatabase systems are founded on key concepts such as those of a global schema, central administrative authority, data integration, global access to multiple databases, permanent participation of databases, etc. Instead, our proposal assumes total absence of any central authority or control, no global schema, transient participation of peer databases, and constantly evolving coordination rules among databases. In this work, we describe the status of the Hyperion project, present our current solutions, and outline remaining research issues. SIGMOD Record Bluetooth-based sensor networks. Philippe Bonnet,Allan Beaufour,Mads Bondo Dydensborg,Martin Leopold 2003 It is neither desirable nor possible to abstract sensor network software from the characteristics of the underlying hardware components. In particular the radio has a major impact on higher level software. In this paper, we review the lessons we learnt using Bluetooth radios in the context of sensor networks. These lessons are relevant for (a) application designers choosing the best radio given a set of requirements and for (b) researchers in the data management community who need to formulate assumptions about underlying sensor networks. SIGMOD Record Database Research at UT Arlington. Sharma Chakravarthy,Y. 
Alp Aslandogan,Ramez Elmasri,Leonidas Fegaras,Jung-Hwan Oh 2003 Database Research at UT Arlington. SIGMOD Record 2003 SIGMOD Innovations Award Speech. Donald D. Chamberlin 2003 2003 SIGMOD Innovations Award Speech. SIGMOD Record A Graphical Query Language for Mobile Information Systems. Ya-Hui Chang 2003 The advance of the mobile computing environment allows data to be accessed in any place at any time, but currently only simple and ad-hoc queries are supported. People are eager for mobile information systems with more functionality and powerful querying facilities. In this paper, a graphical query language called MoSQL is proposed to be the basis of general mobile information systems. It provides a uniform way for users to access alphanumerical data and to query current or future location information, based on an icon-based interface. The interface is particularly suitable for the mobile environment, since it is easily operated by clicking or dragging the mouse. An example and the underlying theoretical framework will be presented in this paper to demonstrate the functionality of MoSQL. SIGMOD Record Managing uncertainty in sensor database. Reynold Cheng,Sunil Prabhakar 2003 Sensors are often employed to monitor continuously changing entities like locations of moving objects and temperature. The sensor readings are reported to a centralized database system, and are subsequently used to answer queries. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), the database may not be able to keep track of the actual values of the entities, and use the old values instead. Queries that use these old values may produce incorrect answers. However, if the degree of uncertainty between the actual data value and the database value is limited, one can place more confidence in the answers to the queries. In this paper, we present a frame-work that represents uncertainty of sensor data. Depending on the amount of uncertainty information given to the application, different levels of imprecision are presented in a query answer. We examine the situations when answer imprecision can be represented qualitatively and quantitatively. We propose a new kind of probabilistic queries called Probabilistic Threshold Query, which requires answers to have probabilities larger than a certain threshold value. We also study techniques for evaluating queries under different details of uncertainty, and investigate the tradeoff between data uncertainty, answer accuracy and computation costs. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Mitch Cherniack,Gottfried Vossen 2003 Reminiscences on Influential Papers. SIGMOD Record A mapping mechanism to support bitmap index and other auxiliary structures on tables stored as primary B-trees. Eugene Inseok Chong,Jagannathan Srinivasan,Souripriya Das,Chuck Freiwald,Aravind Yalamanchi,Mahesh Jagannath,Anh-Tuan Tran,Ramkumar Krishnan,Richard Jiang 2003 Any auxiliary structure, such as a bitmap or a B+-tree index, that refers to rows of a table stored as a primary B+-tree (e.g., tables with clustered index in Microsoft SQL Server, or index-organized tables in Oracle) by their physical addresses would require updates due to inherent volatility of those addresses. 
To address this problem, we propose a mapping mechanism that 1) introduces a single mapping table, with each row holding one key value from the primary B+-tree, as an intermediate structure between the primary B+-tree and the associated auxiliary structures, and 2) augments the primary B+-tree structure to include in each row the physical address of the corresponding mapping table row. The mapping table row addresses can then be used in the auxiliary structures to indirectly refer to the primary B+-tree rows. The two key benefits are: 1) the mapping table shields the auxiliary structures from the volatility of the primary B+-tree row addresses, and 2) the method allows reuse of existing conventional table mechanisms for supporting auxiliary structures on primary B+-trees. The mapping mechanism is used for supporting bitmap indexes on index-organized tables in Oracle9i. The analytical and experimental studies show that the method is storage efficient, and (despite the mapping table overhead) provides performance benefits that are similar to those provided by bitmap indexes implemented on conventional tables. SIGMOD Record Towards a database body of knowledge: a study from Spain. Coral Calero,Mario Piattini,Francisco Ruiz 2003 "Databases, the center of today's information systems, are becoming more and more important judging by the huge volume of business they generate. In fact, database related material is included in a variety of curricula proposed by international organizations and prestigious universities. However, a systemized database body of knowledge (DBBOK), analogous to other works in Software Engineering (SWEBOK) or in Project Management (PMBOK) is needed. In this paper, we propose a first draft for this DBBOK based on degree programs from a variety of universities, the most relevant international curricula and the contents of the latest editions of principle books on databases." SIGMOD Record Reasoning on regular path queries. Diego Calvanese,Giuseppe De Giacomo,Maurizio Lenzerini,Moshe Y. Vardi 2003 Current information systems are required to deal with more complex data with respect to traditional relational data. The database community has already proposed abstractions for these kinds of data, in particular in terms of semistructured data models. A semistructured model conceives a database essentially as a finite directed labeled graph whose nodes represent objects, and whose edges represent relationships between objects. In the same way as conjunctive queries form the core of any query language for the relational model, regular path queries (RPQs) and their variants are considered the basic querying mechanisms for semistructured data.Besides the basic task of query answering, i.e., evaluating a query over a database, databases should support other reasoning services related to querying. One of the most important is query containment, i.e., verifying whether for all databases the answer to a query is a subset of the answer to a second query. Another important reasoning service that has received considerable attention in the recent years is view-based query processing, which amounts to processing queries based on a set of materialized views, rather than on the raw data in the database.The goal of this paper is to describe basic results and techniques concerning query containment and view based query processing for the class of two-way regular-path queries (which extend RPQs with the inverse operator). 
We will demonstrate that the basic services for reasoning about two way regular path queries are decidable, thus showing that the limited form of recursion expressible by these queries does not endanger the decidability of reasoning. Besides the specific results, our methods show the power of two-way automata in reasoning on complex queries. SIGMOD Record XML schema. Charles E. Campbell,Andrew Eisenberg,Jim Melton 2003 XML schema. SIGMOD Record Report on the 4th International Conference on Mobile Data Management. Panos K. Chrysanthis,Morris Sloman,Arkady B. Zaslavsky 2003 Report on the 4th International Conference on Mobile Data Management. SIGMOD Record Edgar F. Codd: a tribute and personal memoir. C. J. Date 2003 Edgar F. Codd: a tribute and personal memoir. SIGMOD Record ANSI SQL Hierarchical Processing Can Fully Integrate Native XML. Michael M. David 2003 Most SQL-based XML vendor support is through interoperation and not integration. One reason for this is that XML is inherently hierarchical and SQL is supposedly not. This paper demonstrates how ANSI SQL along with its relational Cartesian product model can naturally perform complete and flexible hierarchical query processing. With this ANSI SQL inherent hierarchical processing capability, native XML data can be fully and seamlessly integrated into SQL processing and operated on at a full hierarchical level. This paper will describe the basic stages involved in this hierarchical SQL processing: hierarchical data modeling, hierarchical working set creation, and hierarchical Cartesian product processing. These processes enable a complete relational, XML, and legacy data integration which maintains ANSI SQL compatibility even while performing the most complex multi-leg hierarchical processing, and includes the dynamic, direct, and controlled hierarchical joining of hierarchical structures. Also covered are ANSI SQL hierarchical support features: hierarchical SQL views, hierarchical data filtering, and hierarchical optimization. These make standard SQL a well rounded and complete hierarchical processor. With this full hierarchical level of processing established, it will be shown how the relational Cartesian product engine can be seamlessly replaced with a hierarchical engine, greatly increasing processing and memory utilization, and enabling advanced XML hierarchical data capabilities. SIGMOD Record The Cougar Project: a work-in-progress report. Alan J. Demers,Johannes Gehrke,Rajmohan Rajaraman,Agathoniki Trigoni,Yong Yao 2003 "We present an update on the status of the Cougar Sensor Database Project, in which we are investigating a database approach to sensor networks: Clients ""program"" the sensors through queries in a high-level declarative language (such as a variant of SQL). In this paper, we give an overview of our activities on energy-efficient data dissemination and query processing. Due to space constraints, we cannot present a full menu of results; instead, we decided to only whet the reader's appetite with some problems in energy-efficient routing and in-network aggregation and some thoughts on how to approach them." SIGMOD Record Research issues for data communication in mobile ad-hoc network database systems. Leslie D. Fife,Le Gruenwald 2003 Mobile Ad-hoc Networks (MANET) is an emerging area of research. Most current work is centered on routing issues. This paper discusses the issues associated with data communication with MANET database systems. 
While data push and data pull methods have been previously addressed in mobile networks, the proposed methods do not handle the unique requirements associated with MANET. Unlike traditional mobile networks, all nodes within the MANET are mobile and battery powered. Existing wireless algorithms and protocols are insufficient primarily because they do not consider the mobility and power requirements of both clients and servers. This paper will present some of the critical tasks facing this research. SIGMOD Record Review of Data on the Web: from relational to semistructured data and XML by Serge Abiteboul, Peter Buneman, and Dan Suciu. Morgan Kaufmann 1999. Fernando Berzal Galiano,Nicolás Marín 2003 Review of Data on the Web: from relational to semistructured data and XML by Serge Abiteboul, Peter Buneman, and Dan Suciu. Morgan Kaufmann 1999. SIGMOD Record Multimedia streaming in large-scale sensor networks with mobile swarms. Mario Gerla,Kaixin Xu 2003 "Sensor networking technologies have developed very rapidly in the last ten years. In many situations, high quality multimedia streams may be required for providing detailed information of the hot spots in a large scale network. With the limited capabilities of sensor node and sensor network, it is very difficult to support multimedia streams in current sensor network structure. In this paper, we propose to enhance the sensor network by deploying limited number of mobile ""swarms"". The swarm nodes have much higher capabilities than the sensor nodes in terms of both hardware functionalities and networking capabilities. The mobile swarms can be directed to the hot spots in the sensor network to provide detailed information of the intended area. With the help of mobile swarms, high quality of multimedia streams can be supported in the large scale sensor network without too much cost. The wireless backbone network for connecting different swarms and the routing schemes for supporting such a unified architecture is also discussed and verified via simulations." SIGMOD Record Toward network data independence. Joseph M. Hellerstein 2003 "A number of researchers have become interested in the design of global-scale networked systems and applications. Our thesis here is that the database community's principles and technologies have an important role to play in the design of these systems. The point of departure is at the roots of database research: we generalize Codd's notion of data independence to physical environments beyond storage systems. We note analogies between the development of database indexes and the new generation of structured peer-to-peer networks. We illustrate the emergence of data independence in networks by surveying a number of recent network facilities and applications, seen through a database lens. We present a sampling of database query processing techniques that can contribute in this arena, and discuss methods for adoption of these technologies." SIGMOD Record The sensor spectrum: technology, trends, and requirements. Joseph M. Hellerstein,Wei Hong,Samuel Madden 2003 Though physical sensing instruments have long been used in astronomy, biology, and civil engineering, the recent emergence of wireless sensor networks and RFID has spurred a renaissance in sensor interest in both academia and industry. In this paper, we examine the spectrum of sensing platforms, from billion dollar satellites to tiny RF tags, and discuss the technological differences between them. 
We show that battery powered sensor networks, with low-power multihop radios and low-cost processors, occupy a sweet spot in this spectrum that is rife with opportunity for novel database research. We briefly summarize some of our research work in this space and present a number of examples of interesting sensor network-related problems that the database community is uniquely equipped to address. SIGMOD Record Agent-Oriented software engineering report on the 4th AOSE workshop (AOSE 2003). Paolo Giorgini 2003 Agent-Orientation is emerging as a powerful new paradigm in computing. Concepts, methodologies and tools from the agents paradigm are one of the best candidates for the foundations of the next generation of mainstream software systems. The Agent-Oriented Software Engineering (AOSE) workshop is an international event that brings together researchers and groups active in the area of agent-based software development. Here we briefly report on the fourth edition of the AOSE workshop. SIGMOD Record Issues in data stream management. Lukasz Golab,M. Tamer Özsu 2003 Issues in data stream management. SIGMOD Record Fundamentals of data warehouses: 2nd revised and extended edition. Vernon Hoffner 2003 Fundamentals of data warehouses: 2nd revised and extended edition. SIGMOD Record XPath Processing in a Nutshell. Georg Gottlob,Christoph Koch,Reinhard Pichler 2003 We provide a concise yet complete formal definition of the semantics of XPath 1 and summarize efficient algorithms for processing queries in this language. Our presentation is intended both for the reader who is looking for a short but comprehensive formal account of XPath as well as the software developer in need of material that facilitates the rapid implementation of XPath engines. SIGMOD Record XPath processing in a nutshell. Georg Gottlob,Christoph Koch,Reinhard Pichler 2003 We provide a concise yet complete formal definition of the semantics of XPath 1 and summarize efficient algorithms for processing queries in this language. Our presentation is intended both for the reader who is looking for a short but comprehensive formal account of XPath as well as the software developer in need of material that facilitates the rapid implementation of XPath engines. SIGMOD Record Research in database engineering at the University of Namur. Jean-Luc Hainaut 2003 Research in database engineering at the University of Namur. SIGMOD Record Learning about data integration challenges from day one. Alon Y. Halevy 2003 I describe the format of the new version of an introductory database course that I taught at the University of Washington in Winter, 2003. The key idea underlying the course is to expose the students to some of the challenges that arise when working with and integrating data from multiple database systems and applications. SIGMOD Record "Treasurer's Message." Joachim Hammer 2003 "Treasurer's Message." SIGMOD Record In Memory of Gísli R. Hjaltason. Björn Þór Jónsson 2003 In Memory of Gísli R. Hjaltason. SIGMOD Record Closing the key loophole in MLS databases. Nenad Jukic,Svetlozar Nestorov,Susan V. Vrbsky 2003 There has been an abundance of research within the last couple of decades in the area of multilevel secure (MLS) databases. Recent work in this field deals with the processing of multilevel transactions, expanding the logic of MLS query languages, and utilizing MLS principles within the realm of E-Business.
However, there is a basic flaw within the MLS logic, which obstructs the handling of clearance-invariant aggregate queries and physical-entity related queries where some of the information in the database may be gleaned from the outside world. This flaw stands in the way of a more pervasive adoption of MLS models by the developers of practical applications. This paper clearly identifies the cause of this impediment -- the cover story dependence on the value of a user-defined key -- and proposes a practical solution. SIGMOD Record Information integration on the Web: a view from AI and databases (report on IIWeb-03). Subbarao Kambhampati,Craig A. Knoblock 2003 Information integration on the Web: a view from AI and databases (report on IIWeb-03). SIGMOD Record Energy and rate based MAC protocol for wireless sensor networks. Rajgopal Kannan,Ramaraju Kalidindi,S. Sitharama Iyengar,Vijay Kumar 2003 Sensor networks are typically unattended because of their deployment in hazardous, hostile or remote environments. This makes the problem of conserving energy at individual sensor nodes challenging. S-MAC and PAMAS are two MAC protocols which periodically put nodes (selected at random) to sleep in order to achieve energy savings. Unlike these protocols, we propose an approach in which node duty cycles (i.e., sleep and wake schedules) are based on their criticality. A distributed algorithm is used to find sets of winners and losers, who are then assigned appropriate slots in our TDMA based MAC protocol. We introduce the concept of energy-criticality of a sensor node as a function of energies and traffic rates. Our protocol makes more critical nodes sleep longer, thereby balancing the energy consumption. Simulation results show that the performance of the protocol with increasing traffic load is better than that of existing protocols, thereby illustrating the energy-balancing nature of the approach. SIGMOD Record "Report on the 5th international workshop on the design and management of data warehouses (DMDW'03)." Hans-Joachim Lenz,Panos Vassiliadis,Manfred A. Jeusfeld,Martin Staudt 2003 "Report on the 5th international workshop on the design and management of data warehouses (DMDW'03)." SIGMOD Record Selective information dissemination in P2P networks: problems and solutions. Manolis Koubarakis,Christos Tryfonopoulos,Stratos Idreos,Yannis Drougas 2003 We study the problem of selective dissemination of information in P2P networks. We present our work on data models and languages for textual information dissemination and discuss a relevant P2P architecture that motivates our efforts. We also survey our results on the computational complexity of three related algorithmic problems (query satisfiability, entailment and filtering) and present efficient algorithms for the most crucial of these problems (filtering). Finally, we discuss the features of P2P-DIET, a super-peer system we have implemented at the Technical University of Crete, that realizes our vision and is able to support both ad-hoc querying and selective information dissemination scenarios in a P2P framework. SIGMOD Record "Editor's Notes." Ling Liu 2003 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2003 "Editor's Notes." SIGMOD Record Sensor: the atomic computing particle. Vijay Kumar 2003 We visualize the world as a fully connected information space where each object communicates with all other objects without any temporal and geographical constraints.
We can model this fully connected space using fine granularity processing which can be implemented using sensor technology. We regard sensors as atomic computing particles which can be deployed to geographical locations for capturing and processing data of their surroundings. This report introduces a number of excellent research articles which present unique problems and their success in finding efficient solutions for them. It also peeks into the future of the ever-changing information processing discipline. SIGMOD Record "Editor's Notes." Ling Liu 2003 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2003 "Editor's Notes." SIGMOD Record Report on the First International Workshop on Efficient Web-based Information Systems. Zoé Lacroix,Omar Boucelma 2003 Report on the First International Workshop on Efficient Web-based Information Systems. SIGMOD Record Standards for databases on the grid. Susan Malaika,Andrew Eisenberg,Jim Melton 2003 Standards for databases on the grid. SIGMOD Record Understanding the semantics of sensor data. Murali Mani 2003 "Our system architecture to manage sensor data is described. Our data mining applications require past history of the sensor data. Therefore, unlike most present systems that focus on streaming data, and cache a small window of historic data, we store the entire historic data. Several interesting problems arise in these scenarios. We study two of them: (a) Given that a sensor can send data corresponding to its current configuration at any particular instant, how do we define the data that should be stored in the database? (b) Sensors try to minimize the amount of data transmitted. Also there could be data loss in the network. So the data stored will have lots of ""holes"". In this case, how can an application make sense of the stored data? In this paper, we describe our approach to solve these problems that enables an application to recreate the environment that generated the data as precisely as possible." SIGMOD Record Analysis of existing databases at the logical level: the DBA companion project. Fabien De Marchi,Stéphane Lopes,Jean-Marc Petit,Farouk Toumani 2003 Whereas physical database tuning has received a lot of attention over the last decade, logical database tuning seems to be under-studied. We have developed a project called DBA Companion devoted to the understanding of logical database constraints from which logical database tuning can be achieved. In this setting, two main data mining issues need to be addressed: the first one is the design of efficient algorithms for functional dependencies and inclusion dependencies inference and the second one is about the interestingness of the discovered knowledge. In this paper, we point out some relationships between database analysis and data mining. In this setting, we sketch the underlying themes of our approach. Some database applications that could benefit from our project are also described, including logical database tuning. SIGMOD Record Report on FQAS 2002: fifth international conference on flexible query answering systems. Amihai Motro,Troels Andreasen 2003 Report on FQAS 2002: fifth international conference on flexible query answering systems. SIGMOD Record An approach to confidence based page ranking for user oriented Web search. Debajyoti Mukhopadhyay,Debasis Giri,Sanasam Ranbir Singh 2003 An approach to confidence based page ranking for user oriented Web search. SIGMOD Record Peer-to-peer: harnessing the power of disruptive technologies. Mario A.
Nascimento 2003 Peer-to-peer: harnessing the power of disruptive technologies. SIGMOD Record "Analysis of SIGMOD's co-authorship graph." Mario A. Nascimento,Jörg Sander,Jeffrey Pound 2003 "In this paper we investigate the co-authorship graph obtained from all papers published at SIGMOD between 1975 and 2002. We find some interesting facts, for instance, the identity of the authors who, on average, are ""closest"" to all other authors at a given time. We also show that SIGMOD's co-authorship graph is yet another example of a small world---a graph topology which has received a lot of attention recently. A companion web site for this paper can be found at http://db.cs.ualberta.ca/coauthorship." SIGMOD Record In-context peer-to-peer information filtering on the Web. Aris M. Ouksel 2003 In-context peer-to-peer information filtering on the Web. SIGMOD Record Design issues and challenges for RDF- and schema-based peer-to-peer systems. Wolfgang Nejdl,Wolf Siberski,Michael Sintek 2003 Databases have employed a schema-based approach to store and retrieve structured data for decades. For peer-to-peer (P2P) networks, similar approaches are just beginning to emerge. While quite a few database techniques can be re-used in this new context, a P2P data management infrastructure poses additional challenges which have to be solved before schema-based P2P networks become as common as schema-based databases. We will describe some of these challenges and discuss approaches to solve them. Our discussion will be based on the design decisions we have employed in our Edutella infrastructure, a schema-based P2P network based on RDF and RDF schemas, and will also point out additional work addressing the issues discussed. SIGMOD Record Relational data sharing in peer-based data management systems. Beng Chin Ooi,Yanfeng Shu,Kian-Lee Tan 2003 Data sharing in current P2P systems is very much restricted to file-system-like capabilities. In this paper, we present the strategies that we have adopted in our BestPeer project to support more fine-grained data sharing, especially, relational data sharing, in a P2P context. First, we look at some of the issues in designing a peer-based data management system, and discuss some possible solutions to address these issues. Second, we present the design of our first prototype system, PeerDB, and report our experience with it. Finally, we discuss our current extensions to PeerDB to support keyword-based queries. SIGMOD Record "Chair's Message." M. Tamer Özsu 2003 "Chair's Message." SIGMOD Record "Chair's Message." M. Tamer Özsu 2003 "Chair's Message." SIGMOD Record Distributed deviation detection in sensor networks. Themistoklis Palpanas,Dimitris Papadopoulos,Vana Kalogeraki,Dimitrios Gunopulos 2003 Sensor networks have recently attracted much attention, because of their potential applications in a number of different settings. The sensors can be deployed in large numbers in wide geographical areas, and can be used to monitor physical phenomena, or to detect certain events.An interesting problem which has not been adequately addressed so far is that of distributed online deviation detection in streaming data. The identification of deviating values provides an efficient way to focus on the interesting events in the sensor network.In this work, we propose a technique for online deviation detection in streaming data. We discuss how these techniques can operate efficiently in the distributed environment of a sensor network, and discuss the tradeoffs that arise in this setting. 
Our techniques process as much of the data as possible in a decentralized fashion, so as to avoid unnecessary communication and computational effort. SIGMOD Record Performing Jobs without Decompression in a Compressed Database System. "S. J. O'Connell,N. Winterbottom" 2003 There has been much work on compressing database indexes, but less on compressing the data itself. We examine the performance gains to be made by compression outside the index. A novel compression algorithm is reported, which enables the processing of queries without decompressing data needed to perform join operations in a database built on a triple store. The results of modelling the performance of the database with and without compression are given and compared with other recent work in this area. It is found that for some applications, gains in performance of over 50% are achievable, and in OLTP-like situations, there are also gains to be made. SIGMOD Record DBGlobe: a service-oriented P2P system for global computing. Evaggelia Pitoura,Serge Abiteboul,Dieter Pfoser,George Samaras,Michalis Vazirgiannis 2003 The challenge of peer-to-peer computing goes beyond simple file sharing. In the DBGlobe project, we view the multitude of peers carrying data and services as a superdatabase. Our goal is to develop a data management system for modeling, indexing and querying data hosted by such massively distributed, autonomous and possibly mobile peers. We employ a service-oriented approach, in that data are encapsulated in services. Direct querying of data is also supported by an XML-based query language. In this paper, we present our research results along the following topics: (a) infrastructure support, including mobile peers and the creation of context-dependent communities, (b) metadata management for services and peers, including location-dependent data, (c) filters for efficiently routing path queries on hierarchical data, and (d) querying using the AXML language that incorporates service calls inside XML documents. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Minos N. Garofalakis,Jeffrey F. Naughton 2003 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Johannes Gehrke,Jun Rao 2003 Reminiscences on Influential Papers. SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Frank Neven,Beng Chin Ooi 2003 Reminiscences on Influential Papers. SIGMOD Record Report on the 17th Brazilian Symposium on Database Systems (SBBD 2002). Ana Carolina Salgado,Nina Edelweiss 2003 Report on the 17th Brazilian Symposium on Database Systems (SBBD 2002). SIGMOD Record Information rules. Dale A. Stirling 2003 Information rules. SIGMOD Record ACM TODS in this Internet Age. Richard T. Snodgrass 2003 ACM TODS in this Internet Age. SIGMOD Record TODS Reviewers. Richard T. Snodgrass 2003 TODS Reviewers. SIGMOD Record Journal relevance. Richard T. Snodgrass 2003 Journal relevance. SIGMOD Record Developments at ACM TODS. Richard T. Snodgrass 2003 "The March 2005 issue of TODS has eight papers invited from the SIGMOD and PODS'2003 conferences. These papers are significantly extended versions of the conference papers, allowing the authors to refine and elaborate without the strictures of a twelve-page limit." SIGMOD Record Power efficient data gathering and aggregation in wireless sensor networks.
Hüseyin Özgür Tan,Ibrahim Korpeoglu 2003 Recent developments in processor, memory and radio technology have enabled wireless sensor networks which are deployed to collect useful information from an area of interest. The sensed data must be gathered and transmitted to a base station where it is further processed for end-user queries. Since the network consists of low-cost nodes with limited battery power, power efficient methods must be employed for data gathering and aggregation in order to achieve long network lifetimes. In an environment where in a round of communication each of the sensor nodes has data to send to a base station, it is important to minimize the total energy consumed by the system in a round so that the system lifetime is maximized. With the use of data fusion and aggregation techniques, while minimizing the total energy per round, if power consumption per node can be balanced as well, a near optimal data gathering and routing scheme can be achieved in terms of network lifetime. So far, besides the conventional protocol of direct transmission, two elegant protocols called LEACH and PEGASIS have been proposed to maximize the lifetime of a sensor network. In this paper, we propose two new algorithms under the name PEDAP (Power Efficient Data gathering and Aggregation Protocol), which are near optimal minimum spanning tree based routing schemes, where one of them is the power-aware version of the other. Our simulation results show that our algorithms perform well both in systems where the base station is far away from the field and in systems where it is in the center of the field. PEDAP achieves between 4x and 20x improvement in network lifetime compared with LEACH, and about three times improvement compared with PEGASIS. SIGMOD Record The Piazza peer data management project. Igor Tatarinov,Zachary G. Ives,Jayant Madhavan,Alon Y. Halevy,Dan Suciu,Nilesh N. Dalvi,Xin Dong,Yana Kadiyska,Gerome Miklau,Peter Mork 2003 "A major problem in today's information-driven world is that sharing heterogeneous, semantically rich data is incredibly difficult. Piazza is a peer data management system that enables sharing heterogeneous data in a distributed and scalable way. Piazza assumes the participants to be interested in sharing data, and willing to define pairwise mappings between their schemas. Then, users formulate queries over their preferred schema, and a query answering system expands recursively any mappings relevant to the query, retrieving data from other peers. In this paper, we provide a brief overview of the Piazza project including our work on developing mapping languages and query reformulation algorithms, assisting the users in defining mappings, indexing, and enforcing access control over shared data." SIGMOD Record A Web odyssey: from Codd to XML. Victor Vianu 2003 A Web odyssey: from Codd to XML. SIGMOD Record Review of Web caching and replication by Michael Rabinovich and Oliver Spatscheck. Addison Wesley 2002. Qiang Wang,Brian D. Davison 2003 Review of Web caching and replication by Michael Rabinovich and Oliver Spatscheck. Addison Wesley 2002. SIGMOD Record Report on the 10th Conference on Database Systems for Business, Technology, and the Web (BTW 2003). Gerhard Weikum,Harald Schöning,Erhard Rahm 2003 Report on the 10th Conference on Database Systems for Business, Technology, and the Web (BTW 2003). SIGMOD Record A Multi-paradigm Querying Approach for a Generic Multimedia Database Management System.
Ji-Rong Wen,Qing Li,Wei-Ying Ma,HongJiang Zhang 2003 "To truly meet the requirements of multimedia database (MMDB) management, an integrated framework for modeling, managing and retrieving various kinds of media data in a uniform way is necessary. MediaLand is an experimental MMDB platform being developed at Microsoft Research Asia for users with different levels of experiences and expertise to manage and search multimedia repositories easily, efficiently, and cooperatively. Key features of MediaLand include a uniform data model for describing all kinds of media objects and their relationships, and a 4-tier architecture based on this data model. In this paper, a multi-paradigm querying approach of MediaLand is presented, in which multimedia queries are processed based on a seamless integration of various existing search approaches. In doing so, MediaLand also offers the feature of ""media independence"" which is analogous to the notion of ""data independence"" from the classic ANSI SPARC standard. By incorporating a rich set of facilities and techniques, MediaLand lays down a good foundation for addressing further research issues, such as multimedia query rewriting, optimization, and presentation." SIGMOD Record Review of Spatial databases with application to GIS by Philippe Rigaux, Michel Scholl, and Agnes Voisard. Morgan Kaufmann 2002. Nancy Wiegand 2003 Review of Spatial databases with application to GIS by Philippe Rigaux, Michel Scholl, and Agnes Voisard. Morgan Kaufmann 2002. SIGMOD Record Interview with Jim Gray. Marianne Winslett 2003 Interview with Jim Gray. SIGMOD Record Interview with Michael Stonebraker. Marianne Winslett 2003 Interview with Michael Stonebraker. SIGMOD Record Interview with Rakesh Agrawal. Marianne Winslett 2003 Interview with Rakesh Agrawal. SIGMOD Record Interview with Pat Selinger. Marianne Winslett 2003 Interview with Pat Selinger. SIGMOD Record Two-Body Job Searches. Marianne Winslett,Xiaosong Ma,Ting Yu 2003 Two-Body Job Searches. SIGMOD Record Modelling temporal thematic map contents. "Alberto d'Onofrio,Elaheh Pourabbas" 2003 Modelling temporal thematic map contents. ICDE Load Shedding for Aggregation Queries over Data Streams. Brian Babcock,Mayur Datar,Rajeev Motwani 2004 "Systems for processing continuous monitoring queriesover data streams must be adaptive because data streamsare often bursty and data characteristics may vary overtime. In this paper, we focus on one particular type ofadaptivity: the ability to gracefully degrade performancevia ""load shedding"" (dropping unprocessed tuples to reducesystem load) when the demands placed on the systemcannot be met in full given available resources. Focusingon aggregation queries, we present algorithms that determineat what points in a query plan should load sheddingbe performed and what amount of load should be shed ateach point in order to minimize the degree of inaccuracyintroduced into query answers. We report the results of experimentsthat validate our analytical conclusions." ICDE Web-Services Architecture for Efficient XML Data Exchange. Sihem Amer-Yahia,Yannis Kotidis 2004 Web-Services Architecture for Efficient XML Data Exchange. ICDE EShopMonitor: A Web Content Monitoring Tool. Neeraj Agrawal,Rema Ananthanarayanan,Rahul Gupta,Sachindra Joshi,Raghu Krishnapuram,Sumit Negi 2004 Data presented on commerce sites runs into thousandsof pages, and is typically delivered from multiple back-endsources. 
This makes it difficult to identify incorrect, anomalous,or interesting data such as $9.99 air fares, missinglinks, drastic changes in prices and addition of new productsor promotions. In this paper, we describe a systemthat monitors Websites automatically and generates varioustypes of reports so that the content of the site can be monitoredand the quality maintained. The solution designedand implemented by us consists of a site crawler that crawlsdynamic pages, an information miner that learns to extractuseful information from the pages based on examples providedby the user, and a reporter that can be configured bythe user to answer specific queries. The tool can also beused for identifying price trends and new products or promotionsat competitor sites. A pilot run of the tool has beensuccessfully completed at the ibm.com site. ICDE Efficient Incremental Validation of XML Documents. Denilson Barbosa,Alberto O. Mendelzon,Leonid Libkin,Laurent Mignet,Marcelo Arenas 2004 We discuss incremental validation of XML documentswith respect to DTDs and XML Schema definitions. We considerinsertions and deletions of subtrees, as opposed to leafnodes only, and we also consider the validation of ID andIDREF attributes. For arbitrary schemas, we give a worst-casen log n time and linear space algorithm, and showthat it often is far superior to revalidation from scratch. Wepresent two classes of schemas, which capture most real-lifeDTDs, and show that they admit a logarithmic timeincremental validation algorithm that, in many cases, requiresonly constant auxiliary space. We then discuss animplementation of these algorithms that is independent of,and can be customized for different storage mechanismsfor XML. Finally, we present extensive experimental resultsshowing that our approach is highly efficient and scalable. ICDE Improving Logging and Recovery Performance in Phoenix/App. Roger S. Barga,Shimin Chen,David B. Lomet 2004 "Phoenix/App supports software components whosestates are made persistent across a system crash via redorecovery, replaying logged interactions. Our initialprototype force logged all request/reply events resultingfrom inter-component method calls and returns. Thispaper describes an enhanced prototype that implements:(i) log optimizations to improve normal executionperformance; and (ii) checkpointing to improve recoveryperformance. Logging is reduced in two ways: (1) weonly log information required to remove non-determinism,and we only force the log when an event""commits"" the state of the component to other parts of thesystem; (2) we introduce new component types thatprovide our enhanced system with more information,enabling further reduction in logging. To improverecovery performance, we save the values of the fields ofa component to the log in an application ""checkpoint"".We describe the system elements that we exploit for theseoptimizations, and characterize the performance gainsthat result." ICDE Engineering a Fast Online Persistent Suffix Tree Construction. Srikanta J. Bedathur,Jayant R. Haritsa 2004 Online persistent suffix tree construction has been consideredimpractical due to its excessive I/O costs. However,these prior studies have not taken into account the effects ofthe buffer management policy and the internal node structureof the suffix tree on I/O behavior of construction andsubsequent retrievals over the tree. In this paper, we studythese two issues in detail in the context of large genomicDNA and Protein sequences. 
In particular, we make the followingcontributions: (i) a novel, low-overhead bufferingpolicy called TOP-Q which improves the on-disk behaviorof suffix tree construction and subsequent retrievals, and (ii)empirical evidence that the space efficient linked-list representationof suffix tree nodes provides significantly inferiorperformance when compared to the array representation.These results demonstrate that a careful choice ofimplementation strategies can make online persistent suffixtree construction considerably more scalable - in termsof length of sequences indexed with a fixed memory budget,than currently perceived. ICDE Peering and Querying e-Catalog Communities. Boualem Benatallah,Mohand-Said Hacid,Hye-Young Paik,Christophe Rey,Farouk Toumani 2004 Peering and Querying e-Catalog Communities. ICDE VirGIS: Mediation for Geographical Information Systems. Omar Boucelma,Mehdi Essid,Zoé Lacroix,Julien Vinel,Jean-Yves Garinet,Abdelkader Bétari 2004 VirGIS: Mediation for Geographical Information Systems. ICDE Meta Data Management. Philip A. Bernstein,Sergey Melnik 2004 Meta Data Management. ICDE BOSS: Browsing OPTICS-Plots for Similarity Search. Stefan Brecheisen,Hans-Peter Kriegel,Peer Kröger,Martin Pfeifle,Maximilian Viermetz,Marco Pötke 2004 BOSS: Browsing OPTICS-Plots for Similarity Search. ICDE XJoin Index: Indexing XML Data for Efficient Handling of Branching Path Expressions. Elisa Bertino,Barbara Catania,Wen Qiang Wang 2004 XJoin Index: Indexing XML Data for Efficient Handling of Branching Path Expressions. ICDE FLYINGDOC: An Architecture for Distributed, User-friendly, and Personalized Information Systems. Ilvio Bruder,Andre Zeitz,Holger Meyer,Birger Hänsel,Andreas Heuer 2004 FLYINGDOC: An Architecture for Distributed, User-friendly, and Personalized Information Systems. ICDE On Local Pruning of Association Rules Using Directed Hypergraphs. Sanjay Chawla,Joseph G. Davis,Gaurav Pandey 2004 In this paper we propose an adaptive local pruningmethod for association rules. Our method exploits the exactmapping between a certain class of association rules,namely those whose consequents are singletons and backwarddirected hypergraphs (B-Graphs). The hypergraphwhich represents the association rules is called an AssociationRules Network(ARN). Here we present a simple exampleof an ARN. In the full paper we prove several propertiesof the ARN and apply the results of our approach totwo popular data sets. ICDE Stream Query Processing for Healthcare Bio-sensor Applications. Chung-Min Chen,Hira Agrawal,Munir Cochinwala,David Rosenbluth 2004 The need of a data stream management system(DSMS), with the capability of querying continuous datastreams, has been well understood by the databaseresearch community and witnessed by a proliferation ofrelated publications in this area (see, e.g., for a partialsurvey). Examples of applications abound in manydomains: from environmental and military applicationsconsuming streams of sensor data, to telecommunicationsand data network assurance systems analyzing real-timenetwork traffic data.This article provides an overview on a DSMSprototype called T2. T2 inherits some of the concepts ofan early prototype, Tribeca, developed also atTelcordia, but with complete new design andimplementation in Java with an SQL-like query language. ICDE Improving Hash Join Performance through Prefetching. Shimin Chen,Anastassia Ailamaki,Phillip B. Gibbons,Todd C. Mowry 2004 Hash join algorithms suffer from extensive CPU cachestalls. 
This paper shows that the standard hash join algorithm for disk-oriented databases (i.e. GRACE) spends over73% of its user time stalled on CPU cache misses, and explores the use of prefetching to improve its cache performance. Applying prefetching to hash joins is complicatedby the data dependencies, multiple code paths, and inherent randomness of hashing. We present two techniques, group prefetching and software-pipelined prefetching, thatovercome these complications.These schemes achieve 2.0- 2.9X speedups for the join phase and 1.4-2.6X speedups forthe partition phase over GRACE and simple prefetching approaches. Compared with previous cache-aware approaches(i.e. cache partitioning), the schemes are at least 50% fasteron large relations and do not require exclusive use of theCPU cache to be effective. ICDE Detection and Correction of Conflicting Source Updates for View Maintenance. Songting Chen,Jun Chen,Xin Zhang,Elke A. Rundensteiner 2004 Data integration over multiple heterogeneous datasources has become increasingly important for modern applications. The integrated data is usually stored in materialized views for high availability and better performance. Such views must be maintained after the datasources change. In a loosely-coupled and dynamic environment, such as the Data Grid, the sources may autonomously change not only their data but also their schema, query capabilities or semantics, which may consequently cause theon-going view maintenance fail. In this paper, first, we analyze the maintenance errors and classify them into different classes of dependencies. We then propose severaldependency detection and correction algorithms to handle these new classes of concurrency. Our techniques arenot tied to specific maintenance algorithms nor to a particular data model. To our knowledge, this is the first completesolution to the view maintenance concurrency problems for both data and schema changes. We have implemented the proposed solutions and experimentally evaluated the impact of anomalies on maintenance performanceand trade-offs between different dependency detection algorithms. ICDE BEA Liquid Data for WebLogic: XML-Based Enterprise Information Integration. Michael J. Carey 2004 This presentation provides a technical overview ofBEA Liquid Data for WebLogic, a relatively newproduct from BEA Systems that provides enterpriseinformation integration capabilities to enterpriseapplications that are built and deployed using the BEAWebLogic Platform.Liquid Data takes an XML-centricapproach to tackling the long-standingproblem of integrating data from disparate datasources and making that information easily accessibleto applications.In particular, Liquid Data uses theforthcoming XQuery language standard as the basisfor defining integrated views of enterprise data andquerying over those views.We provide a briefoverview of the Liquid Data product architecture andthen discuss some of the query processing technologythat lies at the heart of the product. ICDE Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web. James Caverlee,Ling Liu,David Buttler 2004 In this paper, we introduce the concept of a QA-Pageletto refer to the content region in a dynamic page that containsquery matches. We present THOR, a scalable andefficient mining system for discovering and extracting QA-Pageletsfrom the Deep Web. A unique feature of THOR isits two-phase extraction framework. 
In the first phase, pagesfrom a deep web site are grouped into distinct clusters ofstructurally-similar pages. In the second phase, pages fromeach page cluster are examined through a subtree filteringalgorithm that exploits the structural and content similarityat subtree level to identify the QA-Pagelets. ICDE Storing XML (with XSD) in SQL Databases: Interplay of Logical and Physical Designs. Surajit Chaudhuri,Zhiyuan Chen,Kyuseok Shim,Yuqing Wu 2004 "Much of business XML data has accompanying XSD specifications. In many scenarios, ""shredding¿ such XML data into a relational storage is a popular paradigm. Optimizing evaluation of XPath queries over such XML data requires paying careful attention to both the logical and physical designs of the relational database where XML data is shredded. None of the existing solutions has taken into account physical design of the generated relational database. In this paper, we study the interplay of logical and physical design and conclude that 1) solving them independently leads to suboptimal performance and 2) there is substantial overlap between logical and physical designs: some well-known logical design transformations generate the same mappings as physical design. Furthermore, existing search algorithms are inefficient to search the extremely large space of logical and physical design combinations. We propose a search algorithm that carefully avoids searching duplicated mappings and utilizes the workload information to further prune the search space. Experimental results confirm the effectiveness of our approach." ICDE Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem. Surajit Chaudhuri,Venkatesh Ganti,Luis Gravano 2004 Queries with (equality or LIKE) selection predicatesover string attributes are widely used in relationaldatabases. However, state-of-the-art techniques forestimating selectivities of string predicates are often biasedtowards severely underestimating selectivities. In thispaper, we develop accurate selectivity estimators for stringpredicates that adapt to data and query characteristics,and which can exploit and build on a variety of existingestimators. A thorough experimental evaluation over realdata sets demonstrates the resilience of our estimators tovariations in both data and query characteristics. ICDE Multiresolution Indexing of XML for Frequent Queries. Hao He,Jun Yang 2004 XML and other types of semi-structured data are typicallyrepresented by a labeled directed graph. To speedup path expression queries over the graph, a variety ofstructural indexes have been proposed. They usually workby partitioning nodes in the data graph into equivalenceclasses and storing equivalence classes as index nodes.A(k)-index introduces the concept of local bisimilarity forpartitioning, allowing the trade-off between index size andquery answering power. However, all index nodes in A(k)-indexhave the same local similarity k, which cannot takeadvantage of the fact that a workload may contain path expressionsof different lengths, or that different parts of thedata graph may have different local similarity requirements.To overcome these limitations, we propose M(k)- andM*(k)-indexes. The basic M(k)-index is workload-aware:Like the previously proposed D(k)-index, it allows differentindex nodes to have different local similarity requirements,providing finer partitioning only for parts of the datagraph targeted by longer path expressions. 
Unlike D(k)-index, M(k)-index is never over-refined for irrelevant index or data nodes. However, the workload-aware feature still incurs over-refinement due to over-qualified parent index nodes. Moreover, fine partitions penalize the performance of short path expressions. To solve these problems, we further propose the M*(k)-index. An M*(k)-index consists of a collection of indexes whose nodes are organized in a partition hierarchy, allowing successively coarser partitioning information to co-exist with the finest partitioning information required. Experiments show that our indexes are superior to previously proposed indexes in terms of index size and query performance. ICDE SQLCM: A Continuous Monitoring Framework for Relational Database Engines. Surajit Chaudhuri,Arnd Christian König,Vivek R. Narasayya 2004 "The ability to monitor a database server is crucial for effective database administration. Today's commercial database systems support two basic mechanisms for monitoring: (a) obtaining a snapshot of counters to capture current state, and (b) logging events in the server to a table/file to capture history. In this paper we show that for a large class of important database administration tasks the above mechanisms are inadequate in functionality or performance. We present an infrastructure called SQLCM that enables continuous monitoring inside the database server and that has the ability to automatically take actions based on monitoring. We describe the implementation of SQLCM in Microsoft SQL Server and show how several common and important monitoring tasks can be easily specified in SQLCM. Our experimental evaluation indicates that SQLCM imposes low overhead on normal server execution and enables monitoring tasks on a production server that would be too expensive using today's monitoring mechanisms." ICDE Multi-Scale Histograms for Answering Queries over Time Series Data. Lei Chen,M. Tamer Özsu 2004 Multi-Scale Histograms for Answering Queries over Time Series Data. ICDE An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting. Ding-Ying Chiu,Yi-Hung Wu,Arbee L. P. Chen 2004 Mining sequential patterns in large databases is an important research topic. The main challenge of mining sequential patterns is the high processing cost due to the large amount of data. In this paper, we propose a new strategy called DIrect Sequence Comparison (abbreviated as DISC), which can find frequent sequences without having to compute the support counts of non-frequent sequences. The main difference between the DISC strategy and the previous works is the way to prune non-frequent sequences. The previous works are based on the anti-monotone property, which prunes the non-frequent sequences according to the frequent sequences with shorter lengths. On the contrary, the DISC strategy prunes the non-frequent sequences according to the other sequences with the same length. Moreover, we summarize three strategies used in the previous works and design an efficient algorithm called DISC-all to take advantage of all four strategies. The experimental results show that the DISC-all algorithm outperforms the PrefixSpan algorithm on mining frequent sequences in large databases. In addition, we analyze these strategies to design the dynamic version of our algorithm, which achieves a much better performance. ICDE Go Green: Recycle and Reuse Frequent Patterns. Gao Cong,Beng Chin Ooi,Kian-Lee Tan,Anthony K. H.
Tung 2004 In constrained data mining, users can specify constraintsto prune the search space to avoid mining uninterestingknowledge.This is typically done by specifyingsome initial values of the constraints that aresubsequently refined iteratively until satisfactory resultsare obtained.Existing mining schemes treat each iterationas a distinct mining process, and fail to exploit theinformation generated between iterations.In this paper,we propose to salvage knowledge that is discoveredfrom an earlier iteration of mining to enhance subsequentrounds of mining.In particular, we look at howfrequent patterns can be recycled.Our proposed strategyoperates in two phases.In the first phase, frequentpatterns obtained from an early iteration are used tocompress a database.In the second phase, subsequentmining processes operate on the compressed database.We propose two compression strategies and adapt threeexisting frequent pattern mining techniques to exploitthe compressed database.Results from our extensiveexperimental study show that our proposed recycling algorithmsoutperform their non-recycling counterpart byan order of magnitude. ICDE Approximate Aggregation Techniques for Sensor Databases. Jeffrey Considine,Feifei Li,George Kollios,John W. Byers 2004 In the emerging area of sensor-based systems, a significantchallenge is to develop scalable, fault-tolerantmethods to extract useful information from the data thesensors collect.An approach to this data managementproblem is the use of sensor database systems, exemplifiedby TinyDB and Cougar, which allow users to performaggregation queries such as MIN, COUNT andAVG on a sensor network.Due to power and range constraints,centralized approaches are generally impractical,so most systems use in-network aggregation to reducenetwork traffic.However, these aggregation strategiesbecome bandwidth-intensive when combined with thefault-tolerant, multi-path routing methods often used inthese environments.For example, duplicate-sensitive aggregatessuch as SUM cannot be computed exactly usingsubstantially less bandwidth than explicit enumeration.To avoid this expense, we investigate the use of approximatein-network aggregation using small sketches.Our contributions are as follows: 1) we generalize wellknown duplicate-insensitive sketches for approximatingCOUNT to handle SUM, 2) we present and analyze methodsfor using sketches to produce accurate results withlow communication and computation overhead, and 3)we present an extensive experimental validation of ourmethods. ICDE Lazy Database Replication with Ordering Guarantees. Khuzaima Daudjee,Kenneth Salem 2004 Lazy replication is a popular technique for improvingthe performance and availability of database systems. Althoughthere are concurrency control techniques whichguarantee serializability in lazy replication systems, thesetechniques may result in undesirable transaction orderings.Since transactions may see stale data, they may be serializedin an order different from the one in which they weresubmitted. Strong serializability avoids such problems, butit is very costly to implement. In this paper, we propose ageneralized form of strong serializability that is suitable foruse with lazy replication. In addition to having many of theadvantages of strong serializability, it can be implementedmore efficiently. We show how generalized strong serializabilitycan be implemented in a lazy replication system, andwe present the results of a simulation study that quantifiesthe strengths and limitations of the approach. 
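The approximate in-network aggregation entry above (Considine et al.) rests on duplicate-insensitive sketches, so that the same reading arriving over several routing paths is counted only once. The following Python fragment is a minimal sketch of a Flajolet-Martin-style duplicate-insensitive counter of that general kind; it is not the authors' code, the class and helper names are hypothetical, and a real deployment would average many bitmaps and extend the idea to SUM as the paper describes.

import hashlib

def lowest_set_bit(x):
    # 0-based position of the lowest-order set bit of x (x > 0 assumed).
    pos = 0
    while x & 1 == 0:
        x >>= 1
        pos += 1
    return pos

class FMSketch:
    # Duplicate-insensitive counter: re-inserting an item sets the same bit,
    # and two sketches merge losslessly with a bitwise OR.
    PHI = 0.77351  # Flajolet-Martin correction constant

    def __init__(self):
        self.bitmap = 0

    def insert(self, item_id):
        h = int(hashlib.sha1(item_id.encode()).hexdigest(), 16)
        self.bitmap |= 1 << lowest_set_bit(h)

    def merge(self, other):
        self.bitmap |= other.bitmap

    def estimate(self):
        r = 0
        while self.bitmap & (1 << r):  # index of the first zero bit
            r += 1
        return (2 ** r) / self.PHI

Because merge is a bitwise OR, a node can forward its sketch along every available path without inflating the final COUNT estimate; that robustness to duplicates is the property the entry above exploits.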
ICDE Minimization and Group-By Detection for Nested XQueries. Alin Deutsch,Yannis Papakonstantinou,Yu Xu 2004 Minimization and Group-By Detection for Nested XQueries. ICDE Range CUBE: Efficient Cube Computation by Exploiting Data Correlation. Ying Feng,Divyakant Agrawal,Amr El Abbadi,Ahmed Metwally 2004 Data cube computation and representation are prohibitivelyexpensive in terms of time and space. Prior workhas focused on either reducing the computation time or condensingthe representation of a data cube. In this paper,we introduce Range Cubing as an efficient way to computeand compress the data cube without any loss of precision.A new data structure, range trie, is used to compress andidentify correlation in attribute values, and compress theinput dataset to effectively reduce the computational cost.The range cubing algorithm generates a compressed cube,called range cube, which partitions all cells into disjointranges. Each range represents a subset of cells with thesame aggregation value, as a tuple which has the same numberof dimensions as the input data tuples. The range cubepreserves the roll-up/drill-down semantics of a data cube.Compared to H-Cubing, experiments on real dataset showa running time of less than one thirtieth, still generating arange cube of less than one ninth of the space of the fullcube, when both algorithms run in their preferred dimensionorders. On synthetic data, range cubing demonstratesmuch better scalability, as well as higher adaptiveness toboth data sparsity and skew. ICDE Database Research for the Current Millennium. Daniela Florescu 2004 Database Research for the Current Millennium. ICDE XML Query Processing. Daniela Florescu,Donald Kossmann 2004 XML Query Processing. ICDE A Flexible Infrastructure for Gathering XML Statistics and Estimating Query Cardinality. Juliana Freire,Maya Ramanath,Lingzhi Zhang 2004 A Flexible Infrastructure for Gathering XML Statistics and Estimating Query Cardinality. ICDE """My Personal Web"": A Seminar on Personalization and Privacy for Web and Converged Services." Irini Fundulaki,Richard Hull,Bharat Kumar,Daniel F. Lieuwen,Arnaud Sahuguet 2004 """My Personal Web"": A Seminar on Personalization and Privacy for Web and Converged Services." ICDE OntoBuilder: Fully Automatic Extraction and Consolidation of Ontologies from Web Sources. Avigdor Gal,Giovanni A. Modica,Hasan M. Jamil 2004 OntoBuilder: Fully Automatic Extraction and Consolidation of Ontologies from Web Sources. ICDE Querying the Past, the Present, and the Future. Dieter Gawlick 2004 Querying the Past, the Present, and the Future. ICDE Applications for Expression Data in Relational Database System. Dieter Gawlick,Dmitry Lenkov,Aravind Yalamanchi,Lucy Chernobrod 2004 The support for the expression data type in arelational database system allows storing of conditionalexpressions as data in database tables and evaluatingthem using SQL queries. In the context of this newcapability, expressions can be interpreted asdescriptions, queries, and filters, and this significantlybroadens the use of a relational database system tosupport new types of applications. The paper presentsan overview of the expression data type, relatesexpressions to descriptions, queries, and filters,considers applications pertaining to informationdistribution, demand analysis, and task assignment, andshows how these applications can be easily supportedwith improved functionality. ICDE Bulk Operations for Space-Partitioning Trees. Thanaa M. Ghanem,Rahul Shah,Mohamed F. Mokbel,Walid G. 
Aref,Jeffrey Scott Vitter 2004 The emergence of extensible index structures, e.g.,GiST (Generalized Search Tree) and SP-GiST (Space-PartitioningGeneralized Search Tree), calls for a set ofextensible algorithms to support different operations (e.g.,insertion, deletion, and search). Extensible bulk operations(e.g., bulk loading and bulk insertion) are of the same importanceand need to be supported in these index engines.In this paper, we propose two extensible buffer-based algorithmsfor bulk operations in the class of space-partitioningtrees; a class of hierarchical data structures that recursivelydecompose the space into disjoint partitions. Themain idea of these algorithms is to build an in-memory treeof the target space-partitioning index. Then, data itemsare recursively partitioned into disk-based buffers usingthe in-memory tree. Although the second algorithm is designedfor bulk insertion, it can be used in bulk loading aswell. The proposed extensible algorithms are implementedinside SP-GiST; a framework for supporting the class ofspace-partitioning trees. Both algorithms have I/O boundO(NH/B), whereN is the number of data items to be bulkloaded/inserted, B is the number of tree nodes that can fitin one disk page, H is the tree height in terms of pages afterapplying a clustering algorithm. Experimental results areprovided to show the scalability and applicability of the proposedalgorithms for the class of space-partitioning trees.A comparison of the two proposed algorithms shows thatthe first algorithm performs better in case of bulk loading.However the second algorithm is more general and can beused for efficient bulk insertion. ICDE Using Stream Semantics for Continuous Queries in Media Stream Processors. Amarnath Gupta,Bin Liu,Pilho Kim,Ramesh Jain 2004 Using Stream Semantics for Continuous Queries in Media Stream Processors. ICDE Using vTree Indices for Queries over Objects with Complex Motions. Sandeep Gupta,Chinya V. Ravishankar 2004 Using vTree Indices for Queries over Objects with Complex Motions. ICDE Driving Forces in Database Technology. Steven Hagan 2004 "Several forces, with impacts so fundamental thatthey are akin to tectonic plate movements, are drivingthe commercial database marketplace. First ishardware commoditization: arrays of low pricedcomputers with high speed interconnects which yieldthe new cluster based computing capabilities referredto as 'Grid,' 'Utility,' and 'on-demand' computing, atprice points radically lower than standard Moore's lawprojections. The dramatic reductions in online storagehardware costs now makes it cost effective forcompanies to keep previously unimagined amounts ofcomplex data online. This will enable V/ULDBprojects with petabyte databases such as online imageapplications and data-driven supply chain managementapproaches (e.g. RFID) that store huge volumes ofhighly granular detail information in data warehouses(with significant history of temporal and spatialinterest)." ICDE Nile: A Query Processing Engine for Data Streams. Moustafa A. Hammad,Mohamed F. Mokbel,Mohamed H. Ali,Walid G. Aref,Ann Christine Catlin,Ahmed K. Elmagarmid,Mohamed Y. Eltabakh,Mohamed G. Elfeky,Thanaa M. Ghanem,Robert Gwadera,Ihab F. Ilyas,Mirette S. Marzouk,Xiaopeng Xiong 2004 Nile: A Query Processing Engine for Data Streams. ICDE Publish/Subscribe in NonStop SQL: Transactional Streams in a Relational Context. Mike Hanlon,Johannes Klein,Robbert C. 
Van der Linden,Hansjörg Zeller 2004 Relational queries on continuous streams of data arethe subject of many recent database research projects. In1998 a small group of people started a similar projectwith the goal to transform our product, NonStop SQL/MX,into an active RDBMS. This project tried to integratefunctionality of transactional queuing systems with relationaltables and with SQL, using simple extensions to theSQL syntax and guaranteeing clearly defined query andtransactional semantics. The result is the first commerciallyavailable RDBMS that incorporates streams. Alldata flowing through the system is contained in relationaltables and is protected by ACID transactions. Insert andupdate operations on any NonStop SQL table can be consideredpublishing of data and can therefore be transparentto the (legacy) applications performing them. Unliketriggers, the publish operation does not increase the pathlength of the application and it allows the subscriber toexecute in a separate transaction. Subscribers, using anextended SQL syntax, see a continuous stream of data,consisting of all rows originally in the table plus all rowsthat are inserted or updated thereafter. The system scalesby using partitioned tables and therefore partitionedstreams. ICDE Implementation and Research Issues in Query Processing for Wireless Sensor Networks . Wei Hong,Samuel Madden 2004 Implementation and Research Issues in Query Processing for Wireless Sensor Networks . ICDE Mining the Web for Generating Thematic Metadata from Textual Data. Chien-Chung Huang,Shui-Lung Chuang,Lee-Feng Chien 2004 Mining the Web for Generating Thematic Metadata from Textual Data. ICDE ContextMetricsTM: Semantic and Syntactic Interoperability in Cross-Border Trading Systems. Chito Jovellanos 2004 "This paper describes a method and system forquantifying the variances in the semantics and syntaxof electronic transactions exchanged betweenbusiness counterparties. ContextMetricsTM enables (a)dynamic transformations of outbound and inboundtransactions needed to effect 'straight-through-processing'(STP); (b) unbiased assessments ofcounterparty systems' capabilities to support STP;and (c) modeling of operational risks and financialexposures stemming from an enterprise'stransactional systems." ICDE Efficient Similarity Search in Large Databases of Tree Structured Objects. Karin Kailing,Hans-Peter Kriegel,Stefan Schönauer,Thomas Seidl 2004 Efficient Similarity Search in Large Databases of Tree Structured Objects. ICDE ItCompress: An Iterative Semantic Compression Algorithm. H. V. Jagadish,Raymond T. Ng,Beng Chin Ooi,Anthony K. H. Tung 2004 "Real datasets are often large enough to necessitate datacompression. Traditional 'syntactic' data compression methodstreat the table as a large byte string and operate at thebyte level. The tradeoff in such cases is usually between theease of retrieval (the ease with which one can retrieve a singletuple or attribute value without decompressing a much largerunit) and the effectiveness of the compression. In this regard,the use of semantic compression has generated considerableinterest and motivated certain recent works.In this paper, we propose a semantic compression algorithmcalled ItCompress ITerative Compression, whichachieves good compression while permitting access even atattribute level without requiring the decompression of a largerunit. ItCompress iteratively improves the compression ratioof the compressed output during each scan of the table. 
The amount of compression can be tuned based on the number of iterations. Moreover, the initial iterations provide significant compression, thereby making it a cost-effective compression technique. Extensive experiments were conducted and the results indicate the superiority of ItCompress with respect to previously known techniques, such as 'SPARTAN' and 'fascicles'." ICDE On the Integration of Structure Indexes and Inverted Lists. Raghav Kaushik,Rajasekar Krishnamurthy,Jeffrey F. Naughton,Raghu Ramakrishnan 2004 Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over the Niagara XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance optimal algorithms to push down top k computation. ICDE Similarity Search in Multimedia Databases. Daniel A. Keim,Benjamin Bustos 2004 Similarity Search in Multimedia Databases. ICDE SQUIRE: Sequential Pattern Mining with Quantities. Chulyun Kim,Jong-Hwa Lim,Raymond T. Ng,Kyuseok Shim 2004 In this paper, we consider the problem of mining sequential patterns with quantities. Naive extensions to existing algorithms for sequential patterns are inefficient, as they may enumerate the search space blindly. To alleviate the situation, we propose hash filtering and quantity sampling techniques that significantly improve the performance of the naive extensions. ICDE MTCache: Transparent Mid-Tier Database Caching in SQL Server. Per-Åke Larson,Jonathan Goldstein,Jingren Zhou 2004 "Many applications today run in a multi-tier environment with browser-based clients, mid-tier (application) servers and a backend database server. Mid-tier database caching attempts to improve system throughput and scalability by offloading part of the database workload to intermediate database servers that partially replicate data from the backend server. The fact that some queries are offloaded to an intermediate server should be completely transparent to applications - one of the key distinctions between caching and replication. MTCache is a prototype mid-tier database caching solution for SQL Server that achieves this transparency. It builds on SQL Server's support for materialized views, distributed queries and replication. This paper describes MTCache and reports experimental results on the TPC-W benchmark. The experiments show that a significant part of the query workload can be offloaded to cache servers, resulting in greatly improved scale-out on the read-dominated workloads of the benchmark. Replication overhead was small with an average replication delay of less than two seconds." ICDE Incorporating Updates in Domain Indexes: Experiences with Oracle Spatial R-trees.
Kothuri Venkata Ravi Kanth,Siva Ravada,Ning An 2004 Much research has been devoted to scalable storage andretrieval techniques for domain databases such as spatial,text, xml and gene sequence data. Many efficient indexingtechniques have been developed in this context. Given theimprovement in the underlying technology, database applicationsare increasingly using domain data in transactionalsemantics. In this paper, we examine the issue of when duringthe lifetime of a transaction is it better to incorporateupdates in domain indexes. We present our experiences withR-tree indexes in Oracle.We examine two approaches for incorporating updatesin spatial R-tree indexes: the first at update time, and thesecond at commit time. The first approach immediatelyincorporates changes in the index right away using systemtransactions and at commit time makes them visibleto other transactions. The second approach, referred toas the deferred-incorporate approach, defers the updatesin a secondary table and incorporates the changes in theindex only at commit time. In experiments on real datasets, we compare the performance of the two approaches.For most transactions with reasonable number of updateoperations, we observe that the deferred approach outperformsthe immediate-incorporate approach significantlyfor update operations and with appropriate optimizationsachieves comparable query performance. ICDE Approximate Selection Queries over Imprecise Data. Iosif Lazaridis,Sharad Mehrotra 2004 We examine the problem of evaluating selection queriesover imprecisely represented objects. Such objects are usedeither because they are much smaller in size than the preciseones (e.g., compressed versions of time series), or asimprecise replicas of fast-changing objects across the network(e.g., interval approximations for time-varying sensorreadings). It may be impossible to determine whether an impreciseobject meets the selection predicate. Additionally,the objects appearing in the output are also imprecise. Retrievingthe precise objects themselves (at additional cost)can be used to increase the quality of the reported answer.In our paper we allow queries to specify their own answerquality requirements. We show how the query evaluationsystem may do the minimal amount of work to meetthese requirements. Our work presents two important contributions:first, by considering queries with set-based answers,rather than the approximate aggregate queries overnumerical data examined in the literature; second, by aimingto minimize the combined cost of both data processingand probe operations in a single framework. Thus, we establishthat the answer accuracy/performance tradeoff canbe realized in a more general setting than previously seen. ICDE LDC: Enabling Search By Partial Distance In A Hyper-Dimensional Space. Nick Koudas,Beng Chin Ooi,Heng Tao Shen,Anthony K. H. Tung 2004 Recent advances in research fields like multimediaand bioinformatics have brought about a new generation of hyper-dimensional databases which can contain hundreds or even thousands of dimensions. Such hyper-dimensional databases pose significant problems to existinghigh-dimensional indexing techniques which have been developed for indexing databases with (commonly) lessthan a hundred dimensions. To support efficient querying and retrieval on hyper-dimensional databases, we propose a methodology called Local Digital Coding (LDC)which can support k-nearest neighbors (KNN) queries onhyper-dimensional databases and yet co-exist with ubiquitous indices, such as B+-trees. 
LDC extracts a simple bitmap representation called Digital Code(DC) for each point in the database.Pruning during KNN search is performed by dynamically selecting only a subset of the bits from the DC based on which subsequent comparisons are performed. In doing so, expensive operations involved in computing L-norm distance functions between hyper-dimensional data can be avoided. Extensive experiments are conducted to show that our methodology offers significant performance advantages over other existing indexing methods on both real life and synthetic hyper-dimensional datasets. ICDE Routing XML Queries. Nick Koudas,Michael Rabinovich,Divesh Srivastava,Ting Yu 2004 Routing XML Queries. ICDE Personalization of Queries in Database Systems. Georgia Koutrika,Yannis E. Ioannidis 2004 As information becomes available in increasingamounts to a wide spectrum of users, the need fora shift towards a more user-centered informationaccess paradigm arises. We develop a personalizationframework for database systems based onuser profiles and identify the basic architecturalmodules required to support it. We define a preferencemodel that assigns to each atomic querycondition a personal degree of interest and providea mechanism to compute the degree of interestin any complex query condition based on thedegrees of interest in the constituent atomic ones.Preferences are stored in profiles. At query time,personalization proceeds in two steps: (a) preferenceselection and (b) preference integration intothe original user query. We formulate the mainpersonalization step, i.e. preference selection, asa graph computation problem and provide an efficientalgorithm for it. We also discuss results ofexperimentation with a prototype query personalization system. ICDE Efficient Execution of Computation Modules in a Model with Massive Data. Gary Kratkiewicz,Renu Kurien Bostwick,Geoffrey S. Knauth 2004 Efficient Execution of Computation Modules in a Model with Massive Data. ICDE Recursive XML Schemas, Recursive XML Queries, and Relational Storage: XML-to-SQL Query Translation. Rajasekar Krishnamurthy,Venkatesan T. Chakaravarthy,Raghav Kaushik,Jeffrey F. Naughton 2004 "We consider the problem of translating XML queries intoSQL when XML documents have been stored in an RDBMSusing a schema-based relational decomposition. Surprisingly,there is no published XML-to-SQL query translationalgorithm for this scenario that handles recursive XMLschemas. We present a generic algorithm to translate pathexpression queries into SQL in the presence of recursionin the schema and queries. This algorithm handles a generalclass of XML-to-Relational mappings, which includesall techniques proposed in literature. Some of the salientfeatures of this algorithm are: (i) It translates a path expressionquery into a single SQL query, irrespective of howcomplex the XML schema is, (ii) It uses the ""with"" clause inSQL99 to handle recursive queries even over non-recursiveschemas, (iii) It reconstructs recursive XML subtrees witha single SQL query and (iv) It shows that the support forlinear recursion in SQL99 is sufficient for handling pathexpression queries over arbitrarily complex recursive XMLschema." ICDE Enabling Communities of Knowledge Workers. 
David Lehman 2004 The MITRE Corporation is a geographically distributed company of knowledge workers working on projects either related by the same customer or by the problems and technologies being addressed. MITRE early on recognized the power of information technology to enhance the effectiveness of each of its staff with knowledge sharing tools. Early deployment of an Intranet with novel functions has promoted information sharing among the staff, enabling each staff member to effectively leverage the knowledge of the entire company. Specific examples, storage structures, information spaces, search mechanisms and policy issues that have promoted sharing have revealed many lessons for enabling the knowledge worker. ICDE LexEQUAL: Supporting Multilexical Queries in SQL. A. Kumaran,Jayant R. Haritsa 2004 LexEQUAL: Supporting Multilexical Queries in SQL. ICDE RACCOON: A Peer-Based System for Data Integration and Sharing. Chen Li,Jia Li,Qi Zhong 2004 RACCOON: A Peer-Based System for Data Integration and Sharing. ICDE Spectral Analysis of Text Collection for Similarity-based Clustering. Wenyuan Li,Wee Keong Ng,Ee-Peng Lim 2004 Spectral Analysis of Text Collection for Similarity-based Clustering. ICDE Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. Xuemin Lin,Hongjun Lu,Jian Xu,Jeffrey Xu Yu 2004 Statistics over the most recently observed data elements are often required in applications involving data streams, such as intrusion detection in network monitoring, stock price prediction in financial markets, web log mining for access prediction, and user click stream mining for personalization. Among various statistics, computing a quantile summary is probably the most challenging because of its complexity. In this paper, we study the problem of continuously maintaining a quantile summary of the most recently observed N elements over a stream so that quantile queries can be answered with a guaranteed precision of $\epsilon N$. We developed a space efficient algorithm for a pre-defined N that requires only one scan of the input data stream and $O(\frac{\log(\epsilon^2 N)}{\epsilon} + \frac{1}{\epsilon^2})$ space in the worst case. We also developed an algorithm that maintains quantile summaries for the most recent N elements so that quantile queries on any most recent n elements ($n \leq N$) can be answered with a guaranteed precision of $\epsilon n$. The worst case space requirement for this algorithm is only $O(\frac{\log^2(\epsilon N)}{\epsilon^2})$. Our performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound. ICDE Algebraic Signatures for Scalable Distributed Data Structures. Witold Litwin,Thomas J. E. Schwarz 2004 Signatures detect changes to data objects. Numerous schemes are in use, especially the cryptographically secure standard SHA-1. We propose a novel signature scheme which we call algebraic signatures. The scheme uses Galois Field calculations. Its major property is the sure detection of any changes up to a parameterized size. More precisely, we detect for sure any changes that do not exceed n symbols for an n-symbol algebraic signature. This property is new for any known signature scheme. For larger changes, the collision probability is typically negligible, as for the other known schemes. We apply the algebraic signatures to the Scalable Distributed Data Structures (SDDS). We filter at the SDDS client node the updates that do not actually change the records.
We also manage the concurrentupdates to data stored in the SDDS RAM buckets at theserver nodes. We further use the scheme for the fastdisk backup of these buckets. We sign our objects with4-byte signatures, instead of 20-byte standard SHA-1signatures. Our algebraic calculus is then also abouttwice as fast. ICDE Modeling Uncertainties in Publish/Subscribe Systems. Haifeng Liu,Hans-Arno Jacobsen 2004 In the publish/subscribe paradigm, informationproviders disseminate publications to all consumers whohave expressed interest by registering subscriptions. Thisparadigm has found wide-spread applications, rangingfrom selective information dissemination to network management.However, all existing publish/subscribe systemscannot capture uncertainty inherent to the information ineither subscriptions or publications. In many situations,exact knowledge of either specific subscriptions or publicationsis not available. Moreover, especially in selectiveinformation dissemination applications, it is often moreappropriate for a user to formulate her search requestsor information offers in less precise terms, rather thandefining a sharp limit. To address these problems, thispaper proposes a new publish/subscribe model based onpossibility theory and fuzzy set theory to process uncertaintiesfor both subscriptions and publications. Furthermore,an approximate publish/subscribe matching problem isdefined and algorithms for solving it are developed andevaluated. ICDE A Probabilistic Approach to Metasearching with Adaptive Probing. Zhenyu Liu,Chang Luo,Junghoo Cho,Wesley W. Chu 2004 "An ever-increasing amount of valuable information isstored in Web databases, ""hidden"" behind search interfaces.To save the user's effort in manually exploring eachdatabase, metasearchers automatically select the most relevantdatabases to a user's query. In thispaper, we focus on one of the technical challenges in metasearching,namely database selection. Past research uses a pre-collectedsummary of each database to estimate its ""relevancy"" to thequery, and in many cases make incorrect database selection.In this paper, we propose two techniques: probabilisticrelevancy modelling and adaptive probing. First, we modelthe relevancy of each database to a given query as a probabilisticdistribution, derived by sampling that database. Usingthe probabilistic model, the user can explicitly specify a desiredlevel of certainty for database selection. The adaptiveprobing technique decides which and how many databases to contactin order to satisfy the user's requirement. Our experimentson real Hidden-Web databases indicate that our approach significantlyimproves the accuracy of database selection at the cost ofa small number of database probing." ICDE Simple, Robust and Highly Concurrent B-trees with Node Deletion. David B. Lomet 2004 "Why might B-tree concurrency control still beinteresting? For two reasons: (i) currentlyexploited ""real world"" approaches arecomplicated; (ii) simpler proposals are not usedbecause they are not sufficiently robust. In the""real world"", systems need to deal robustly withnode deletion, and this is an important reasonwhy the currently exploited techniques arecomplicated. In our effort to simplify the worldof robust and highly concurrent B-tree methods,we focus on exactly where b-tree concurrencycontrol needs information about node deletes,and describe mechanisms that provide thatinformation. 
We exploit the Blink-tree property ofbeing ""well-formed"" even when index termposting for a node split has not been completedto greatly simplify our algorithms. Our goal is todescribe a very simple but nonetheless robustmethod." ICDE Database Kernel Research: What, if anything, is left to do? David B. Lomet 2004 Database Kernel Research: What, if anything, is left to do? ICDE DBA Companion: A Tool for Logical Database Tuning. Stéphane Lopes,Fabien De Marchi,Jean-Marc Petit 2004 DBA Companion: A Tool for Logical Database Tuning. ICDE Content-based Three-dimensional Engineering Shape Search. Kuiyang Lou,S. Prabhakar,Karthik Ramani 2004 In this paper, we discuss the design andimplementation of a prototype 3D Engineering ShapeSearch system. The system incorporates multiplefeature vectors, relevance feedback, and query byexample and browsing, flexible definition of shapesimilarity, and efficient execution through multi-dimensionalindexing and clustering. In order to offermore information for a user to determine similarity of3D engineering shape, a 3D interface that allows usersto manipulate shapes is proposed and implemented topresent the search results. The system allows users tospecify which feature vectors should be used toperform the search.The system is used to conduct extensiveexperimentation real data to test the effectiveness ofvarious feature vectors for shape - the first suchcomparison of this type. The test results show that thedescending order of the average precision of featurevectors is: principal moments, moment invariants,geometric parameters, and eigenvalues. In addition, amulti-step similarity search strategy is proposed andtested in this paper to improve the effectiveness of 3Dengineering shape search. It is shown that the multi-stepapproach is more effective than the one-shotsearch approach, when a fixed number of shapes areretrieved. ICDE Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation. Bertram Ludäscher,Alan Nash 2004 Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation. ICDE Function Proxy: Template-Based Proxy Caching for Table-Valued Functions. Qiong Luo,Wenwei Xue 2004 Function Proxy: Template-Based Proxy Caching for Table-Valued Functions. ICDE GODIVA: Lightweight Data Management for Scientific Visualization Applications. Xiaosong Ma,Marianne Winslett,John Norris,Xiangmin Jiao,Robert Fiedler 2004 "Scientific visualization applications are very data-intensive,with high demands for I/O and data management.Developers of many visualization tools hesitate to use traditionalDBMSs, due to the lack of support for these DBMSson parallel platforms and the risk of reducing the portabilityof their tools and the user data. In this paper, we proposethe GODIVA framework, which provides simple database-likeinterfaces to help visualization tool developers managetheir in-memory data, and I/O optimizations such asprefetching and caching to improve input performance atrun time. We implemented the GODIVA interfaces in astand-alone, portable user library, which can be used by alltypes of visualization codes: interactive and batch-mode,sequential and parallel. Performance results from runninga visualization tool using the GODIVA library on multipleplatforms show that the GODIVA framework is easy to use,alleviates developers' data management burden, and canbring substantial I/O performance improvement." ICDE t-Synopses: A System for Run-Time Management of Remote Synopses. 
Yossi Matias,Leon Portman 2004 t-Synopses: A System for Run-Time Management of Remote Synopses. ICDE Nested Queries and Quantifiers in an Ordered Context. Norman May,Sven Helmer,Guido Moerkotte 2004 We present algebraic equivalences that allow to unnestnested algebraic expressions for order-preserving algebraicoperators. We illustrate how these equivalences canbe applied successfully to unnest nested queries given inthe XQuery language. Measurements illustrate the performancegains possible by unnesting. ICDE Priority Mechanisms for OLTP and Transactional Web Applications. David T. McWherter,Bianca Schroeder,Anastassia Ailamaki,Mor Harchol-Balter 2004 "Transactional workloads are a hallmark of modernOLTP and Web applications, ranging from electronic commerceand banking to online shopping. Often, the databaseat the core of these applications is the performance bottleneck.Given the limited resources available to the database,transaction execution times can vary wildly as they competeand wait for critical resources. As the competitor is ""only aclick away,"" valuable (high-priority) users must be ensuredconsistently good performance via QoS and transaction prioritization.This paper analyzes and proposes prioritization fortransactional workloads in traditional database systems(DBMS). This work first performs a detailed bottleneckanalysis of resource usage by transactional workloads oncommercial and noncommercial DBMS (IBM DB2, PostgreSQL,Shore) under a range of configurations. Second,this work implements and evaluates the performance of severalpreemptive and non-preemptive DBMS prioritizationpolicies in PostgreSQL and Shore. The primary contributionsof this work include (i) understanding the bottleneckresources in transactional DBMS workloads and (ii) ademonstration that prioritization in traditional DBMS canprovide 2x-5x improvement for high-priority transactionsusing simple scheduling policies, without expense to low-prioritytransactions." ICDE Scalable Multimedia Disk Scheduling. Mohamed F. Mokbel,Walid G. Aref,Khaled M. Elbassioni,Ibrahim Kamel 2004 A new multimedia disk scheduling algorithm, termedCascaded-SFC, is presented. The Cascaded-SFC multimediadisk scheduler is applicable in environments where multimediadata requests arrive with different quality of service(QoS) requirements such as real-time deadline and user priority.Previous work on disk scheduling has focused on optimizingthe seek times and/or meeting the real-time deadlines.The Cascaded-SFC disk scheduler provides a unifiedframework for multimedia disk scheduling that scaleswith the number of scheduling parameters. The generalidea is based on modeling the multimedia disk requestsas points in multiple multi-dimensional sub-spaces, whereeach of the dimensions represents one of the parameters(e.g., one dimension represents the request deadline, anotherrepresents the disk cylinder number, and a third dimensionrepresents the priority of the request, etc.). Eachmulti-dimensional sub-space represents a subset of the QoSparameters that share some common scheduling characteristics.Then the multimedia disk scheduling problem reducesto the problem of finding a linear order to traversethe multi-dimensional points in each sub-space. Multiplespace-filling curves are selected to fit the scheduling needsof the QoS parameters in each sub-space. The orders ineach sub-space are integrated in a cascaded way to providea total order for the whole space. 
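The Cascaded-SFC abstract above maps each pending request to a point in a space of QoS parameters and uses space-filling curves to impose a single service order. Below is a toy sketch of that ordering idea, assuming a plain Z-order (Morton) curve over two made-up parameters (deadline and cylinder); the paper's cascaded, multi-curve design is not reproduced here.

```python
def z_order_key(coords, bits=16):
    """Morton (Z-order) key: interleave the bits of the integer coordinates.

    Sorting requests by this key yields one total service order in which
    requests close in the parameter space tend to be served close together.
    """
    key = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            key = (key << 1) | ((c >> bit) & 1)
    return key

def scale(value, max_value, bits=16):
    """Scale a raw parameter into the [0, 2**bits) grid used by the curve."""
    return min(value * ((1 << bits) - 1) // max_value, (1 << bits) - 1)

# Hypothetical pending requests: deadline- and seek-related concerns are
# folded into a single order via the curve.
requests = [
    {"id": "r1", "deadline_ms": 120, "cylinder": 40000},
    {"id": "r2", "deadline_ms": 15,  "cylinder": 39500},
    {"id": "r3", "deadline_ms": 118, "cylinder": 100},
]

service_order = sorted(
    requests,
    key=lambda r: z_order_key((scale(r["deadline_ms"], 200),
                               scale(r["cylinder"], 65536))),
)
print([r["id"] for r in service_order])   # r2 comes first in this toy example
```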
Comprehensive experimentsdemonstrate the efficiency and scalability of theCascaded-SFC disk scheduling algorithm over other diskschedulers. ICDE Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results. Mohamed F. Mokbel,Ming Lu,Walid G. Aref 2004 This paper introduces the hash-merge join algorithm(HMJ, for short); a new non-blocking join algorithm thatdeals with data items from remote sources via unpredictable,slow, or bursty network traffic. The HMJ algorithmis designed with two goals in mind: (1) Minimize thetime to produce the first few results, and (2) Produce joinresults even if the two sources of the join operator occasionallyget blocked. The HMJ algorithm has two phases: Thehashing phase and the merging phase. The hashing phaseemploys an in-memory hash-based join algorithm that producesjoin results as quickly as data arrives. The mergingphase is responsible for producing join results if the twosources are blocked. Both phases of the HMJ algorithmare connected via a flushing policy that flushes in-memoryparts into disk storage once the memory is exhausted. Experimentalresults show that HMJ combines the advantagesof two state-of-the-art non-blocking join algorithms (XJoinand Progressive Merge Join) while avoiding their short-comings. ICDE A Machine Learning Approach to Rapid Development of XML Mapping Queries. Atsuyuki Morishima,Hiroyuki Kitagawa,Akira Matsumoto 2004 This paper presents XLearner, a novel tool that helpsthe rapid development of XML mapping queries writtenin XQuery. XLearner is novel in that it learns XQueryqueries consistent with given examples (fragments) of intendedquery results. XLearner combines known learningtechniques, incorporates mechanisms to cope with issuesspecific to the XQuery learning context, and provides a systematicway for the semi-automatic development of queries.This paper describes the XLearner system. It presents algorithmsfor learning various classes of XQuery, shows thata minor extension gives the system a practical expressivepower, and reports experimental results to demonstrate howXLearner outputs reasonably complicated queries with onlya small number of interactions with the user. ICDE A Frequency-based Approach for Mining Coverage Statistics in Data Integration. Zaiqing Nie,Subbarao Kambhampati 2004 Query optimization in data integration requires source coverageand overlap statistics.Gathering and storing the requiredstatistics presents many challenges, not the least of which is controllingthe amount of statistics learned.In this paper we introduceStatMiner, a novel statistics mining approach which automaticallygenerates attribute value hierarchies, efficiently discoversfrequently accesses query classes based on the learned attributevalue hierarchies, and learns statistics only with respect to theseclasses.We describe the details of our method, and present experimentalresults demonstrating the efficiency and effectiveness of ourapproach.Our experiments are done in the context of BibFinder,a publicly fielded bibliography mediator. ICDE Adapting a Generic Match Algorithm to Align Ontologies of Human Anatomy. Peter Mork,Philip A. 
Bernstein 2004 The difficulty inherent in schema matching has ledto the development of several generic match algorithms.This paper describes how we adapted generalapproaches to the specific task of aligning two ontologiesof human anatomy, the Foundational Model ofAnatomy and the GALEN Common Reference Model.Our approach consists of three phases: lexical, structuraland hierarchical, which leverage different aspectsof the ontologies as they are represented in ageneric meta-model. Lexical matching identifies conceptswith similar names. Structural matching identifiesconcepts whose neighbors are similar. Finally,hierarchical matching identifies concepts with similardescendants. We conclude by reporting on the lessonswe learned. ICDE Bitmap-Tree Indexing for Set Operations on Free Text. Ilias Nitsos,Georgios Evangelidis,Dimitrios Dervos 2004 In the present study we report on our implementation ofa hybrid-indexing scheme (Bitmap-Tree) that combines theadvantages of bitmap indexing and file inversion. The resultswe obtained are compared to those of the compressedinverted file index. Both storage overhead and query processingefficiency are taken into consideration. The proposednew method is shown to excel in handling queriesinvolving set operations. For general-purpose user queries,the Bitmap-Tree is shown to perform as good as the compressedinverted file index. ICDE Scaling Clustering Algorithms for Massive Data Sets using Data Streams. Silvia Nittel,Kelvin T. Leung,Amy Braverman 2004 Scaling Clustering Algorithms for Massive Data Sets using Data Streams. ICDE Superimposed Applications using SPARCE. Sudarshan Murthy,David Maier,Lois M. L. Delcambre,Shawn Bowers 2004 Superimposed Applications using SPARCE. ICDE SPINE: Putting Backbone into String Indexing. Naresh Neelapala,Romil Mittal,Jayant R. Haritsa 2004 The indexing technique commonly used for long strings,such as genomes, is the suffix tree, which is based on a vertical(intra-path) compaction of the underlying trie structure.In this paper, we investigate an alternative approach to indexbuilding, based on horizontal (inter-path) compactionof the trie. In particular, we present SPINE, a carefully engineeredhorizontally-compacted trie index. SPINE consistsof a backbone formed by a linear chain of nodes representingthe underlying string, with the nodes connected by arich set of edges for facilitating fast forward and backwardtraversals over the backbone during index construction andquery search. A special feature of SPINE is that it collapsesthe trie into a linear structure, representing the logical extremeof horizontal compaction.We describe algorithms for SPINE construction and forsearching this index to find the occurrences of query patterns.Our experimental results on a variety of real genomicand proteomic strings show that SPINE requires significantlyless space than standard implementations of suffixtrees. Further, SPINE takes lesser time for both constructionand search as compared to suffix trees, especially whenthe index is disk-resident. Finally, the linearity of its structuremakes it more amenable for integration with databaseengines. ICDE Can A Semantic Web for Life Sciences Improve Drug Discovery? Eric K. Neumann 2004 Can A Semantic Web for Life Sciences Improve Drug Discovery? ICDE An Efficient Framework for Order Optimization. Thomas Neumann,Guido Moerkotte 2004 Since the introduction of cost-based query optimization,the performance-critical role of interesting orders has beenrecognized. 
Some algebraic operators change interesting orders (e.g. sort and select), while others exploit interesting orders (e.g. merge join). The two operations performed by any query optimizer during plan generation are 1) computing the resulting order given an input order and an algebraic operator and 2) determining the compatibility between a given input order and the required order a given algebraic operator can beneficially exploit. Since these two operations are called millions of times during plan generation, they are highly performance-critical. The third crucial parameter is the space requirement for annotating every plan node with its output order. Lately, a powerful framework for reasoning about orders has been developed, which is based on functional dependencies. Within this framework, the current state-of-the-art algorithms for implementing the above operations both have a lower bound time requirement of Ω(n), where n is the number of functional dependencies involved. Further, the lower bound for the space requirement for every plan node is Ω(n). We improve these bounds by new algorithms with upper time bounds of O(1). That is, our algorithms for both operations work in constant time during plan generation, after a one-time preparation step. Further, the upper bound for the space requirement for plan nodes is O(1) for our approach. Besides, our algorithm reduces the search space by detecting and ignoring irrelevant orderings. Experimental results with a full-fledged query optimizer show that our approach significantly reduces the total time needed for plan generation. As a corollary of our experiments, it follows that the time spent for order processing is a non-negligible part of plan generation. ICDE Online Amnesic Approximation of Streaming Time Series. Themistoklis Palpanas,Michail Vlachos,Eamonn J. Keogh,Dimitrios Gunopulos,Wagner Truppel 2004 The past decade has seen a wealth of research on time series representations, because the manipulation, storage, and indexing of large volumes of raw time series data is impractical. The vast majority of research has concentrated on representations that are calculated in batch mode and represent each value with approximately equal fidelity. However, the increasing deployment of mobile devices and real-time sensors has brought home the need for representations that can be incrementally updated, and can approximate the data with fidelity proportional to its age. The latter property allows us to answer queries about the recent past with greater precision, since in many domains recent information is more useful than older information. We call such representations amnesic. While there has been previous work on amnesic representations, the class of amnesic functions possible was dictated by the representation itself. In this work, we introduce a novel representation of time series that can represent arbitrary, user-specified amnesic functions. For example, a meteorologist may decide that data that is twice as old can tolerate twice as much error, and thus, specify a linear amnesic function. In contrast, an econometrist might opt for an exponential amnesic function. We propose online algorithms for our representation, and discuss their properties. Finally, we perform an extensive empirical evaluation on 40 datasets, and show that our approach can efficiently maintain a high quality amnesic approximation. ICDE Authenticating Query Results in Edge Computing.
HweeHwa Pang,Kian-Lee Tan 2004 Edge computing pushes application logic and the underlying data to the edge of the network, with the aim of improving availability and scalability. As the edge servers are not necessarily secure, there must be provisions for validating their outputs. This paper proposes a mechanism that creates a verification object (VO) for checking the integrity of each query result produced by an edge server - that values in the result tuples are not tampered with, and that no spurious tuples are introduced. The primary advantages of our proposed mechanism are that the VO is independent of the database size, and that relational operations can still be fulfilled by the edge servers. These advantages reduce transmission load and processing at the clients. We also show how insert and delete transactions can be supported. ICDE Group Nearest Neighbor Queries. Dimitris Papadias,Qiongmao Shen,Yufei Tao,Kyriakos Mouratidis 2004 Given two sets of points P and Q, a group nearest neighbor (GNN) query retrieves the point(s) of P with the smallest sum of distances to all points in Q. Consider, for instance, three users at locations q1, q2 and q3 that want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances |pqi| for 1 ≤ i ≤ 3. Assuming that Q fits in memory and P is indexed by an R-tree, we propose several algorithms for finding the group nearest neighbors efficiently. As a second step, we extend our techniques for situations where Q cannot fit in memory, covering both indexed and non-indexed query points. An experimental evaluation identifies the best alternative based on the data and query properties. ICDE Integrating XML Data in the TARGIT OLAP System. Dennis Pedersen,Jesper Pedersen,Torben Bach Pedersen 2004 "This paper presents work on logical integration of OLAP and XML data sources, carried out in cooperation between TARGIT, a Danish OLAP client vendor, and Aalborg University. A prototype has been developed that allows XML data on the WWW to be used as dimensions and measures in the OLAP system in the same way as ordinary dimensions and measures, providing a powerful and flexible way to handle unexpected or short-term data requirements as well as rapidly changing data. Compared to earlier work, this paper presents several major extensions that resulted from TARGIT's requirements. These include the ability to use XML data as measures, as well as a novel multigranular data model and query language that formalizes and extends the TARGIT data model and query language." ICDE Data Mining for Intrusion Detection: Techniques, Applications and Systems. Jian Pei,Shambhu J. Upadhyaya,Faisal Farooq,Venugopal Govindaraju 2004 Data Mining for Intrusion Detection: Techniques, Applications and Systems. ICDE Selectivity Estimation for XML Twigs. Neoklis Polyzotis,Minos N. Garofalakis,Yannis E. Ioannidis 2004 Twig queries represent the building blocks of declarative query languages over XML data. A twig query describes a complex traversal of the document graph and generates a set of element tuples based on the intertwined evaluation (i.e., join) of multiple path expressions. Estimating the result cardinality of twig queries or, equivalently, the number of tuples in such a structural (path-based) join, is a fundamental problem that arises in the optimization of declarative queries over XML.
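The group nearest neighbor definition quoted in the Papadias et al. abstract above translates directly into a brute-force scan. The sketch below is that literal definition over made-up points; it is exactly the exhaustive evaluation the paper's R-tree algorithms are designed to avoid.

```python
from math import dist  # Python 3.8+: Euclidean distance between two points

def group_nearest_neighbor(P, Q):
    """Return the point p in P minimizing the sum of Euclidean distances to all q in Q.

    This is the GNN definition itself; the paper's contribution is answering it
    without scanning P, using an R-tree on P (and handling Q that exceeds memory).
    """
    return min(P, key=lambda p: sum(dist(p, q) for q in Q))

# Three users pick a meeting point among candidate data points.
Q = [(0, 0), (4, 0), (2, 3)]            # user locations q1, q2, q3
P = [(2, 1), (10, 10), (0, 4), (3, 0)]  # candidate data points
print(group_nearest_neighbor(P, Q))     # (2, 1) minimizes the distance sum here
```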
It is crucial, therefore, to developconcise synopsis structures that summarize the documentgraph and enable such selectivity estimates within thetime and space constraints of the optimizer. In this paper,we propose novel summarization and estimation techniquesfor estimating the selectivity of twig queries with complexXPath expressions over tree-structured data. Our approachis based on the XSKETCH model, augmented with new typesof distribution information for capturing complex correlationpatterns across structural joins. Briefly, the key ideais to represent joins as points in a multidimensional spaceof path counts that capture aggregate information on thecontents of the resulting element tuples. We develop a systematicframework that combines distribution informationwith appropriate statistical assumptions in order to provideselectivity estimates for twig queries over concise XS-KETCHsynopses and we describe an efficient algorithm forconstructing an accurate summary for a given space budget.Implementation results with both synthetic and real-lifedata sets verify the effectiveness of our approach anddemonstrate its benefits over earlier techniques. ICDE PRIX: Indexing And Querying XML Using Prüfer Sequences. Praveen Rao,Bongki Moon 2004 PRIX: Indexing And Querying XML Using Prüfer Sequences. ICDE Information Lifecycle Management: The EMC Perspective. David Reiner,Gil Press,Mike Lenaghan,David Barta,Rich Urmston 2004 "Information is a strategic component of modernbusiness, and its effective management has become acritical business challenge. Electronic information hasnot only been growing in volume at unprecedentedrates, its value to business has never been greater.Around-the-clock operations, electronic commerce,corporate governance rules and legally-mandatedretention laws have all added to the pressure for betterinformation management. Information LifecycleManagement (ILM) is a business-centric strategy forproactive management of information throughout itslife, from its creation and use to its ultimate disposal.EMC's ILM initiative is enhancing our customers'information management capabilities through dataclassification, centralized management, automation,product integration, and policy-based management.Research issues related to ILM include informationclassification and optimization of policy-basedinformation management." ICDE From Sipping on a Straw to Drinking from a Fire Hose: Data Integration in a Public Genome Database. Joel E. Richardson,James A. Kadin,Judith A. Blake,Carol J. Bult,Janan T. Eppig,Martin Ringwald 2004 Biology is a vast domain. The Mouse GenomeInformatics (MGI) system, which focuses on thebiology of the laboratory mouse, covers only a small,carefully chosen slice. Nevertheless, we deal with dataof immense variety, deep complexity, and exponentiallygrowing volume. Our role as an integration nexus is toadd value by combining data sets of diverse types andorigins, eliminating redundancy and resolvingconflicts. In this paper, we briefly describe some of theissues we face and approaches we have adopted to theintegration problem. ICDE A Type-Safe Object-Oriented Solution for the Dynamic Construction of Queries. Peter Rosenthal 2004 A Type-Safe Object-Oriented Solution for the Dynamic Construction of Queries. ICDE A Peer-to-peer Framework for Caching Range Queries. Ozgur D. Sahin,Abhishek Gupta,Divyakant Agrawal,Amr El Abbadi 2004 Peer-to-peer systems are mainly used for object sharingalthough they can provide the infrastructure for manyother applications. 
In this paper, we extend the idea of object sharing to data sharing on a peer-to-peer system. We propose a method, which is based on the multidimensional CAN system, for efficiently evaluating range queries. The answers of the range queries are cached at the peers and are used to answer future range queries. The scalability and efficiency of our design is shown through simulation. ICDE Unordered Tree Mining with Applications to Phylogeny. Dennis Shasha,Jason Tsong-Li Wang,Sen Zhang 2004 Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data, such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. In this paper we present a new FSM technique for finding patterns in rooted unordered labeled trees. The patterns of interest are cousin pairs in these trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|²) time, where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. To demonstrate the usefulness of our approach, we discuss its applications to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. We also describe extensions of our algorithms for undirected acyclic graphs (or free trees). ICDE NEXSORT: Sorting XML in External Memory. Adam Silberstein,Jun Yang 2004 "XML plays an important role in delivering data over the Internet, and the need to store and manipulate XML in its native format has become increasingly relevant. This growing need necessitates work on developing native XML operators, especially for one as fundamental as sort. In this paper we present NEXSORT, an algorithm that leverages the hierarchical nature of XML to efficiently sort an XML document in external memory. In a fully sorted XML document, children of every non-leaf element are ordered according to a given sorting criterion. Among NEXSORT's uses is in combination with structural merge as the XML version of sort-merge join, which allows us to merge large XML documents using only a single pass once they are sorted. The hierarchical structure of an XML document limits the number of possible legal orderings among its elements, which means that sorting XML is fundamentally "easier" than sorting a flat file. We prove that the I/O lower bound for sorting XML in external memory is Ω(max{n, n log_m(k/B)}), where n is the number of blocks in the input XML document, m is the number of main memory blocks available for sorting, B is the number of elements that can fit in one block, and k is the maximum fan-out of the input document tree. We show that NEXSORT performs within a constant factor of this theoretical lower bound. In practice we demonstrate, even with a naive implementation, NEXSORT significantly outperforms a regular external merge sort of all elements by their key paths, unless the XML document is nearly flat, in which case NEXSORT degenerates essentially to external merge sort." ICDE Proving Ownership over Categorical Data. Radu Sion 2004 "This paper introduces a novel method of rights protection for categorical data through watermarking. We discover new watermark embedding channels for relational data with categorical types.
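The Shasha, Wang and Zhang abstract above defines patterns as cousin pairs: nodes sharing a parent, grandparent, great-grandparent, and so on. The sketch below enumerates such pairs in a single rooted tree under one simplified reading of that notion (both nodes the same number of levels below their lowest common ancestor); the parent-map encoding and the distance cap are illustrative, and frequent-pair mining across many trees is not shown.

```python
from itertools import combinations

def cousin_pairs(parent, max_d=3):
    """Yield (u, v, d): unordered node pairs lying d levels below their lowest
    common ancestor (d = 1: siblings, d = 2: same grandparent, ...).

    `parent` maps each non-root node to its parent label.
    """
    def ancestors(n):                      # n, parent(n), grandparent(n), ...
        chain = [n]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    nodes = set(parent) | set(parent.values())
    for u, v in combinations(sorted(nodes), 2):
        au, av = ancestors(u), ancestors(v)
        common = set(au) & set(av)
        lca = min(common, key=au.index)    # first shared ancestor on u's chain
        du, dv = au.index(lca), av.index(lca)
        if du == dv and 1 <= du <= max_d:
            yield u, v, du

# A small phylogeny-like tree, given by child -> parent edges.
parent = {"cat": "A", "dog": "A", "frog": "B", "A": "root", "B": "root"}
print(sorted(cousin_pairs(parent)))
# [('A', 'B', 1), ('cat', 'dog', 1), ('cat', 'frog', 2), ('dog', 'frog', 2)]
```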
Wedesign novel watermark encoding algorithms andanalyze important theoretical bounds including markvulnerability. While fully preserving data qualityrequirements, our solution survives important attacks,such as subset selection and random alterations. Markdetection is fully ""blind"" in that it doesn't require theoriginal data, an important characteristic especiallyin the case of massive data. We propose variousimprovements and alternative encoding methods. Weperform validation experiments by watermarking theoutsourced Wal-Mart sales data available at ourinstitute. We prove (experimentally and by analysis)our solution to be extremely resilient to both alterationand data loss attacks, for example tolerating up to 80%data loss with a watermark alteration of only 25%." ICDE wmdb.: Rights Protection for Numeric Relational Data. Radu Sion,Mikhail J. Atallah,Sunil Prabhakar 2004 wmdb.: Rights Protection for Numeric Relational Data. ICDE Outrageous Ideas and/or Thoughts While Shaving. Michael Stonebraker 2004 Outrageous Ideas and/or Thoughts While Shaving. ICDE Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks. Torsten Suel,Patrick Noel,Dimitre Trendafilov 2004 We study the problem of maintaining large replicated collectionsof files or documents in a distributed environment withlimited bandwidth. This problem arises in a number of importantapplications, such as synchronization of data betweenaccounts or devices, content distibution and web caching networks,web site mirroring, storage networks, and large scaleweb search and mining. At the core of the problem lies thefollowing challenge, called the file synchronization problem:given two versions of a file on different machines, say an outdatedand a current one, how can we update the outdatedversion with minimum communication cost, by exploiting thesignificant similarity between the versions? While a popularopen source tool for this problem called rsync is used in hundredsof thousands of installations, there have been only veryfew attempts to improve upon this tool in practice.In this paper, we propose a framework for remote file synchronizationand describe several new techniques that resultin significant bandwidth savings. Our focus is on applicationswhere very large collections have to be maintainedover slow connections. We show that a prototype implementationof our framework and techniques achieves significantimprovements over rsync. As an example application, we focuson the efficient synchronization of very large web pagecollections for the purpose of search, mining, and contentdistribution. ICDE Querying about the Past, the Present, and the Future in Spatio-Temporal. Jimeng Sun,Dimitris Papadias,Yufei Tao,Bin Liu 2004 Querying about the Past, the Present, and the Future in Spatio-Temporal. ICDE Privacy Preservation for Data Cubes. Sam Yuan Sung,Yao Liu,Peter A. Ng 2004 Privacy Preservation for Data Cubes. ICDE Substructure Clustering on Sequential 3d Object Datasets. Zhenqiang Tan,Anthony K. H. 
Tung 2004 In this paper, we will look at substructure clustering of sequential 3d objects. A sequential 3d object is a set of points located in a three dimensional space that are linked up to form a sequence. Given a set of sequential 3d objects, our aim is to find significantly large substructures which are present in many of the sequential 3d objects. Unlike traditional subspace clustering methods in which objects are compared based on values in the same dimension, the matching dimensions between two 3d sequential objects are affected by both the translation and rotation of the objects and are thus not well defined. Instead, similarity between the objects is judged by computing a structural distance measure called rmsd (Root Mean Square Distance), which requires proper alignment (including translation and rotation) of the objects. As the computation of rmsd is expensive, we propose a new measure called ald (Angle Length Distance) which is shown experimentally to approximate rmsd. Based on ald, we define a new clustering model called sCluster and devise an algorithm for discovering all maximum sClusters in a 3d sequential dataset. Experiments are conducted to illustrate the efficiency and effectiveness of our algorithm. ICDE Spatio-Temporal Aggregation Using Sketches. Yufei Tao,George Kollios,Jeffrey Considine,Feifei Li,Dimitris Papadias 2004 Several spatio-temporal applications require the retrieval of summarized information about moving objects that lie in a query region during a query interval (e.g., the number of mobile users covered by a cell, traffic volume in a district, etc.). Existing solutions have the distinct counting problem: if an object remains in the query region for several timestamps during the query interval, it will be counted multiple times in the result. The paper solves this problem by integrating spatio-temporal indexes with sketches, traditionally used for approximate query processing. The proposed techniques can also be applied to reduce the space requirements of conventional spatio-temporal data and to mine spatio-temporal association rules. ICDE Approximate Temporal Aggregation. Yufei Tao,Dimitris Papadias,Christos Faloutsos 2004 Temporal aggregate queries retrieve summarized information about records with time-evolving attributes. Existing approaches have at least one of the following shortcomings: (i) they incur large space requirements, (ii) they have high processing cost and (iii) they are based on complex structures, which are not available in commercial systems. In this paper we solve these problems by approximation techniques with bounded error. We propose two methods: the first one is based on multi-version B-trees and has logarithmic worst-case query cost, while the second technique uses off-the-shelf B- and R-trees, and achieves the same performance in the expected case. We experimentally demonstrate that the proposed methods consume an order of magnitude less space than their competitors and are significantly faster, even for cases where the permissible error bound is very small. ICDE Mining Frequent Labeled and Partially Labeled Graph Patterns.
Natalia Vanetik,Ehud Gudes 2004 "Whereas data mining in structured data focuses on frequentdata values, in semi-structured and graph data theemphasis is on frequent labels and common topologies.Here, the structure of the data is just as important as itscontent.When data contains large amount of differentlabels, both fully labeled and partially data maybe useful.More informative patterns can be found in thedatabase if some of the pattern nodes can be regarded as'unlabeled'.We study the problem of discovering typicalfully and partially labeled patterns of graph data.Discovered patterns are useful in many applications, including:compact representation of source informationand a road-map for browsing and querying informationsources." ICDE ToMAS: A System for Adapting Mappings while Schemas Evolve. Yannis Velegrakis,Renée J. Miller,Lucian Popa,John Mylopoulos 2004 ToMAS: A System for Adapting Mappings while Schemas Evolve. ICDE Dynamic Extensible Query Processing in Super-Peer Based P2P Systems. Christian Wiesner,Alfons Kemper,Stefan Brandl 2004 Dynamic Extensible Query Processing in Super-Peer Based P2P Systems. ICDE BIDE: Efficient Mining of Frequent Closed Sequences. Jianyong Wang,Jiawei Han 2004 Previous studies have presented convincing argumentsthat a frequent pattern mining algorithm should not mineall frequent patterns but only the closed ones because thelatter leads to not only more compact yet complete resultset but also better efficiency. However, most of the previouslydeveloped closed pattern mining algorithms work underthe candidate maintenance-and-test paradigm which isinherently costly in both runtime and space usage when thesupport threshold is low or the patterns become long.In this paper, we present, BIDE, an efficient algorithmfor mining frequent closed sequences without candidatemaintenance. It adopts a novel sequence closure checkingscheme called BI-Directional Extension, and prunes thesearch space more deeply compared to the previous algorithmsby using the BackScan pruning method and the Scan-Skipoptimization technique. A thorough performance studywith both sparse and dense real-life data sets has demonstratedthat BIDE significantly outperforms the previous algorithms:it consumes order(s) of magnitude less memoryand can be more than an order of magnitude faster. It isalso linearly scalable in terms of database size. ICDE A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. Xiaodong Wu,Mong-Li Lee,Wynne Hsu 2004 Efficient evaluation of XML queries requires thedetermination of whether a relationship exists betweentwo elements. A number of labeling schemes have beendesigned to label the element nodes such that therelationships between nodes can be easily determinedby comparing their labels. With the increasedpopularity of XML on the web, finding a labelingscheme that is able to support order-sensitive queriesin the presence of dynamic updates becomes urgent. Inthis paper, we propose a new labeling scheme thattakes advantage of the unique property of primenumbers to meet this need. The global order of thenodes can be captured by generating simultaneouscongruence values from the prime number node labels.Theoretical analysis of the label size requirements forthe various labeling schemes is given. Experimentresults indicate that the prime number labeling schemeis compact compared to existing dynamic labelingschemes, and provides efficient support to order-sensitivequeries and updates. ICDE Extending XML Database to Support Open XML. Jinyu Wang,Kongyi Zhou,K. 
Karun,Mark Scardina 2004 XML is a widely accepted standard for exchangingbusiness data. To optimize the management of XMLand help companies build up their business partnernetworks over the Internet, database servers haveintroduced new XML storage and query features.However, each enterprise defines its own dataelements in XML and modifies the XML documents tohandle the evolving business needs. This makes XMLdata conform to heterogeneous schemas or schemasthat evolve over time, which is not suitable for XMLdatabase storage. This paper provides an overview ofthe current XML database strategies and presents astreaming metadata-processing approach, enablingdatabases to handle multiple XML formats seamlessly. ICDE Direct Mesh: a Multiresolution Approach to Terrain Visualization. Kai Xu,Xiaofang Zhou,Xuemin Lin 2004 Terrain can be approximated by a triangular mesh consistingmillions of 3D points. Multiresolution triangularmesh (MTM) structures are designed to support applicationsthat use terrain data at variable levels of detail (LOD).Typically, an MTM adopts a tree structure where a parentnode represents a lower-resolution approximation of its descendants.Given a region of interest (ROI) and a LOD,the process of retrieving the required terrain data from thedatabase is to traverse the MTM tree from the root to reachall the nodes satisfying the ROI and LOD conditions. Thisprocess, while being commonly used for multiresolution terrainvisualization, is inefficient as either a large numberof sequential I/O operations or fetching a large amount ofextraneous data is incurred. Various spatial indexes havebeen proposed in the past to address this problem, howeverlevel-by-level tree traversal remains a common practice inorder to obtain topological information among the retrievedterrain data. In this paper, a new MTM data structure calleddirect mesh is proposed. We demonstrate that with directmesh the amount of data retrieval can be substantially reduced.Comparing with existing MTM indexing methods,a significant performance improvement has been observedfor real-life terrain data. ICDE Benchmarking SAP R/3 Archiving Scenarios. Bernhard Zeller,Alfons Kemper 2004 According to a survey of the University of Berkeley,about 5 Exabytes of new information has been created in2002. This information explosion affects also the databasevolumes of enterprise resource planning (ERP) systems likeSAP R/3, the market leader for ERP systems. Just like theoverall information explosion, the database volumes of ERPsystems are growing at a tremendous rate and some of themhave reached a size of several Terabytes. OLTP (OnlineTransaction Processing) databases of this size are hard tomaintain and tend to perform poorly. One way to limit thesize of a database is data staging, i.e., to make use of anSAP technique called archiving. That is, data which arenot needed for every-day operations are demoted from thedatabase (disks) to tertiary storage (tapes). In cooperationwith our research group, SAP is adapting their archivingtechniques to accelerate the archiving process by integratingnew technologies like XML and advanced database features.However, so far no benchmark existed to evaluate differentarchiving scenarios and to measure the impact of a changein the archiving technique. We therefore designed and implementeda generic benchmark which is applicable to manydifferent system layouts and allows the users to evaluate variousarchiving scenarios. ICDE A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML. 
Ning Zhang,Varun Kacholia,M. Tamer Özsu 2004 Path expressions are ubiquitous in XML processing languages.Existing approaches evaluate a path expression byselecting nodes that satisfies the tag-name and value constraintsconstraints. In this paper, we propose a novel approach,and then joining them according to the structuralnext-of-kin (NoK) pattern matching, to speed up the node-selectionstep, and to reduce the join size significantly in thesecond step. To efficiently perform NoK pattern matching,we also propose a succinct XML physical storage schemethat is adaptive to updates and streaming XML as well. Ourperformance results demonstrate that the proposed storagescheme and path evaluation algorithm is highly efficient andoutperforms the other tested systems in most cases. ICDE Making the Pyramid Technique Robust to Query Types and Workloads. Rui Zhang,Beng Chin Ooi,Kian-Lee Tan 2004 The effectiveness of many existing high-dimensional indexingstructures is limited to specific types of queries andworkloads. For example, while the Pyramid technique andthe iMinMax are efficient for window queries, the iDistanceis superior for kNN queries. In this paper, we present anew structure, called the P+-tree, that supports both windowqueries and kNN queries under different workloads efficiently.In the P+-tree, a B+-tree is employed to indexthe data points as follows. The data space is partitionedinto subspaces based on clustering, and points in each subspaceare mapped onto a single dimensional space using thePyramid technique, and stored in the B+-tree. The crux ofthe scheme lies in the transformation of the data which hastwo crucial properties. First, it maps each subspace intoa hypercube so that the Pyramid technique can be applied.Second, it shifts the cluster center to the top of the pyramid,which is the case that the Pyramid technique worksvery efficiently. We present window and kNN query processingalgorithms for the P+-tree. Through an extensiveperformance study, we show that the P+-tree has considerablespeedup over the Pyramid technique and the iMinMaxfor window queries and outperforms the iDistance for kNN queries. ICDE Data Management in Location-Dependent Information Services. Baihua Zheng,Jianliang Xu,Wang-Chien Lee 2004 Data Management in Location-Dependent Information Services. ICDE XBench Benchmark and Performance Testing of XML DBMSs. Benjamin Bin Yao,M. Tamer Özsu,Nitin Khandelwal 2004 XML support is being added to existing database managementsystems (DBMSs) and native XML systems are beingdeveloped both in industry and in academia. The individualperformance characteristics of these approachesas well as the relative performance of various systems isan ongoing concern. In this paper we discuss the XBenchXML benchmark and report on the relative performance ofvarious DBMSs. XBench is a family of XML benchmarkswhich recognizes that the XML data that DBMSs manageare quite varied and no one database schema and workloadcan properly capture this variety. Thus, the members of thisbenchmark family have been defined for capturing diverseapplication domains. ICDE GenExplore: Interactive Exploration of Gene Interactions from Microarray Data. Yong Ye,Xintao Wu,Kalpathi R. Subramanian,Liying Zhang 2004 DNA Microarray provides a powerful basis for analysisof gene expression. Data mining methods such as clusteringhave been widely applied to microarray data to link genesthat show similar expression patterns. However, this approachusually fails to unveil gene-gene interactions in thesame cluster. 
In this project, we propose to combine graphicalmodel based interaction analysis with other data miningtechniques (e.g., association rule, hierarchical clustering)for this purpose. For interaction analysis, we propose theuse of Graphical Gaussian Modelto discover pairwise geneinteractions and loglinear model to discover multi-gene interactions.We have constructed a prototype system that permitsrapid interactive exploration of gene relationships. ICDE Hiding Data Accesses in Steganographic File System. Xuan Zhou,HweeHwa Pang,Kian-Lee Tan 2004 To support ubiquitous computing, the underlying datahave to be persistent and available anywhere-anytime. Thedata thus have to migrate from devices local to individualcomputers, to shared storage volumes that are accessibleover open network. This potentially exposes the datato heightened security risks. We propose two mechanisms,in the context of a steganographic file system, to mitigatethe risk of attacks initiated through analyzing data accessesfrom user applications. The first mechanism is intended tocounter attempts to locate data through updates in betweensnapshots - in short, update analysis. The second mechanismprevents traffic analysis - identifying data from I/Otraffic patterns. We have implemented the first mechanismon Linux and conducted experiments to demonstrate its effectivenessand practicality. Simulation results on the secondmechanism also show its potential for real world applications. ICDE CrossMine: Efficient Classification Across Multiple Database Relations. Xiaoxin Yin,Jiawei Han,Jiong Yang,Philip S. Yu 2004 "Most of today's structured data is stored in relationaldatabases. Such a database consists of multiplerelations which are linked together conceptually viaentity-relationship links in the design of relational databaseschemas. Multi-relational classification can be widelyused in many disciplines, such as financial decision making,medical research, and geographical applications.However, most classification approaches only work on single""flat"" data relations. It is usually difficult to convertmultiple relations into a single flat relation without eitherintroducing huge, undesirable ""universal relation"" orlosing essential information. Previous works using InductiveLogic Programming approaches (recently also knownas Relational Mining) have proven effective with high accuracyin multi-relational classification. Unfortunately,they suffer from poor scalability w.r.t. the number of relationsand the number of attributes in databases.In this paper we propose CrossMine, an efficientand scalable approach for multi-relational classification.Several novel methods are developed in CrossMine,including (1) tuple ID propagation, which performssemantics-preserving virtual join to achieve high efficiencyon databases with complex schemas, and (2) a selectivesampling method, which makes it highly scalablew.r.t. the number of tuples in the databases. Both theoreticalbackgrounds and implementation techniques ofCrossMine are introduced. Our comprehensive experimentson both real and synthetic databases demonstratethe high scalability and accuracy of CrossMine." ICDE Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA 2004 Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA SIGMOD Conference Information-Theoretic Tools for Mining Database Structure from Large Data Sets. Periklis Andritsos,Renée J. 
Miller,Panayiotis Tsaparas 2004 Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets. SIGMOD Conference The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree. Lars Arge,Mark de Berg,Herman J. Haverkort,Ke Yi 2004 We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((N/B)1 1/d + T/B) I/Os, where N is the number of d-dimensional (hyper-) rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other R-tree variants, where a query may visit all N/B leaves in the tree even when T = 0. We also present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. This study shows that the PR-tree performs similar to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data. SIGMOD Conference Static Optimization of Conjunctive Queries with Sliding Windows Over Infinite Streams. Ahmed Ayad,Jeffrey F. Naughton 2004 "We define a framework for static optimization of sliding window conjunctive queries over infinite streams. When computational resources are sufficient, we propose that the goal of optimization should be to find an execution plan that minimizes resource usage within the available resource constraints. When resources are insufficient, on the other hand, we propose that the goal should be to find an execution plan that sheds some of the input load (by randomly dropping tuples) to keep resource usage within bounds while maximizing the output rate. An intuitive approach to load shedding suggests starting with the plan that would be optimal if resources were sufficient and adding ""drop boxes"" to this plan. We find this to be often times suboptimal - in many instances the optimal partial answer plan results from adding drop boxes to plans that are not optimal in the unlimited resource case. In view of this, we use our framework to investigate an approach to optimization that unifies the placement of drop boxes and the choice of the query plan from which to drop tuples. 
The effectiveness of our optimizer is experimentally validated and the results show the promise of this approach." SIGMOD Conference Adaptive Ordering of Pipelined Stream Filters. Shivnath Babu,Rajeev Motwani,Kamesh Munagala,Itaru Nishizawa,Jennifer Widom 2004 We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We identify a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We develop a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms in the STREAM prototype Data Stream Management System and a thorough performance evaluation is presented. SIGMOD Conference Lazy Query Evaluation for Active XML. Serge Abiteboul,Omar Benjelloun,Bogdan Cautis,Ioana Manolescu,Tova Milo,Nicoleta Preda 2004 In this paper, we study query evaluation on Active XML documents (AXML for short), a new generation of XML documents that has recently gained popularity. AXML documents are XML documents whose content is given partly extensionally, by explicit data elements, and partly intensionally, by embedded calls to Web services, which can be invoked to generate data.A major challenge in the efficient evaluation of queries over such documents is to detect which calls may bring data that is relevant for the query execution, and to avoid the materialization of irrelevant information. The problem is intricate, as service calls may be embedded anywhere in the document, and service invocations possibly return data containing calls to new services. Hence, the detection of relevant calls becomes a continuous process. Also, a good analysis must take the service signatures into consideration.We formalize the problem, and provide algorithms to solve it. We also present an implementation that is compliant with XML and Web services standards, and is used as part of the ActiveXML system. Finally, we experimentally measure the performance gains obtained by a careful filtering of the service calls to be triggered. SIGMOD Conference Order-Preserving Encryption for Numeric Data. Rakesh Agrawal,Jerry Kiernan,Ramakrishnan Srikant,Yirong Xu 2004 Encryption is a well established technology for protecting sensitive data. However, once encrypted, data can no longer be easily queried aside from exact matches. We present an order-preserving encryption scheme for numeric data that allows any comparison operation to be directly applied on encrypted data. Query results produced are sound (no false hits) and complete (no false drops). 
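For the pipelined stream filters considered in the Babu et al. abstract above, the classical static result for independent, commutative filters is to run them in ascending cost / (1 - pass-rate) order, so that cheap, highly selective filters prune the stream early. The sketch below applies that baseline rule to made-up filters; it is background for why ordering matters, not the adaptive A-Greedy algorithm itself.

```python
def order_filters(filters):
    """Order independent, commutative filters by ascending cost / (1 - pass_rate).

    A cheap filter that drops many tuples should run first, so later (possibly
    expensive) filters see fewer tuples. Adaptive schemes such as A-Greedy
    effectively re-derive such an order continuously as observed costs and
    selectivities drift.
    """
    return sorted(filters, key=lambda f: f["cost"] / (1.0 - f["pass_rate"]))

def expected_cost_per_tuple(ordered):
    """Expected work per input tuple for a given order, assuming independence."""
    total, survive = 0.0, 1.0
    for f in ordered:
        total += survive * f["cost"]
        survive *= f["pass_rate"]
    return total

# Hypothetical filters: per-tuple cost and fraction of tuples that pass.
filters = [
    {"name": "regex_url",  "cost": 5.0, "pass_rate": 0.9},
    {"name": "port_check", "cost": 0.5, "pass_rate": 0.3},
    {"name": "geo_lookup", "cost": 2.0, "pass_rate": 0.5},
]

best = order_filters(filters)
print([f["name"] for f in best], expected_cost_per_tuple(best))
# ['port_check', 'geo_lookup', 'regex_url'] 1.85
```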
Our scheme handles updates gracefully and new values can be added without requiring changes in the encryption of other values. It allows standard databse indexes to be built over encrypted tables and can easily be integrated with existing database systems. The proposed scheme has been designed to be deployed in application environments in which the intruder can get access to the encrypted database, but does not have prior domain information such as the distribution of values and annot encrypt or decrypt arbitrary values of his choice. The encryption is robust against estimation of the true value in such environments. SIGMOD Conference Hosting the .NET Runtime in Microsoft SQL Server. Alazel Acheson,Mason Bendixen,José A. Blakeley,Peter Carlin,Ebru Ersan,Jun Fang,Xiaowei Jiang,Christian Kleinerman,Balaji Rathakrishnan,Gideon Schaller,Beysim Sezgin,Ramachandran Venkatesh,Honggang Zhang 2004 "The integration of the .NET Common Language Runtime (CLR) inside the SQL Server DBMS enables database programmers to write business logic in the form of functions, stored procedures, triggers, data types, and aggregates using modern programming languages such as C#, Visual Basic, C++, COBOL, and J++. This paper presents three main aspects of this work. First, it describes the architecture of the integration of the CLR inside the SQL Server database process to provide a safe, scalable, secure, and efficient environment to run user code. Second, it describes our approach to defining and enforcing extensibility contracts to allow a tight integration of types, aggregates, functions, triggers, and procedures written in modern languages with the DBMS. Finally, it presents initial performance results showing the efficiency of user-defined types and functions relative to equivalent native DBMS features." SIGMOD Conference Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design. Sanjay Agrawal,Vivek R. Narasayya,Beverly Yang 2004 In addition to indexes and materialized views, horizontal and vertical partitioning are important aspects of physical design in a relational database system that significantly impact performance. Horizontal partitioning also provides manageability; database administrators often require indexes and their underlying tables partitioned identically so as to make common operations such as backup/restore easier. While partitioning is important, incorporating partitioning makes the problem of automating physical design much harder since: (a) The choices of partitioning can strongly interact with choices of indexes and materialized views. (b) A large new space of physical design alternatives must be considered. (c) Manageability requirements impose a new constraint on the problem. In this paper, we present novel techniques for designing a scalable solution to this integrated physical design problem that takes both performance and manageability into account. We have implemented our techniques and evaluated it on Microsoft SQL Server. Our experiments highlight: (a) the importance of taking an integrated approach to automated physical design and (b) the scalability of our techniques. SIGMOD Conference Managing Healthcare Data Hippocratically. Rakesh Agrawal,Ameet Kini,Kristen LeFevre,Amy Wang,Yirong Xu,Diana Zhou 2004 Managing Healthcare Data Hippocratically. SIGMOD Conference Enabling Sovereign Information Sharing Using Web Services. 
Rakesh Agrawal,Dmitri Asonov,Ramakrishnan Srikant 2004 Sovereign information sharing allows autonomous entities to compute queries across their databases in such a way that nothing apart from the result is revealed. We describe an implementation of this model using web services infrastructure. Each site participating in sovereign sharing offers a data service that allows database operations to be applied on the tables they own. Of particular interest is the provision for binary operations such as relational joins. Applications are developed by combining these data services. We present performance measurements that show the promise of a new breed of practical applications based on the paradigm of sovereign information integration. SIGMOD Conference StreaMon: An Adaptive Engine for Stream Query Processing. Shivnath Babu,Jennifer Widom 2004 StreaMon is the adaptive query processing engine of the STREAM prototype Data Stream Management System (DSMS) [4]. A fundamental challenge in many DSMS applications (e.g., network monitoring, financial monitoring over stock tickers, sensor processing) is that conditions may vary significantly over time. Since queries in these systems are usually long-running, or continuous [4], it is important to consider adaptive approaches to query processing. Without adaptivity, performance may drop drastically as stream data and arrival characteristics, query loads, and system conditions change over time. StreaMon uses several techniques to support adaptive query processing [1, 2, 3]; we demonstrate three of them: • Reducing run-time memory requirements for continuous queries by exploiting stream data and arrival patterns. • Adaptive join ordering for pipelined multiway stream joins, with strong quality guarantees. • Placing subresult caches adaptively in pipelined multiway stream joins to avoid recomputation of intermediate results. SIGMOD Conference Model-Driven Business UI based on Maps. Per Bendsen 2004 "Future business applications will often have more than 2,000 forms and need to target several user interface (UI) technologies including: Web Browsers, Windows® Applications, PDA's, and cell phones. The applications will need state-of-the-art layout combined with excellent usability with specially built forms that handle specific tasks based on user roles. How can the trade-off between developer productivity and user experience be handled? The technologies being implemented in Microsoft® Business Framework include a model-driven business UI platform that exploits flexible maps and a layered form definition. The framework generates forms based on a model of the business logic, which is an integrated part of the business framework. The generation process uses declarative and changeable maps so that the process can be controlled and modified by the business developer." SIGMOD Conference FleXPath: Flexible Structure and Full-Text Querying for XML. Sihem Amer-Yahia,Laks V. S. Lakshmanan,Shashank Pandit 2004 "Querying XML data is a well-explored topic with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying XML documents is full-text search on textual content. In this paper, we study fundamental challenges that arise when we try to integrate these two querying paradigms. While keyword search is based on approximate matching, XPath has exact match semantics. 
We address this mismatch by considering queries on structure as a ""template"", and looking for answers that best match this template and the full-text search. To achieve this, we provide an elegant definition of relaxation on structure and define primitive operators to span the space of relaxations. Query answering is now based on ranking potential answers on structural and full-text search conditions. We set out certain desirable principles for ranking schemes and propose natural ranking schemes that adhere to these principles. We develop efficient algorithms for answering top-K queries and discuss results from a comprehensive set of experiments that demonstrate the utility and scalability of the proposed framework and algorithms." SIGMOD Conference Load Management and High Availability in the Medusa Distributed Stream Processing System. Magdalena Balazinska,Hari Balakrishnan,Michael Stonebraker 2004 Medusa [3, 6] is a distributed stream processing system based on the Aurora single-site stream processing engine [1]. We demonstrate how Medusa handles time-varying load spikes and provides high availability in the face of network partitions. We demonstrate Medusa in the context of Borealis, a second generation stream processing engine based on Aurora and Medusa. SIGMOD Conference The Price of Validity in Dynamic Networks. Mayank Bawa,Aristides Gionis,Hector Garcia-Molina,Rajeev Motwani 2004 Massive-scale self-administered networks like Peer-to-Peer and Sensor Networks have data distributed across thousands of participant hosts. These networks are highly dynamic with short-lived hosts being the norm rather than an exception. In recent years, researchers have investigated best-effort algorithms to efficiently process aggregate queries (e.g., sum, count, average, minimum and maximum) on these networks. Unfortunately, query semantics for best-effort algorithms are ill-defined, making it hard to reason about guarantees associated with the result returned. In this paper, we specify a correctness condition, Single-Site Validity, with respect to which the above algorithms are best-effort. We present a class of algorithms that guarantee validity in dynamic networks. Experiments on real-life and synthetic network topologies validate performance of our algorithms, revealing the hitherto unknown price of validity. SIGMOD Conference Incremental Evaluation of Schema-Directed XML Publishing. Philip Bohannon,Peter Buneman,Byron Choi,Wenfei Fan 2004 Incremental Evaluation of Schema-Directed XML Publishing. SIGMOD Conference BODHI: A Database Habitat for Bio-diversity Information. Srikanta J. Bedathur,Abhijit Kadlag,Jayant R. Haritsa 2004 BODHI: A Database Habitat for Bio-diversity Information. SIGMOD Conference Computing Clusters of Correlation Connected Objects. Christian Böhm,Karin Kailing,Peer Kröger,Arthur Zimek 2004 The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent from a combination of several other features. Well-known methods like the principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. 
In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS. SIGMOD Conference Liquid Data for WebLogic: Integrating Enterprise Data and Services. Vinayak R. Borkar 2004 "Information in today's enterprises commonly resides in a variety of heterogeneous data sources, including relational databases, web services, files, packaged applications, and custom data repositories. BEA's enterprise information integration product, Liquid Data for WebLogic, takes an XML-based approach to providing integrated access to such heterogeneous information. This demonstration highlights the XML technologies involved - including web services, XQuery, and XML Schema - and shows how they can be brought to bear on the enterprise information integration problem. The demonstration uses a simple end-to-end example, one that involves integrating data from relational databases and web services, to walk the audience through the overall architecture, XML-based data modeling approach, programming model, declarative query and view facilities, and distributed processing features of Liquid Data." SIGMOD Conference Data Stream Management for Historical XML Data. Sujoe Bose,Leonidas Fegaras 2004 We are presenting a framework for continuous querying of time-varying streamed XML data. A continuous stream in our framework consists of a finite XML document followed by a continuous stream of updates. The unit of update is an XML fragment, which can relate to other fragments through system-generated unique IDs. The reconstruction of temporal data from continuous updates at a current time is never materialized and historical queries operate directly on the fragmented streams. We are incorporating temporal constructs to XQuery with minimal changes to the existing language structure to support continuous querying of time-varying streams of XML data. Our extensions use time projections to capture time-sliding windows, version control for tuple-based windows, and coincidence queries to synchronize events between streams. These XQuery extensions are compiled away to standard XQuery code and the resulting queries operate continuously over the existing fragmented streams. SIGMOD Conference A TeXQuery-Based XML Full-Text Search Engine. Chavdar Botev,Jayavel Shanmugasundaram,Sihem Amer-Yahia 2004 A TeXQuery-Based XML Full-Text Search Engine. SIGMOD Conference Optimization of Query Streams Using Semantic Prefetching. Ivan T. Bowman,Kenneth Salem 2004 "Streams of relational queries submitted by client applications to database servers contain patterns that can be used to predict future requests. We present the Scalpel system, which detects these patterns and optimizes request streams using context-based predictions of future requests. 
Scalpel uses its predictions to provide a form of semantic prefetching, which involves combining a predicted series of requests into a single request that can be issued immediately. Scalpel's semantic prefetching reduces not only the latency experienced by the application but also the total cost of query evaluation. We describe how Scalpel learns to predict optimizable request patterns by observing the application's request stream during a training phase. We also describe the types of query pattern rewrites that Scalpel's cost-based optimizer considers. Finally, we present empirical results that show the costs and benefits of Scalpel's optimizations." SIGMOD Conference Declarative Specification of Web Applications exploiting Web Services and Workflows. Marco Brambilla,Stefano Ceri,Sara Comai,Marco Dario,Piero Fraternali,Ioana Manolescu 2004 This demo presents an extension of a declarative language for specifying data-intensive Web applications. We demonstrate a scenario extracted from a real-life application, the Web portal of a computer manufacturer, including interactions with third-party service providers and enabling distributors to participate in well-defined business processes. The crucial advantage of our framework is the high-level modeling of a complex Web application, extended with Web service and workflow capabilities. The application is automatically verified for correctness and the code is automatically generated and deployed. SIGMOD Conference Conditional Selectivity for Statistics on Query Expressions. Nicolas Bruno,Surajit Chaudhuri 2004 Cardinality estimation during query optimization relies on simplifying assumptions that usually do not hold in practice. To diminish the impact of inaccurate estimates during optimization, statistics on query expressions (SITs) have been previously proposed. These statistics help directly model the distribution of tuples on query sub-plans. Past work in statistics on query expressions has exploited view matching technology to harness their benefits. In this paper we argue against such an approach as it overlooks significant opportunities for improvement in cardinality estimations. We then introduce a framework to reason with SITs based on the notion of conditional selectivity. We present a dynamic programming algorithm to efficiently find the most accurate selectivity estimation for given queries, and discuss how such an approach can be incorporated into existing optimizers with a small number of changes. Finally, we demonstrate experimentally that our technique results in more accurate cardinality estimations than previous approaches with very little overhead. SIGMOD Conference MAIDS: Mining Alarming Incidents from Data Streams. Y. Dora Cai,David Clutter,Greg Pape,Jiawei Han,Michael Welge,Loretta Auvil 2004 MAIDS: Mining Alarming Incidents from Data Streams. SIGMOD Conference XML in the Middle: XQuery in the WebLogic Platform. Michael J. Carey 2004 "The BEA WebLogic Platform product suite consists of WebLogic Server, WebLogic Workshop, WebLogic Integration, WebLogic Portal, and Liquid Data for WebLogic. W3C standards including XML, XML Schema, and the emerging XML query language XQuery play important roles in several of these products. This industrial presentation will discuss the increasingly central role of XML in the middle tier of enterprise IT architectures and cover some of the key XML technologies that the BEA WebLogic Platform provides for creating enterprise applications in today's IT world. 
We focus in particular on how XQuery fits into this picture, both for today's WebLogic Platform 8.1 and going forward in terms of the Platform roadmap." SIGMOD Conference Efficient Development of Data Migration Transformations. Paulo J. F. Carreira,Helena Galhardas 2004 In this paper, we present a data migration tool named DATA FUSION. Its main features are: a domain specific language designed to conveniently model complex data transformations; an integrated development environment that assists users in managing complex data transformation projects; and an auditing facility that provides relevant information to project managers and external auditors. SIGMOD Conference Automatic Categorization of Query Results. Kaushik Chakrabarti,Surajit Chaudhuri,Seung-won Hwang 2004 "Exploratory ad-hoc queries could return too many answers - a phenomenon commonly referred to as ""information overload"". In this paper, we propose to automatically categorize the results of SQL queries to address this problem. We dynamically generate a labeled, hierarchical category structure - users can determine whether a category is relevant or not by simply examining its label; they can then explore just the relevant categories and ignore the remaining ones, thereby reducing information overload. We first develop analytical models to estimate information overload faced by a user for a given exploration. Based on those models, we formulate the categorization problem as a cost optimization problem and develop heuristic algorithms to compute the min-cost categorization." SIGMOD Conference Effective Use of Block-Level Sampling in Statistics Estimation. Surajit Chaudhuri,Gautam Das,Utkarsh Srivastava 2004 Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods. SIGMOD Conference Estimating Progress of Long Running SQL Queries. Surajit Chaudhuri,Vivek R. Narasayya,Ravishankar Ramamurthy 2004 Estimating Progress of Long Running SQL Queries. SIGMOD Conference BLAS: An Efficient XPath Processing System. Yi Chen,Susan B. Davidson,Yifeng Zheng 2004 We present BLAS, a Bi-LAbeling based System, for efficiently processing complex XPath queries over XML data. BLAS uses P-labeling to process queries involving consecutive child axes, and D-labeling to process queries involving descendant axes traversal. The XML data is stored in labeled form, and indexed to optimize descendant axis traversals. 
Three algorithms are presented for translating complex XPath queries to SQL expressions, and two alternate query engines are provided. Experimental results demonstrate that the BLAS system has a substantial performance improvement compared to traditional XPath processing using D-labeling. SIGMOD Conference Cost-Based Labeling of Groups of Mass Spectra. Lei Chen,Zheng Huang,Raghu Ramakrishnan 2004 We make two main contributions in this paper. First, we motivate and introduce a novel class of data mining problems that arise in labeling a group of mass spectra, specifically for analysis of atmospheric aerosols, but with natural applications to market-basket datasets. This builds upon other recent work in which we introduced the problem of labeling a single spectrum, and is motivated by the advent of a new generation of Aerosol Time-of-Flight Spectrometers, which are capable of generating mass spectra for hundreds of aerosol particles per minute. We also describe two algorithms for group labeling, which differ greatly in how they utilize a linear programming (LP) solver, and also differ greatly from algorithms for labeling a single spectrum. Our second contribution is to show how to automatically select between these two algorithms in a cost-based manner, analogous to how a query optimizer selects from a space of query plans. While the details are specific to the labeling problem, we believe this is a promising first step towards a general framework for cost-based data mining, and opens up an important direction for future research. SIGMOD Conference Querying at Internet-Scale. Brent N. Chun,Joseph M. Hellerstein,Ryan Huebsch,Shawn R. Jeffery,Boon Thau Loo,Sam Mardanbeigi,Timothy Roscoe,Sean C. Rhea,Scott Shenker,Ion Stoica 2004 "We are developing a distributed query processor called PIER, which is designed to run on the scale of the entire Internet. PIER utilizes a Distributed Hash Table (DHT) as its communication substrate in order to achieve scalability, reliability, decentralized control, and load balancing. PIER enhances DHTs with declarative and algebraic query interfaces, and underneath those interfaces implements multihop, in-network versions of joins, aggregation, recursion, and query/result dissemination. PIER is currently being used for diverse applications, including network monitoring, keyword-based filesharing search, and network topology mapping. We will demonstrate PIER's functionality by showing system monitoring queries running on PlanetLab, a testbed of over 300 machines distributed across the globe." SIGMOD Conference Spatially-decaying aggregation over a network: model and algorithms. Edith Cohen,Haim Kaplan 2004 Data items are often associated with a location in which they are present or collected, and their relevance or influence decays with their distance. Aggregate values over such data thus depend on the observing location, where the weight given to each item depends on its distance from that location. We term such aggregation spatially-decaying. Spatially-decaying aggregation has numerous applications: Individual sensor nodes collect readings of an environmental parameter such as contamination level or parking spot availability; the nodes then communicate to integrate their readings so that each location obtains contamination level or parking availability in its neighborhood. Nodes in a p2p network could use a summary of content and properties of nodes in their neighborhood in order to guide search. 
In graphical databases such as Web hyperlink structure, properties such as subject of pages that can reach or be reached from a page using link traversals provide information on the page. We formalize the notion of spatially-decaying aggregation and develop efficient algorithms for fundamental aggregation functions, including sums and averages, random sampling, heavy hitters, quantiles, and Lp norms. SIGMOD Conference FARMER: Finding Interesting Rule Groups in Microarray Datasets. Gao Cong,Anthony K. H. Tung,Xin Xu,Feng Pan,Jiong Yang 2004 Microarray datasets typically contain a large number of columns but a small number of rows. Association rules have been proved to be useful in analyzing such datasets. However, most existing association rule mining algorithms are unable to efficiently handle datasets with a large number of columns. Moreover, the number of association rules generated from such datasets is enormous due to the large number of possible column combinations. In this paper, we describe a new algorithm called FARMER that is specially designed to discover association rules from microarray datasets. Instead of finding individual association rules, FARMER finds interesting rule groups which are essentially a set of rules that are generated from the same set of rows. Unlike conventional rule mining algorithms, FARMER searches for interesting rules in the row enumeration space and exploits all user-specified constraints including minimum support, confidence and chi-square to support efficient pruning. Several experiments on real bioinformatics datasets show that FARMER is orders of magnitude faster than previous association rule mining algorithms. SIGMOD Conference Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data. Graham Cormode,Flip Korn,S. Muthukrishnan,Divesh Srivastava 2004 "Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions (""heavy hitters""). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions. We identify and study two variants of HHHs for multi-dimensional data, namely the ""overlap"" and ""split"" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions (""diamond"") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions." 
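As a rough illustration of the hierarchical heavy hitter notion used in the abstract above, the sketch below computes exact, offline HHHs over a single hierarchical dimension: counts are rolled up bottom-up, and a node is reported only if its count, after discounting counts already attributed to heavy-hitter descendants, reaches a threshold. This is a minimal sketch under stated assumptions (exact counts, one dimension, hypothetical function name, path encoding, and data); it is not the paper's one-pass approximate multi-dimensional algorithms.

```python
from collections import Counter

def hierarchical_heavy_hitters(paths, threshold):
    """Exact offline HHHs on one hierarchy.

    paths     -- iterable of tuples giving the path from the hierarchy root
                 to the item, e.g. ('10', '10.1') for a prefix hierarchy.
    threshold -- minimum discounted count for a node to be reported.
    Returns a dict mapping hierarchy nodes (as path tuples) to counts.
    """
    leaf_counts = Counter(paths)            # exact counts per observed path
    hhh = {}
    depth = max(len(p) for p in leaf_counts)
    current = Counter(leaf_counts)          # working counts, keyed by path
    for level in range(depth, 0, -1):       # process deepest level first
        next_level = Counter()
        for path, count in current.items():
            if len(path) != level:          # not at this level yet; carry over
                next_level[path] += count
                continue
            if count >= threshold:
                hhh[path] = count           # report; do not propagate upward
            else:
                parent = path[:-1]          # roll residual count up one level
                if parent:
                    next_level[parent] += count
        current = next_level
    return hhh

# Hypothetical example: traffic rolled up by IP prefix.
packets = [('10',), ('10', '10.1'), ('10', '10.1'), ('10', '10.2'),
           ('192',), ('192', '192.168'), ('192', '192.168')] * 10
print(hierarchical_heavy_hitters(packets, threshold=15))
```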
SIGMOD Conference An Indexing Framework for Peer-to-Peer Systems. Adina Crainiceanu,Prakash Linga,Ashwin Machanavajjhala,Johannes Gehrke,Jayavel Shanmugasundaram 2004 An Indexing Framework for Peer-to-Peer Systems. SIGMOD Conference Parallel SQL Execution in Oracle 10g. Thierry Cruanes,Benoît Dageville,Bhaskar Ghosh 2004 "This paper describes the new architecture and optimizations for parallel SQL execution in the Oracle 10g database. Based on the fundamental shared-disk architecture underpinning Oracle's parallel SQL execution engine since Oracle7, we show in this paper how Oracle's engine responds to the challenges of performing in new grid-computing environments. This is made possible by using advanced optimization techniques, which enable Oracle to exploit data and system architecture dynamically without being constrained by them. We show how we have evolved and re-architected our engine in Oracle 10g to make it more efficient and manageable by using a single global parallel plan model." SIGMOD Conference dbSwitch™ - Towards a Database Utility. Shaul Dar,Gil Hecht,Eden Shochat 2004 dbSwitch™ - Towards a Database Utility. SIGMOD Conference Approximation Techniques for Spatial Data. Abhinandan Das,Johannes Gehrke,Mirek Riedewald 2004 Spatial Database Management Systems (SDBMS), e.g., Geographical Information Systems, that manage spatial objects such as points, lines, and hyper-rectangles, often have very high query processing costs. Accurate selectivity estimation during query optimization therefore is crucially important for finding good query plans, especially when spatial joins are involved. Selectivity estimation has been studied for relational database systems, but to date has only received little attention in SDBMS. In this paper, we introduce novel methods that permit high-quality selectivity estimation for spatial joins and range queries. Our techniques can be constructed in a single scan over the input, handle inserts and deletes to the database incrementally, and hence they can also be used for processing of streaming spatial data. In contrast to previous approaches, our techniques return approximate results that come with provable probabilistic quality guarantees. We present a detailed analysis and experimentally demonstrate the efficacy of the proposed techniques. SIGMOD Conference Compressing Historical Information in Sensor Networks. Antonios Deligiannakis,Yannis Kotidis,Nick Roussopoulos 2004 We are inevitably moving into a realm where small and inexpensive wireless devices would be seamlessly embedded in the physical world and form a wireless sensor network in order to perform complex monitoring and computational tasks. Such networks pose new challenges in data processing and dissemination because of the limited resources (processing, bandwidth, energy) that such devices possess. In this paper we propose a new technique for compressing multiple streams containing historical data from each sensor. Our method exploits correlation and redundancy among multiple measurements on the same sensor and achieves high degree of data reduction while managing to capture even the smallest details of the recorded measurements. The key to our technique is the base signal, a series of values extracted from the real measurements, used for encoding piece-wise linear correlations among the collected data values. We provide efficient algorithms for extracting the base signal features from the data and for encoding the measurements using these features. 
Our experiments demonstrate that our method by far outperforms standard approximation techniques like Wavelets, Histograms and the Discrete Cosine Transform, on a variety of error metrics and for real datasets from different domains. SIGMOD Conference Service-Oriented BI: Towards tight integration of business intelligence into operational applications. Marcus Dill,Achim Kraiss,Stefan Sigg,Thomas Zurek 2004 Service-Oriented BI: Towards tight integration of business intelligence into operational applications. SIGMOD Conference Joining Interval Data in Relational Databases. Jost Enderle,Matthias Hampel,Thomas Seidl 2004 The increasing use of temporal and spatial data in present-day relational systems necessitates an efficient support of joins on interval-valued attributes. Standard join algorithms do not support those data types adequately, whereas special approaches for interval joins usually require an augmentation of the internal access methods which is not supported by existing relational systems. To overcome these problems we introduce new join algorithms for interval data. Based on the Relational Interval Tree, these algorithms can easily be implemented on top of any relational database system while providing excellent performance on joining intervals. As experimental results on an Oracle9i server show, the new techniques outperform existing relational methods for joining intervals significantly. SIGMOD Conference Yoo-Hoo! Building a Presence Service with XQuery and WSDL. Mary F. Fernández,Nicola Onose,Jérôme Siméon 2004 Yoo-Hoo! Building a Presence Service with XQuery and WSDL. SIGMOD Conference Indexing and Mining Streams. Christos Faloutsos 2004 Indexing and Mining Streams. SIGMOD Conference Secure XML Querying with Security Views. Wenfei Fan,Chee Yong Chan,Minos N. Garofalakis 2004 The prevalent use of XML highlights the need for a generic, flexible access-control mechanism for XML documents that supports efficient and secure query access, without revealing sensitive information to unauthorized users. This paper introduces a novel paradigm for specifying XML security constraints and investigates the enforcement of such constraints during XML query evaluation. Our approach is based on the novel concept of security views, which provide for each user group (a) an XML view consisting of all and only the information that the users are authorized to access, and (b) a view DTD that the XML view conforms to. Security views effectively protect sensitive data from access and potential inferences by unauthorized users, and provide authorized users with necessary schema information to facilitate effective query formulation and optimization. We propose an efficient algorithm for deriving security view definitions from security policies (defined on the original document DTD) for different user groups. We also develop novel algorithms for XPath query rewriting and optimization such that queries over security views can be efficiently answered without materializing the views. Our algorithms transform a query over a security view to an equivalent query over the original document, and effectively prune query nodes by exploiting the structural properties of the document DTD in conjunction with approximate XPath containment tests. Our work is the first to study a flexible, DTD-based access-control model for XML and its implications on the XML query-execution engine. 
Furthermore, it is among the first efforts for query rewriting and optimization in the presence of general DTDs for a rich class of XPath queries. An empirical study based on real-life DTDs verifies the effectiveness of our approach. SIGMOD Conference Rethinking the Conference Reviewing Process - Panel. Michael J. Franklin,Jennifer Widom,Gerhard Weikum,Philip A. Bernstein,Alon Y. Halevy,David J. DeWitt,Anastassia Ailamaki,Zachary G. Ives 2004 Rethinking the Conference Reviewing Process - Panel. SIGMOD Conference Share your data, Keep your secrets. Irini Fundulaki,Arnaud Sahuguet 2004 Share your data, Keep your secrets. SIGMOD Conference Fast Computation of Database Operations using Graphics Processors. Naga K. Govindaraju,Brandon Lloyd,Wei Wang,Ming C. Lin,Dinesh Manocha 2004 "We present new algorithms for performing fast computation of several common database operations on commodity graphics processors. Specifically, we consider operations such as conjunctive selections, aggregations, and semi-linear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single instruction and multiple data (SIMD) capabilities, and vector processing functionality of GPUs, for evaluating boolean predicate combinations and semi-linear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g. NVIDIA's GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPU-based algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective co-processor for performing database operations." SIGMOD Conference Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. Yuhan Cai,Raymond T. Ng 2004 In this paper, we attempt to approximate and index a d-dimensional (d ≥ 1) spatio-temporal trajectory with a low order continuous polynomial. There are many possible ways to choose the polynomial, including (continuous) Fourier transforms, splines, non-linear regression, etc. Some of these possibilities have indeed been studied before. We hypothesize that one of the best possibilities is the polynomial that minimizes the maximum deviation from the true value, which is called the minimax polynomial. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist. However, in general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. It has been shown, however, that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [16]. Thus, in this paper, we explore how to use the Chebyshev polynomials as a basis for approximating and indexing d-dimensional trajectories. The key analytic result of this paper is the Lower Bounding Lemma. 
That is, we show that the Euclidean distance between two d-dimensional trajectories is lower bounded by the weighted Euclidean distance between the two vectors of Chebyshev coefficients. This lemma is not trivial to show, and it ensures that indexing with Chebyshev coefficients admits no false negatives. To complement that analytic result, we conducted a comprehensive experimental evaluation with real and generated 1-dimensional to 4-dimensional data sets. We compared the proposed scheme with the Adaptive Piecewise Constant Approximation (APCA) scheme. Our preliminary results indicate that in all situations we tested, Chebyshev indexing dominates APCA in pruning power, I/O and CPU costs. SIGMOD Conference Query Processing for SQL Updates. César A. Galindo-Legaria,Stefano Stefani,Florian Waas 2004 A rich set of concepts and techniques has been developed in the context of query processing for the efficient and robust execution of queries. So far, this work has mostly focused on issues related to data-retrieval queries, with a strong backing on relational algebra. However, update operations can also exhibit a number of query processing issues, depending on the complexity of the operations and the volume of data to process. Such issues include lookup and matching of values, navigational vs. set-oriented algorithms and trade-offs between plans that do serial or random I/Os. In this paper we present an overview of the basic techniques used to support SQL DML (Data Manipulation Language) in Microsoft SQL Server. Our focus is on the integration of update operations into the query processor, the query execution primitives required to support updates, and the update-specific considerations to analyze and execute update plans. Full integration of update processing in the query processor provides a robust and flexible framework and leverages existing query processing techniques. SIGMOD Conference Transaction support for indexed views. Goetz Graefe,Michael J. Zwilling 2004 Transaction support for indexed views. SIGMOD Conference The Next Database Revolution. Jim Gray 2004 Database system architectures are undergoing revolutionary changes. Most importantly, algorithms and data are being unified by integrating programming languages with the database system. This gives an extensible object-relational system where non-procedural relational operators manipulate object sets. Coupled with this, each DBMS is now a web service. This has huge implications for how we structure applications. DBMSs are now object containers. Queues are the first objects to be added. These queues are the basis for transaction processing and workflow applications. Future workflow systems are likely to be built on this core. Data cubes and online analytic processing are now baked into most DBMSs. Beyond that, DBMSs have a framework for data mining and machine learning algorithms. Decision trees, Bayes nets, clustering, and time series analysis are built in; new algorithms can be added. There is a rebirth of column stores for sparse tables and to optimize bandwidth. Text, temporal, and spatial data access methods, along with their probabilistic reasoning, have been added to database systems. Allowing approximate and probabilistic answers is essential for many applications. Many believe that XML and XQuery will be the main data structure and access pattern. Database systems must accommodate that perspective. 
External data increasingly arrives as streams to be compared to historical data; so stream-processing operators are being added to the DBMS. Publish-subscribe systems invert the data-query ratios; incoming data is compared against millions of queries rather than queries searching millions of records. Meanwhile, disk and memory capacities are growing much faster than their bandwidth and latency, so the database systems increasingly use huge main memories and sequential disk access. These changes mandate a much more dynamic query optimization strategy - one that adapts to current conditions and selectivities rather than having a static plan. Intelligence is moving to the periphery of the network. Each disk and each sensor will be a competent database machine. Relational algebra is a convenient way to program these systems. Database systems are now expected to be self-managing, self-healing, and always-up. We researchers and developers have our work cut out for us in delivering all these features. SIGMOD Conference Web-CAM: Monitoring the dynamic Web to respond to Continual Queries. Shaveen Garg,Krithi Ramamritham,Soumen Chakrabarti 2004 Web-CAM: Monitoring the dynamic Web to respond to Continual Queries. SIGMOD Conference Query Sampling in DB2 Universal Database. Jarek Gryz,Junjie Guo,Linqi Liu,Calisto Zuzarte 2004 Executing ad hoc queries against large databases can be prohibitively expensive. Exploratory analysis of data may not require exact answers to queries, however: results based on sampling the data are often satisfactory. Supporting sampling as a primitive SQL operator turns out to be difficult because sampling does not commute with many SQL operators.In this paper, we describe an implementation in IBM® DB2® Universal Database (UDB) of a sampling operator that commutes with some SQL operators. As a result, the query with the sampling operator always returns a random sample of the answers and in many cases runs faster than it would have without such an operator. SIGMOD Conference Secure, Reliable, Transacted; Innovation in Web Services Architecture. Martin Gudgin 2004 This paper discusses the design of Web Services Protocols paying special attention to composition of such protocols. The transaction related protocols are discussed as exemplars. SIGMOD Conference "Relaxed Currency and Consistency: How to Say ""Good Enough"" in SQL." Hongfei Guo,Per-Åke Larson,Raghu Ramakrishnan,Jonathan Goldstein 2004 "Despite the widespread and growing use of asynchronous copies to improve scalability, performance and availability, this practice still lacks a firm semantic foundation. Applications are written with some understanding of which queries can use data that is not entirely current and which copies are ""good enough""; however, there are neither explicit requirements nor guarantees. We propose to make this knowledge available to the DBMS through explicit currency and consistency (C&C) constraints in queries and develop techniques so the DBMS can guarantee that the constraints are satisfied. In this paper we describe our model for expressing C&C constraints, define their semantics, and propose SQL syntax. We explain how C&C constraints are enforced in MTCache, our prototype mid-tier database cache, including how constraints and replica update policies are elegantly integrated into the cost-based query optimizer. 
Consistency constraints are enforced at compile time while currency constraints are enforced at run time by dynamic plans that check the currency of each local replica before use and select sub-plans accordingly. This approach makes optimal use of the cache DBMS while at the same time guaranteeing that applications always get data that is ""good enough"" for their purpose." SIGMOD Conference Support for Relaxed Currency and Consistency Constraints in MTCache. Hongfei Guo,Per-Åke Larson,Raghu Ramakrishnan,Jonathan Goldstein 2004 Support for Relaxed Currency and Consistency Constraints in MTCache. SIGMOD Conference Data Densification in a Relational Database System. Abhinav Gupta,Sankar Subramanian,Srikanth Bellamkonda,Tolga Bozkaya,Nathan Folkert,Lei Sheng,Andrew Witkowski 2004 "Data in a relational data warehouse is usually sparse. That is, if no value exists for a given combination of dimension values, no row exists in the fact table. Densities of 0.1-2% are very common. However, users may want to view the data in a dense form, with rows for all combinations of dimension values displayed even when no fact data exists for them. For example, if a product did not sell during a particular time period, users may still want to see the product for that time period with zero sales value next to it. Moreover, analytic window functions [1] and the SQL model clause [2] can more easily express time series calculations if data is dense along the time dimension because dense data will fill a consistent number of rows for each period. Data densification is the process of converting sparse data into dense form. The current SQL technique for densification (using the combination of DISTINCT, CROSS JOIN and OUTER JOIN operations) is extremely unintuitive, difficult to express and inefficient to compute. Hence, we propose an extension to the ANSI SQL join operator, referred to as ""PARTITIONED OUTER JOIN"", which allows for a succinct expression of densification along the dimensions of interest. We also present various algorithms to evaluate the new join operator efficiently and compare it with existing methods of doing the equivalent operation. We also define a new window function ""LAST_VALUE (IGNORE NULLS)"" which is very useful with partitioned outer join." SIGMOD Conference A Bi-Level Bernoulli Scheme for Database Sampling. Peter J. Haas,Christian Koenig 2004 "Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision---the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the ""bang-bang"" type: we identify a ""page-heterogeneity index"" (PHI) such that optimal sampling is as ""row-like"" as possible if the PHI is less than 1 and as ""page-like"" as possible otherwise. 
The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed." SIGMOD Conference Requirements and Policy Challenges in Highly Secure Environments. Dean E. Hall 2004 Requirements and Policy Challenges in Highly Secure Environments. SIGMOD Conference Knocking the Door to the Deep Web: Integration of Web Query Interfaces. Bin He,Zhen Zhang,Kevin Chen-Chuan Chang 2004 Knocking the Door to the Deep Web: Integration of Web Query Interfaces. SIGMOD Conference ITQS: An Integrated Transport Query System. Bo Huang,Zhiyong Huang,Dan Lin,Hua Lu,Yaxiao Song,Hongga Li 2004 ITQS: An Integrated Transport Query System. SIGMOD Conference Tools for Design of Composite Web Services. Richard Hull,Jianwen Su 2004 Tools for Design of Composite Web Services. SIGMOD Conference TOSS: An Extension of TAX with Ontologies and Similarity Queries. Edward Hung,Yu Deng,V. S. Subrahmanian 2004 "TAX is perhaps the best known extension of the relational algebra to handle queries to XML databases. One problem with TAX (as with many existing relational DBMSs) is that the semantics of terms in a TAX DB are not taken into account when answering queries. Thus, even though TAX answers queries with 100% precision, the recall of TAX is relatively low. Our TOSS system improves the recall of TAX via the concept of a similarity enhanced ontology (SEO). Intuitively, an ontology is a set of graphs describing relationships (such as isa, partof, etc.) between terms in a DB. An SEO also evaluates how similarities between terms (e.g. ""J. Ullman"", ""Jeff Ullman"", and ""Jeffrey Ullman"") affect ontologies. Finally, we show how the algebra proposed in TAX can be extended to take SEOs into account. The result is a system that provides a much higher answer quality than TAX does alone (quality is defined as the square root of the product of precision and recall). We experimentally evaluate the TOSS system on the DBLP and SIGMOD bibliographic databases and show that TOSS has acceptable performance." SIGMOD Conference P2P-DIET: An Extensible P2P Service that Unifies Ad-hoc and Continuous Querying in Super-Peer Networks. Stratos Idreos,Manolis Koubarakis,Christos Tryfonopoulos 2004 P2P-DIET: An Extensible P2P Service that Unifies Ad-hoc and Continuous Querying in Super-Peer Networks. SIGMOD Conference CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. Ihab F. Ilyas,Volker Markl,Peter J. Haas,Paul Brown,Ashraf Aboulnaga 2004 "The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers---which usually assume that columns are statistically independent---to underestimate the selectivities of conjunctive predicates by orders of magnitude. 
We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations by systematically enumerating candidate pairs and simultaneously pruning unpromising candidates using a flexible set of heuristics. A robust chi-squared analysis is applied to a sample of column values in order to identify correlations, and the number of distinct values in the sampled columns is analyzed to detect soft functional dependencies. CORDS can be used as a data mining tool, producing dependency graphs that are of intrinsic interest. We focus primarily on the use of CORDS in query optimization. Specifically, CORDS recommends groups of columns on which to maintain certain simple joint statistics. These ""column-group"" statistics are then used by the optimizer to avoid naive selectivity estimates based on inappropriate independence assumptions. This approach, because of its simplicity and judicious use of sampling, is relatively easy to implement in existing commercial systems, has very low overhead, and scales well to the large numbers of columns and large table sizes found in real-world databases. Experiments with a prototype implementation show that the use of CORDS in query optimization can speed up query execution times by an order of magnitude. CORDS can be used in tandem with query feedback systems such as the LEO learning optimizer, leveraging the infrastructure of such systems to correct bad selectivity estimates and ameliorating the poor performance of feedback systems during slow learning phases." SIGMOD Conference Rank-aware Query Optimization. Ihab F. Ilyas,Rahul Shah,Walid G. Aref,Jeffrey Scott Vitter,Ahmed K. Elmagarmid 2004 Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization.We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model. SIGMOD Conference When one Sample is not Enough: Improving Text Database Selection Using Shrinkage. Panagiotis G. Ipeirotis,Luis Gravano 2004 "Database selection is an important step when searching over large numbers of distributed text databases. 
The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of ""shrinkage"" -a form of smoothing that has been used successfully for document classification-to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their ""unshrunk"" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide -at run-time-whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated ""relevance judgments,"" show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well." SIGMOD Conference Adapting to Source Properties in Processing Data Integration Queries. Zachary G. Ives,Alon Y. Halevy,Daniel S. Weld 2004 "An effective query optimizer finds a query plan that exploits the characteristics of the source data. In data integration, little is known in advance about sources' properties, which necessitates the use of adaptive query processing techniques to adjust query processing on-the-fly. Prior work in adaptive query processing has focused on compensating for delays and adjusting for mis-estimated cardinality or selectivity values. In this paper, we present a generalized architecture for adaptive query processing and introduce a new technique, called adaptive data partitioning (ADP), which is based on the idea of dividing the source data into regions, each executed by different, complementary plans. We show how this model can be applied in novel ways to not only correct for underestimated selectivity and cardinality values, but also to discover and exploit order in the source data, and to detect and exploit source data that can be effectively pre-aggregated. We experimentally compare a number of alternative strategies and show that our approach is effective." SIGMOD Conference "Colorful XML: One Hierarchy Isn't Enough." H. V. Jagadish,Laks V. S. Lakshmanan,Monica Scannapieco,Divesh Srivastava,Nuwee Wiwatwattana 2004 XML has a tree-structured data model, which is used to uniformly represent structured as well as semi-structured data, and also enable concise query specification in XQuery, via the use of its XPath (twig) patterns. This in turn can leverage the recently developed technology of structural join algorithms to evaluate the query efficiently. 
In this paper, we identify a fundamental tension in XML data modeling: (i) data represented as deep trees (which can make effective use of twig patterns) are often un-normalized, leading to update anomalies, while (ii) normalized data tends to be shallow, resulting in heavy use of expensive value-based joins in queries.Our solution to this data modeling problem is a novel multi-colored trees (MCT) logical data model, which is an evolutionary extension of the XML data model, and permits trees with multi-colored nodes to signify their participation in multiple hierarchies. This adds significant semantic structure to individual data nodes. We extend XQuery expressions to navigate between structurally related nodes, taking color into account, and also to create new colored trees as restructurings of an MCT database. While MCT serves as a significant evolutionary extension to XML as a logical data model, one of the key roles of XML is for information exchange. To enable exchange of MCT information, we develop algorithms for optimally serializing an MCT database as XML. We discuss alternative physical representations for MCT databases, using relational and native XML databases, and describe an implementation on top of the Timber native XML database. Experimental evaluation, using our prototype implementation, shows that not only are MCT queries/updates more succinct and easier to express than equivalent shallow tree XML queries, but they can also be significantly more efficient to evaluate than equivalent deep and shallow tree XML queries/updates. SIGMOD Conference Adaptive Stream Resource Management Using Kalman Filters. Ankur Jain,Edward Y. Chang,Yuan-Fang Wang 2004 "To answer user queries efficiently, a stream management system must handle continuous, high-volume, possibly noisy, and time-varying data streams. One major research area in stream management seeks to allocate resources (such as network bandwidth and memory) to query plans, either to minimize resource usage under a precision requirement, or to maximize precision of results under resource constraints. To date, many solutions have been proposed; however, most solutions are ad hoc with hard-coded heuristics to generate query plans. In contrast, we perceive stream resource management as fundamentally a filtering problem, in which the objective is to filter out as much data as possible to conserve resources, provided that the precision standards can be met. We select the Kalman Filter as a general and adaptive filtering solution for conserving resources. The Kalman Filter has the ability to adapt to various stream characteristics, sensor noise, and time variance. Furthermore, we realize a significant performance boost by switching from traditional methods of caching static data (which can soon become stale) to our method of caching dynamic procedures that can predict data reliably at the server without the clients' involvement. In this work we focus on minimization of communication overhead for both synthetic and real-world streams. Through examples and empirical studies, we demonstrate the flexibility and effectiveness of using the Kalman Filter as a solution for managing trade-offs between precision of results and resources in satisfying stream queries." SIGMOD Conference Online Maintenance of Very Large Random Samples. Chris Jermaine,Abhijit Pol,Subramanian Arumugam 2004 "Random sampling is one of the most fundamental data management tools available. 
However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a ""sample"" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling." SIGMOD Conference Efficient Processing of Twig Queries with OR-Predicates. Haifeng Jiang,Hongjun Lu,Wei Wang 2004 Efficient Processing of Twig Queries with OR-Predicates. SIGMOD Conference Holistic UDAFs at streaming speeds. Graham Cormode,Theodore Johnson,Flip Korn,S. Muthukrishnan,Oliver Spatscheck,Divesh Srivastava 2004 "Many algorithms have been proposed to approximate holistic aggregates, such as quantiles and heavy hitters, over data streams. However, little work has been done to explore what techniques are required to incorporate these algorithms in a data stream query processor, and to make them useful in practice.In this paper, we study the performance implications of using user-defined aggregate functions (UDAFs) to incorporate selection-based and sketch-based algorithms for holistic aggregates into a data stream management system's query processing architecture. We identify key performance bottlenecks and tradeoffs, and propose novel techniques to make these holistic UDAFs fast and space-efficient for use in high-speed data stream applications. We evaluate performance using generated and actual IP packet data, focusing on approximating quantiles and heavy hitters. The best of our current implementations can process streaming queries at OC48 speeds (2x 2.4Gbps)." SIGMOD Conference On the Integration of Structure Indexes and Inverted Lists. Raghav Kaushik,Rajasekar Krishnamurthy,Jeffrey F. Naughton,Raghu Ramakrishnan 2004 Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over the Niagara XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. 
By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance optimal algorithms to push down top k computation. SIGMOD Conference SoundCompass: A Practical Query-by-Humming System. Naoko Kosugi,Yasushi Sakurai,Masashi Morimoto 2004 SoundCompass: A Practical Query-by-Humming System. SIGMOD Conference PIPES - A Public Infrastructure for Processing and Exploring Streams. Jürgen Krämer,Bernhard Seeger 2004 "PIPES is a flexible and extensible infrastructure providing fundamental building blocks to implement a data stream management system (DSMS). It is seamlessly integrated into the Java library XXL [1, 2, 3] for advanced query processing and extends XXL's scope towards continuous data-driven query processing over autonomous data sources." SIGMOD Conference LexEQUAL: Multilexical Matching Operator in SQL. A. Kumaran,Jayant R. Haritsa 2004 LexEQUAL: Multilexical Matching Operator in SQL. SIGMOD Conference iMAP: Discovering Complex Mappings between Database Schemas. Robin Dhamankar,Yoonkyong Lee,AnHai Doan,Alon Y. Halevy,Pedro Domingos 2004 iMAP: Discovering Complex Mappings between Database Schemas. SIGMOD Conference Using the Structure of Web Sites for Automatic Segmentation of Tables. Kristina Lerman,Lise Getoor,Steven Minton,Craig A. Knoblock 2004 "Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites." SIGMOD Conference Fast Algorithms for Time Series with applications to Finance, Physics, Music, Biology, and other Suspects. Alberto Lerner,Dennis Shasha,Zhihua Wang,Xiaojian Zhao,Yunyue Zhu 2004 "Financial time series streams are watched closely by millions of traders. What exactly do they look for and how can we help them do it faster? Physicists study the time series emerging from their sensors. The same question holds for them. Musicians produce time series. Consumers may want to compare them. This tutorial presents techniques and case studies for four problems: 1.
Finding sliding window correlations in financial, physical, and other applications. 2. Discovering bursts in large sensor data of gamma rays. 3. Matching hums to recorded music, even when people don't hum well. 4. Maintaining and manipulating time-ordered data in a database setting. This tutorial draws mostly from the book High Performance Discovery in Time Series: techniques and case studies, Springer-Verlag 2004. You can find the PowerPoint slides for this tutorial at http://cs.nyu.edu/cs/faculty/shasha/papers/sigmod04.ppt. The tutorial is aimed at researchers in streams, data mining, and scientific computing. Its applications should interest anyone who works with scientists or financial ""quants."" The emphasis will be on recent results and open problems. This is a ripe area for further advance." SIGMOD Conference Toward a Progress Indicator for Database Queries. Gang Luo,Jeffrey F. Naughton,Curt J. Ellmann,Michael Watzke 2004 Many modern software systems provide progress indicators for long-running tasks. These progress indicators make systems more user-friendly by helping the user quickly estimate how much of the task has been completed and when the task will finish. However, none of the existing commercial RDBMSs provides a non-trivial progress indicator for long-running queries. In this paper, we consider the problem of supporting such progress indicators. After discussing the goals and challenges inherent in this problem, we present a set of techniques sufficient for implementing a simple yet useful progress indicator for a large subset of RDBMS queries. We report an initial implementation of these techniques in PostgreSQL. SIGMOD Conference Models for Web Services Transactions. Mark C. Little 2004 Models for Web Services Transactions. SIGMOD Conference Robust Query Processing through Progressive Optimization. Volker Markl,Vijayshankar Raman,David E. Simmen,Guy M. Lohman,Hamid Pirahesh 2004 Robust Query Processing through Progressive Optimization. SIGMOD Conference The Role of Cryptography in Database Security. Ueli M. Maurer 2004 In traditional database security research, the database is usually assumed to be trustworthy. Under this assumption, the goal is to achieve security against external attacks (e.g. from hackers) and possibly also against users trying to obtain information beyond their privileges, for instance by some type of statistical inference. However, for many database applications such as health information systems there exist conflicting interests of the database owner and the users or organizations interacting with the database, and also between the users. Therefore the database cannot necessarily be assumed to be fully trusted. In this extended abstract we address the problem of defining and achieving security in a context where the database is not fully trusted, i.e., when the users must be protected against a potentially malicious database. Moreover, we address the problem of the secure aggregation of databases owned by mutually mistrusting organisations, for example by competing companies. SIGMOD Conference CAMAS: A Citizen Awareness System for Crisis Mitigation. Sharad Mehrotra,Carter Butts,Dmitri V. Kalashnikov,Nalini Venkatasubramanian,Kemal Altintas,Ramaswamy Hariharan,Haimin Lee,Yiming Ma,Amnon Meyers,Jehan Wickramasuriya,Ron Eguchi,Charles Huyck 2004 CAMAS: A Citizen Awareness System for Crisis Mitigation. SIGMOD Conference XSeq: An Index Infrastructure for Tree Pattern Queries.
Xiaofeng Meng,Yu Jiang,Yan Chen,Haixun Wang 2004 Given a tree-pattern query, most XML indexing approaches decompose it into multiple sub-queries, and then join their results to provide the answer to the original query. Join operations have been identified as the most time-consuming component in XML query processing. XSeq is a powerful XML indexing infrastructure which makes tree patterns a first class citizen in XML query processing. Unlike most indexing methods that directly manipulate tree structures, XSeq builds its indexing infrastructure on a much simpler data model: sequences. That is, we represent both XML data and XML queries by structure-encoded sequences. We have shown that this new data representation preserves query equivalence, and more importantly, through subsequence matching, structured queries can be answered directly without resorting to expensive join operations. Moreover, the XSeq infrastructure unifies indices on both the content and the structure of XML documents, hence it achieves an additional performance advantage over methods indexing either just content or structure, or indexing them separately. SIGMOD Conference Building Dynamic Application Networks with Web Services. Matthew Mihic 2004 "Looking at the state of the industry today, it is clear that we are in the early stages of Web Services development. Companies are still evaluating the technology and considering how to apply it to their business. But over the past year, we seem to have reached an inflection point of companies building real systems based on Web Services. Partly this reflects an acceptance that the basic Web Services technologies - XML Schema [1][2], SOAP [3], WSDL [4] - have matured to the point where they can be used for mission critical applications. But it also reflects a growing understanding that Web Services enable a large class of systems that were previously very difficult to build. These systems are characterized by several critical properties: 1. Rapid rates of change. The time is long past when companies could afford a year-long effort to build out a new application. Businesses move at a faster pace today than ever before, and they are increasingly under pressure to do more work with fewer resources. This places a premium on the ability to build applications by quickly composing pre-existing services. The result is that systems are being connected in ways that were never imagined during development. This is reuse in the large - not just small services, but entire applications being linked together to solve a complex business function. 2. Significant availability and scalability requirements. Many of these systems are ""bet-your-business"" types of applications. They have heavy scalability and availability requirements. Often they need to connect multiple partners and service hundreds of thousands of updates in a day, without ever suffering an interruption in service. 3. Heterogeneous development tools and software platforms. Each of these applications typically involves components built using a wildly diverse set of tools, operating systems, and software platforms. Partly this is a result of building systems out of existing components - many of these components are locked into certain environments, and there are no resources to rewrite or migrate to a single homogeneous platform. But it is also recognition that different problems are best solved by different toolsets.
Some problems are best solved by writing code on an application server, others are best suited for scripting, and still others are solved by customizing an existing enterprise application. Heterogeneity is not going away. It is only increasing. 4. Multiple domains of administrative control. An aspect of heterogeneity that is often overlooked is distributed ownership. As businesses merge, acquire, and partner with other companies, there is an increasing need to build applications that span organizational boundaries. These characteristics present a unique set of challenges to the way we think about developing, describing, connecting, and configuring applications. The challenges require us to develop new ways of looking at what it takes to build an application, and what makes up a network. In this session, we examine the nature of this next generation of application, and discuss the way in which Web Services are evolving to meet their needs. The session focuses on the development techniques that allow services to be easily and dynamically composed into rich applications, and considers the capabilities required of the underlying network fabric. The session concludes with an in-depth look at some of the critical Web Services specifications actively under development by industry leaders." SIGMOD Conference A Formal Analysis of Information Disclosure in Data Exchange. Gerome Miklau,Dan Suciu 2004 "We perform a theoretical study of the following query-view security problem: given a view V to be published, does V logically disclose information about a confidential query S? The problem is motivated by the need to manage the risk of unintended information disclosure in today's world of universal data exchange. We present a novel information-theoretic standard for query-view security. This criterion can be used to provide a precise analysis of information disclosure for a host of data exchange scenarios, including multi-party collusion and the use of outside knowledge by an adversary trying to learn privileged facts about the database. We prove a number of theoretical results for deciding security according to this standard. We also generalize our security criterion to account for prior knowledge a user or adversary may possess, and introduce techniques for measuring the magnitude of partial disclosures. We believe these results can be a foundation for practical efforts to secure data exchange frameworks, and also illuminate a nice interaction between logic and probability theory." SIGMOD Conference SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases. Mohamed F. Mokbel,Xiaopeng Xiong,Walid G. Aref 2004 This paper introduces the Scalable INcremental hash-based Algorithm (SINA, for short), a new algorithm for evaluating a set of concurrent continuous spatio-temporal queries. SINA is designed with two goals in mind: (1) Scalability in terms of the number of concurrent continuous spatio-temporal queries, and (2) Incremental evaluation of continuous spatio-temporal queries. SINA achieves scalability by employing a shared execution paradigm where the execution of continuous spatio-temporal queries is abstracted as a spatial join between a set of moving objects and a set of moving queries. Incremental evaluation is achieved by computing only the updates of the previously reported answer. We introduce two types of updates, namely positive and negative updates.
Positive or negative updates indicate that a certain object should be added to or removed from the previously reported answer, respectively. SINA manages the computation of positive and negative updates via three phases: the hashing phase, the invalidation phase, and the joining phase. The hashing phase employs an in-memory hash-based join algorithm that results in a set of positive updates. The invalidation phase is triggered every T seconds or when the memory is fully occupied to produce a set of negative updates. Finally, the joining phase is triggered by the end of the invalidation phase to produce a set of both positive and negative updates that result from joining in-memory data with in-disk data. Experimental results show that SINA is scalable and is more efficient than other index-based spatio-temporal algorithms. SIGMOD Conference Information Assurance Technology Challenges. Nicholas J. Multari 2004 Information Assurance Technology Challenges. SIGMOD Conference Incremental and Effective Data Summarization for Dynamic Hierarchical Clustering. Samer Nassar,Jörg Sander,Corrine Cheng 2004 Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization. SIGMOD Conference ORDPATHs: Insert-Friendly XML Node Labels. "Patrick E. O'Neil,Elizabeth J. O'Neil,Shankar Pal,Istvan Cseri,Gideon Schaller,Nigel Westbury" 2004 "We introduce a hierarchical labeling scheme called ORDPATH that is implemented in the upcoming version of Microsoft® SQL Server™. ORDPATH labels nodes of an XML tree without requiring a schema (the most general case---a schema simplifies the problem). An example of an ORDPATH value display format is ""1.5.3.9.1"". A compressed binary representation of ORDPATH provides document order by simple byte-by-byte comparison and ancestry relationship equally simply. In addition, the ORDPATH scheme supports insertion of new nodes at arbitrary positions in the XML tree, their ORDPATH values ""careted in"" between ORDPATHs of sibling nodes, without relabeling any old nodes."
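Illustrative note on the ORDPATH abstract above: the following is a minimal Python sketch, not taken from the paper, and omitting its compressed binary encoding and even-numbered caret components. It only shows how dotted-path labels such as ""1.5.3.9.1"" can give document order, ancestry tests, and insertion between siblings without relabeling; all function names are hypothetical.

def parse(label):
    # "1.5.3" -> (1, 5, 3); component-wise tuple comparison gives document order
    return tuple(int(c) for c in label.split("."))

def precedes_in_doc_order(a, b):
    # pre-order: an ancestor (prefix) compares smaller than its descendants
    return parse(a) < parse(b)

def is_ancestor(a, b):
    # an ancestor's components are a proper prefix of the descendant's
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa

def caret_between(left, right):
    # label for a new sibling inserted between two adjacent siblings,
    # without touching any existing label (toy stand-in for ORDPATH careting)
    pl, pr = parse(left), parse(right)
    assert pl[:-1] == pr[:-1] and pl < pr
    return ".".join(map(str, pl[:-1] + (pl[-1] + 1, 1)))

# Example: precedes_in_doc_order("1.5.3", "1.5.3.9.1") -> True (parent first)
#          is_ancestor("1.5", "1.5.3.9.1")             -> True
#          caret_between("1.5.3", "1.5.5")             -> "1.5.4.1"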
SIGMOD Conference Vertical and Horizontal Percentage Aggregations. Carlos Ordonez 2004 Existing SQL aggregate functions present important limitations to compute percentages. This article proposes two SQL aggregate functions to compute percentages addressing such limitations. The first function returns one row for each percentage in vertical form like standard SQL aggregations. The second function returns each set of percentages adding 100% on the same row in horizontal form. These novel aggregate functions are used as a framework to introduce the concept of percentage queries and to generate efficient SQL code. Experiments study different percentage query optimization strategies and compare evaluation time of percentage queries taking advantage of our proposed aggregations against queries using available OLAP extensions. The proposed percentage aggregations are easy to use, have wide applicability and can be efficiently evaluated. SIGMOD Conference Approximate XML Query Answers. Neoklis Polyzotis,Minos N. Garofalakis,Yannis E. Ioannidis 2004 The rapid adoption of XML as the standard for data representation and exchange foreshadows a massive increase in the amounts of XML data collected, maintained, and queried over the Internet or in large corporate data-stores. Inevitably, this will result in the development of on-line decision support systems, where users and analysts interactively explore large XML data sets through a declarative query interface (e.g., XQuery or XSLT). Given the importance of remaining interactive, such on-line systems can employ approximate query answers as an effective mechanism for reducing response time and providing users with early feedback. This approach has been successfully used in relational systems and it becomes even more compelling in the XML world, where the evaluation of complex queries over massive tree-structured data is inherently more expensive.In this paper, we initiate a study of approximate query answering techniques for large XML databases. Our approach is based on a novel, conceptually simple, yet very effective XML-summarization mechanism: TREESKETCH synopses. We demonstrate that, unlike earlier techniques focusing solely on selectivity estimation, our TREESKETCH synopses are much more effective in capturing the complete tree structure of the underlying XML database. We propose novel construction algorithms for building TREESKETCH summaries of limited size, and describe schemes for processing general XML twig queries over a concise TREESKETCH in order to produce very fast, approximate tree-structured query answers. To quantify the quality of such approximate answers, we propose a novel, intuitive error metric that captures the quality of the approximation in terms of both the overall structure of the XML tree and the distribution of document edges. Experimental results on real-life and synthetic data sets verify the effectiveness of our TREESKETCH synopses in producing fast, accurate approximate answers and demonstrate their benefits over previously proposed techniques that focus solely on selectivity estimation. In particular, TREESKETCHes yield faster, more accurate approximate answers and selectivity estimates, and are more efficient to construct. To the best of our knowledge, ours is the first work to address the timely problem of producing fast, approximate tree-structured answers for complex XML queries. SIGMOD Conference Constraint-Based XML Query Rewriting For Data Integration. 
Cong Yu,Lucian Popa 2004 We study the problem of answering queries through a target schema, given a set of mappings between one or more source schemas and this target schema, and given that the data is at the sources. The schemas can be any combination of relational or XML schemas, and can be independently designed. In addition to the source-to-target mappings, we consider as part of the mapping scenario a set of target constraints specifying additional properties on the target schema. This becomes particularly important when integrating data from multiple data sources with overlapping data and when such constraints can express data merging rules at the target. We define the semantics of query answering in such an integration scenario, and design two novel algorithms, basic query rewrite and query resolution, to implement the semantics. The basic query rewrite algorithm reformulates target queries in terms of the source schemas, based on the mappings. The query resolution algorithm generates additional rewritings that merge related information from multiple sources and assemble a coherent view of the data, by incorporating target constraints. The algorithms are implemented and then evaluated using a comprehensive set of experiments based on both synthetic and real-life data integration scenarios. SIGMOD Conference Tree Logical Classes for Efficient Evaluation of XQuery. Stelios Paparizos,Yuqing Wu,Laks V. S. Lakshmanan,H. V. Jagadish 2004 XML is widely praised for its flexibility in allowing repeated and missing sub-elements. However, this flexibility makes it challenging to develop a bulk algebra, which typically manipulates sets of objects with identical structure. A set of XML elements, say of type book, may have members that vary greatly in structure, e.g. in the number of author sub-elements. This kind of heterogeneity may permeate the entire document in a recursive fashion: e.g., different authors of the same or different book may in turn greatly vary in structure. Even when the document conforms to a schema, the flexible nature of schemas for XML still allows such significant variations in structure among elements in a collection. Bulk processing of such heterogeneous sets is problematic.In this paper, we introduce the notion of logical classes (LC) of pattern tree nodes, and generalize the notion of pattern tree matching to handle node logical classes. This abstraction pays off significantly in allowing us to reason with an inherently heterogeneous collection of elements in a uniform, homogeneous way. Based on this, we define a Tree Logical Class (TLC) algebra that is capable of handling the heterogeneity arising in XML query processing, while avoiding redundant work. We present an algorithm to obtain a TLC algebra expression from an XQuery statement (for a large fragment of XQuery). We show how to implement the TLC algebra efficiently, introducing the nest-join as an important physical operator for XML query processing. We show that evaluation plans generated using the TLC algebra not only are simpler but also perform better than those generated by competing approaches. TLC is the algebra used in the Timber [8] system developed at the University of Michigan. SIGMOD Conference STRIPES: An Efficient Index for Predicted Trajectories. Jignesh M. Patel,Yun Chen,V. Prasad Chakka 2004 Moving object databases are required to support queries on a large number of continuously moving objects. 
A key requirement for indexing methods in this domain is to efficiently support both update and query operations. Previous work on indexing such databases can be broadly divided into two categories: indexing the past positions and indexing the future predicted positions. In this paper we focus on an efficient indexing method for indexing the future positions of moving objects. In this paper we propose an indexing method, called STRIPES, which indexes predicted trajectories in a dual transformed space. Trajectories for objects in d-dimensional space become points in a higher-dimensional 2d-space. This dual transformed space is then indexed using a regular hierarchical grid decomposition indexing structure. STRIPES can evaluate a range of queries including time-slice, window, and moving queries. We have carried out an extensive experimental evaluation comparing the performance of STRIPES with the best known existing predicted trajectory index (the TPR*-tree), and show that our approach is significantly faster than the TPR*-tree for both updates and search queries. SIGMOD Conference Canonical Abstraction for Outerjoin Optimization. Jun Rao,Hamid Pirahesh,Calisto Zuzarte 2004 "Outerjoins are an important class of joins and are widely used in various kinds of applications. It is challenging to optimize queries that contain outerjoins because outerjoins do not always commute with inner joins. Previous work has studied this problem and provided techniques that allow certain reordering of the join sequences. However, the optimization of outerjoin queries is still not as powerful as that of inner joins. An inner join query can always be canonically represented as a sequence of Cartesian products of all relations, followed by a sequence of selection operations, each applying a conjunct in the join predicates. This canonical abstraction is very powerful because it enables the optimizer to use any join sequence for plan generation. Unfortunately, such a canonical abstraction for outerjoin queries has not been developed. As a result, existing techniques always exclude certain join sequences from planning, which can lead to a severe performance penalty. Given a query consisting of a sequence of inner and outer joins, we, for the first time, present a canonical abstraction based on three operations: outer Cartesian products, nullification, and best match. Like the inner join abstraction, our outerjoin abstraction permits all join sequences, and preserves the property of both commutativity and transitivity among predicates. This allows us to generate plans that are very desirable for performance reasons but that couldn't be done before. We present an algorithm that produces such a canonical abstraction, and a method that extends an inner-join optimizer to generate plans in an expanded search space. We also describe an efficient implementation of the best match operation using the OLAP functionalities in SQL:1999. Our experimental results show that our technique can significantly improve the performance of outerjoin queries." SIGMOD Conference FAÇADE: A Fast and Effective Approach to the Discovery of Dense Clusters in Noisy Spatial Data. Yu Qian,Gang Zhang,Kang Zhang 2004 FAÇADE: A Fast and Effective Approach to the Discovery of Dense Clusters in Noisy Spatial Data. SIGMOD Conference Extending Query Rewriting Techniques for Fine-Grained Access Control. Shariq Rizvi,Alberto O. Mendelzon,S.
Sudarshan,Prasan Roy 2004 "Current day database applications, with large numbers of users, require fine-grained access control mechanisms, at the level of individual tuples, not just entire relations/views, to control which parts of the data can be accessed by each user. Fine-grained access control is often enforced in the application code, which has numerous drawbacks; these can be avoided by specifying/enforcing access control at the database level. We present a novel fine-grained access control model based on authorization views that allows ""authorization-transparent"" querying; that is, user queries can be phrased in terms of the database relations, and are valid if they can be answered using only the information contained in these authorization views. We extend earlier work on authorization-transparent querying by introducing a new notion of validity, conditional validity. We give a powerful set of inference rules to check for query validity. We demonstrate the practicality of our techniques by describing how an existing query optimizer can be extended to perform access control checks by incorporating these inference rules." SIGMOD Conference Security of Shared Data in Large Systems: State of the Art and Research Directions. Arnon Rosenthal,Marianne Winslett 2004 "The goals of this tutorial are to enlighten the VLDB research community about the state of the art in data security, especially for enterprise or larger systems, and to engage the community's interest in improving the state of the art. The tutorial includes numerous suggested topics for research and development projects in data security." SIGMOD Conference Efficient set joins on similarity predicates. Sunita Sarawagi,Alok Kirpal 2004 In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets. SIGMOD Conference DataMIME™. Masum Serazi,Amal Perera,Qiang Ding,Vasiliy Malakhov,Imad Rahal,Fei Pan,Dongmei Ren,Weihua Wu,William Perrizo 2004 DataMIME™. SIGMOD Conference Highly-Available, Fault-Tolerant, Parallel Dataflows. Mehul A. Shah,Joseph M. Hellerstein,Eric A. Brewer 2004 We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to implement a variety of continuous query (CQ) applications that require high-throughput, 24x7 operation. Examples include network monitoring, phone call processing, click-stream processing, and online financial analysis. Our main contribution is a scheme that carefully integrates traditional query processing techniques for partitioned parallelism with the process-pairs approach for high availability. This delicate integration allows us to tolerate failures of portions of a parallel dataflow without sacrificing result quality. 
Upon failure, our technique provides quick fail-over, and automatically recovers the lost pieces on the fly. This piecemeal recovery provides minimal disruption to the ongoing dataflow computation and improved reliability as compared to the straightforward application of the process-pairs technique on a per dataflow basis. Thus, our technique provides the high availability necessary for critical CQ applications. Our techniques are encapsulated in a reusable dataflow operator called Flux, an extension of the Exchange that is used to compose parallel dataflows. Encapsulating the fault-tolerance logic into Flux minimizes modifications to existing operator code and relieves the burden on the operator writer of repeatedly implementing and verifying this critical logic. We present experiments illustrating these features with an implementation of Flux in the TelegraphCQ code base [8]. SIGMOD Conference Prediction and Indexing of Moving Objects with Unknown Motion Patterns. Yufei Tao,Christos Faloutsos,Dimitris Papadias,Bin Liu 2004 Existing methods for prediction in spatio-temporal databases assume that objects move according to linear functions. This severely limits their applicability, since in practice movement is more complex, and individual objects may follow drastically different motion patterns. In order to overcome these problems, we first introduce a general framework for monitoring and indexing moving objects, where (i) each object computes individually the function that accurately captures its movement and (ii) a server indexes the object locations at a coarse level and processes queries using a filter-refinement mechanism. Our second contribution is a novel recursive motion function that supports a broad class of non-linear motion patterns. The function does not presume any a-priori movement but can postulate the particular motion of each object by examining its locations at recent timestamps. Finally, we propose an efficient indexing scheme that facilitates the processing of predictive queries without false misses. SIGMOD Conference Efficient Query Reformulation in Peer-Data Management Systems. Igor Tatarinov,Alon Y. Halevy 2004 "Peer data management systems (PDMS) offer a flexible architecture for decentralized data sharing. In a PDMS, every peer is associated with a schema that represents the peer's domain of interest, and semantic relationships between peers are provided locally between pairs (or small sets) of peers. By traversing semantic paths of mappings, a query over one peer can obtain relevant data from any reachable peer in the network. Semantic paths are traversed by reformulating queries at a peer into queries on its neighbors. Naively following semantic paths is highly inefficient in practice. We describe several techniques for optimizing the reformulation process in a PDMS and validate their effectiveness using real-life data sets. In particular, we develop techniques for pruning paths in the reformulation process and for minimizing the reformulated queries as they are created. In addition, we consider the effect of the strategy we use to search through the space of reformulations. Finally, we show that pre-computing semantic paths in a PDMS can greatly improve the efficiency of the reformulation process. Together, all of these techniques form a basis for scalable query reformulation in PDMS. To enable our optimizations, we developed practical algorithms, of independent interest, for checking containment and minimization of XML queries, and for composing XML mappings."
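Illustrative note on the peer-data-management reformulation abstract above: a toy Python sketch, not Piazza's actual algorithm, of traversing semantic paths breadth-first from the queried peer and pruning rewritings that become empty. The mapping representation (one rewrite function per directed peer pair) is a hypothetical stand-in for real schema mappings.

from collections import deque

def reformulate(start_peer, query, mappings):
    # mappings: {(src_peer, dst_peer): rewrite_fn}; rewrite_fn returns the
    # reformulated query for dst, or None if the rewriting is empty (pruned).
    reformulations = {start_peer: query}
    frontier = deque([start_peer])
    while frontier:
        peer = frontier.popleft()
        for (src, dst), rewrite in mappings.items():
            if src != peer or dst in reformulations:
                continue
            q2 = rewrite(reformulations[peer])
            if q2 is None:          # prune this semantic path
                continue
            reformulations[dst] = q2
            frontier.append(dst)
    return reformulations           # a rewriting for every reachable peer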
SIGMOD Conference Implementing a Scalable XML Publish/Subscribe System Using a Relational Database System. Feng Tian,Berthold Reinwald,Hamid Pirahesh,Tobias Mayr,Jussi Myllymaki 2004 Implementing a Scalable XML Publish/Subscribe System Using a Relational Database System. SIGMOD Conference Identifying Similarities, Periodicities and Bursts for Online Search Queries. Michail Vlachos,Christopher Meek,Zografoula Vagena,Dimitrios Gunopulos 2004 "We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., 'Thanksgiving' or 'Christmas gifts') where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time-series. Finally we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform 'query-by-burst' on the database of time-series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database." SIGMOD Conference Online Event-driven Subsequence Matching over Financial Data Streams. Huanmei Wu,Betty Salzberg,Donghui Zhang 2004 Subsequence similarity matching in time series databases is an important research area for many applications. This paper presents a new approximate approach for automatic online subsequence similarity matching over massive data streams. With a simultaneous on-line segmentation and pruning algorithm over the incoming stream, the resulting piecewise linear representation of the data stream features high sensitivity and accuracy. The similarity definition is based on a permutation followed by a metric distance function, which provides the similarity search with flexibility, sensitivity and scalability. Also, the metric-based indexing methods can be applied for speed-up. To reduce the system burden, the event-driven similarity search is performed only when there is a potential event. The query sequence is the most recent subsequence of piecewise data representation of the incoming stream which is automatically generated by the system. The retrieved results can be analyzed in different ways according to the requirements of specific applications. This paper discusses an application for future data movement prediction based on statistical information. Experiments on real stock data are performed. The correctness of trend predictions is used to evaluate the performance of subsequence similarity matching. SIGMOD Conference An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web. Wensheng Wu,Clement T. Yu,AnHai Doan,Weiyi Meng 2004 An increasing number of data sources now become available on the Web, but often their contents are only accessible through query interfaces. 
For a domain of interest, there often exist many such sources with varied coverage or querying capabilities. As an important step to the integration of these sources, we consider the integration of their query interfaces. More specifically, we focus on the crucial step of the integration: accurately matching the interfaces. While the integration of query interfaces has received more attention recently, current approaches are not sufficiently general: (a) they all model interfaces with flat schemas; (b) most of them only consider 1:1 mappings of fields over the interfaces; (c) they all perform the integration in a blackbox-like fashion and the whole process has to be restarted from scratch if anything goes wrong; and (d) they often require laborious parameter tuning. In this paper, we propose an interactive, clustering-based approach to matching query interfaces. The hierarchical nature of interfaces is captured with ordered trees. Varied types of complex mappings of fields are examined and several approaches are proposed to effectively identify these mappings. We put the human integrator back in the loop and propose several novel approaches to the interactive learning of parameters and the resolution of uncertain mappings. Extensive experiments are conducted and results show that our approach is highly effective. SIGMOD Conference Graph Indexing: A Frequent Structure-based Approach. Xifeng Yan,Philip S. Yu,Jiawei Han 2004 Graphs have become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructure as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of the index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has a 10 times smaller index size, but achieves 3--10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well. SIGMOD Conference Incremental Maintenance of XML Structural Indexes. Ke Yi,Hao He,Ioana Stanoi,Jun Yang 2004 The increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-index, based on the notion of graph bisimilarity, and the recently proposed A(k)-index, based on the notion of local similarity to provide a trade-off between index size and query answering power. For these indexes to be practical, we need effective and efficient incremental maintenance algorithms to keep them consistent with the underlying data.
However, existing update algorithms for structural indexes essentially provide no guarantees on the quality of the index; the updated index is usually larger than necessary, degrading the performance for subsequent queries. In this paper, we propose update algorithms for the 1-index and the A(k)-index with provable guarantees on the resulting index quality. Our algorithms always maintain a minimal index, i.e., merging any two index nodes would result in an incorrect index. For the 1-index, if the data graph is acyclic, our algorithm further ensures that the index is minimum, i.e., it has the least number of index nodes possible. For the A(k)-index, we show that the minimal index our algorithm maintains is also the unique minimum A(k)-index, for both acyclic and cyclic data graphs. Finally, through experimental evaluation, we demonstrate that our algorithms bring significant improvement over previous methods, in terms of both index size and update time. SIGMOD Conference Clustering Objects on a Spatial Network. Man Lung Yiu,Nikos Mamoulis 2004 Clustering is one of the most important analysis tasks in spatial databases. We study the problem of clustering objects that lie on edges of a large weighted spatial network. The distance between two objects is defined by their shortest path distance over the network. Past algorithms are based on the Euclidean distance and cannot be applied for this setting. We propose variants of partitioning, density-based, and hierarchical methods. Their effectiveness and efficiency are evaluated for collections of objects which appear on real road networks. The results show that our methods can correctly identify clusters and they are scalable for large problems. SIGMOD Conference Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Zhen Zhang,Bin He,Kevin Chen-Chuan Chang 2004 "Recently, the Web has been rapidly ""deepened"" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says - or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some ""concerted structure,"" by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar - and, thus, their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax - that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach - it achieves above 85% accuracy for extracting query conditions across random sources." SIGMOD Conference Buffering Database Operations for Enhanced Instruction Cache Performance. Jingren Zhou,Kenneth A.
Ross 2004 "As more and more query processing work can be done in main memory access is becoming a significant cost component of database operations. Recent database research has shown that most of the memory stalls are due to second-level cache data misses and first-level instruction cache misses. While a lot of research has focused on reducing the data cache misses, relatively little research has been done on improving the instruction cache performance of database systems.We first answer the question ""Why does a database system incur so many instruction cache misses?"" We demonstrate that current demand-pull pipelined query execution engines suffer from significant instruction cache thrashing between different operators. We propose techniques to buffer database operations during query execution to avoid instruction cache thrashing. We implement a new light-weight ""buffer"" operator and study various factors which may affect the cache performance. We also introduce a plan refinement algorithm that considers the query plan and decides whether it is beneficial to add additional ""buffer"" operators and where to put them. The benefit is mainly from better instruction locality and better hardware branch prediction. Our techniques can be easily integrated into current database systems without significant changes. Our experiments in a memory-resident PostgreSQL database system show that buffering techniques can reduce the number of instruction cache misses by up to 80% and improve query performance by up to 15%." SIGMOD Conference Dynamic Plan Migration for Continuous Queries Over Data Streams. Yali Zhu,Elke A. Rundensteiner,George T. Heineman 2004 Dynamic plan migration is concerned with the on-the-fly transition from one continuous query plan to a semantically equivalent yet more efficient plan. Migration is important for stream monitoring systems where long-running queries may have to withstand fluctuations in stream workloads and data characteristics. Existing migration methods generally adopt a pause-drain-resume strategy that pauses the processing of new data, purges all old data in the existing plan, until finally the new plan can be plugged into the system. However, these existing strategies do not address the problem of migrating query plans that contain stateful operators, such as joins. We now develop solutions for online plan migration for continuous stateful plans. In particular, in this paper, we propose two alternative strategies, called the moving state strategy and the parallel track strategy, one exploiting reusability and the second employs parallelism to seamlessly migrate between continuous join plans without affecting the results of the query. We develop cost models for both migration strategies to analytically compare them. We embed these migration strategies into the CAPE [7], a prototype system of a stream query engine, and conduct a comparative experimental study to evaluate these two strategies for window-based join plans. Our experimental results illustrate that the two strategies can vary significantly in terms of output rates and intermediate storage spaces given distinct system configurations and stream workloads. 
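Illustrative note on the instruction-cache buffering abstract above (Zhou and Ross): a rough Python sketch, assuming a simple pull-based iterator interface, of a ""buffer"" operator that drains its child in batches so each operator's code stays resident in the instruction cache. The class name and batch size are illustrative, not the paper's implementation.

class BufferOperator:
    # Sits between a parent and a child operator in a demand-pull pipeline.
    # Instead of alternating parent/child code on every tuple (which thrashes
    # the instruction cache), it runs the child tightly to fill a whole batch.
    def __init__(self, child, batch_size=1024):
        self.child = child              # any iterator producing tuples
        self.batch_size = batch_size
        self.batch, self.pos = [], 0

    def next(self):
        if self.pos == len(self.batch):
            self.batch, self.pos = [], 0
            for _ in range(self.batch_size):
                t = next(self.child, None)
                if t is None:
                    break
                self.batch.append(t)
            if not self.batch:
                return None             # child exhausted
        t = self.batch[self.pos]
        self.pos += 1
        return t

A plan refinement step, as the abstract describes, would then decide between which operator pairs such a buffer pays for its extra copying and memory use.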
SIGMOD Conference Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004 Gerhard Weikum,Arnd Christian König,Stefan Deßloch 2004 Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004 VLDB Resource Sharing in Continuous Sliding-Window Aggregates. Arvind Arasu,Jennifer Widom 2004 We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address sliding-window aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (time-based, tuple-based, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through experimental study. VLDB An Integration Framework for Sensor Networks and Data Stream Management Systems. Daniel J. Abadi,Wolfgang Lindner,Samuel Madden,Jörg Schuler 2004 This demonstration shows an integrated query processing environment where users can seamlessly query both a data stream management system and a sensor network with one query expression. By integrating the two query processing systems, the optimization goals of the sensor network (primarily power) and server network (primarily latency and quality) can be unified into one quality of service metric. The demo shows various steps of the unified optimization process for a sample query where the effects of each step that the optimizer takes can be directly viewed using a quality of service monitor. Our demo includes sensors deployed in the demo area in a tiny mockup of a factory application. VLDB Whither Data Mining? Rakesh Agrawal,Ramakrishnan Srikant 2004 The last decade has witnessed tremendous advances in data mining. We take a retrospective look at these developments, focusing on association rules discovery, and discuss the challenges and opportunities ahead. VLDB P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload. Sang Kyun Cha,Changbin Song 2004 "Over the past thirty years since the System R and Ingres projects started to lay the foundation for today's RDBMS implementations, the underlying hardware and software platforms have changed dramatically. However, the fundamental RDBMS architecture, especially the storage engine architecture, largely remains unchanged. While this conventional architecture may suffice for most of today's applications, its deliverable performance range is far from meeting the growing so-called ""real-time enterprise"" demand of acquiring and querying high-volume update data streams cost-effectively. P*TIME is a new, memory-centric light-weight OLTP RDBMS designed and built from scratch to deliver orders of magnitude higher scalability on commodity SMP hardware than existing RDBMS implementations, not only in search but also in update performance. Its storage engine layer incorporates our previous innovations for exploiting engine-level microparallelism such as differential logging and optimistic latch-free index traversal concurrency control protocol.
This paper presents the architecture and performance of P*TIME and reports our experience of deploying P*TIME as the stock market database server at one of the largest on-line brokerage firms." VLDB "An Electronic Patient Record ""on Steroids"": Distributed, Peer-to-Peer, Secure and Privacy-conscious." Serge Abiteboul,Bogdan Alexe,Omar Benjelloun,Bogdan Cautis,Irini Fundulaki,Tova Milo,Arnaud Sahuguet 2004 "An Electronic Patient Record ""on Steroids"": Distributed, Peer-to-Peer, Secure and Privacy-conscious." VLDB StreamMiner: A Classifier Ensemble-based Engine to Mine Concept-drifting Data Streams. Wei Fan 2004 We demonstrate StreamMiner, a random decision-tree ensemble based engine to mine data streams. A fundamental challenge in data stream mining applications (e.g., credit card transaction authorization, security buy-sell transaction, and phone call records, etc) is concept-drift or the discrepancy between the previously learned model and the true model in the new data. The basic problem is the ability to judiciously select data and adapt the old model to accurately match the changed concept of the data stream. StreamMiner uses several techniques to support mining over data streams with possible concept-drifts. We demonstrate the following two key functionalities of StreamMiner: 1. Detecting possible concept-drift on the fly when the trained streaming model is used to classify incoming data streams without knowing the ground truth. 2. Systematic data selection of old data and new data chunks to compute the optimal model that best fits on the changing data streams. VLDB Networked Query Processing for Distributed Stream-Based Applications. Yanif Ahmad,Ugur Çetintemel 2004 Networked Query Processing for Distributed Stream-Based Applications. VLDB Database Tuning Advisor for Microsoft SQL Server 2005. Sanjay Agrawal,Surajit Chaudhuri,Lubor Kollár,Arunprasad P. Marathe,Vivek R. Narasayya,Manoj Syamala 2004 Database Tuning Advisor for Microsoft SQL Server 2005. VLDB Automated Statistics Collection in DB2 UDB. Ashraf Aboulnaga,Peter J. Haas,Sam Lightstone,Guy M. Lohman,Volker Markl,Ivan Popivanov,Vijayshankar Raman 2004 Automated Statistics Collection in DB2 UDB. VLDB Database Architecture for New Hardware. Anastassia Ailamaki 2004 Database Architecture for New Hardware. VLDB Vision Paper: Enabling Privacy for the Paranoids. Gagan Aggarwal,Mayank Bawa,Prasanna Ganesan,Hector Garcia-Molina,Krishnaram Kenthapadi,Nina Mishra,Rajeev Motwani,Utkarsh Srivastava,Dilys Thomas,Jennifer Widom,Ying Xu 2004 "P3P [23, 24] is a set of standards that allow corporations to declare their privacy policies. Hippocratic Databases [6] have been proposed to implement such policies within a corporation's datastore. From an end-user individual's point of view, both of these rest on an uncomfortable philosophy of trusting corporations to protect his/her privacy. Recent history chronicles several episodes when such trust has been willingly or accidentally violated by corporations facing bankruptcy courts, civil subpoenas or lucrative mergers. We contend that data management solutions for information privacy must restore controls in the individual's hands. We suggest that enabling such control will require a radical re-think on modeling, release, and management of personal data." VLDB A Framework for Projected Clustering of High Dimensional Data Streams. Charu C. Aggarwal,Jiawei Han,Jianyong Wang,Philip S. 
Yu 2004 The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan, stream analysis methods have been proposed in this context. However, a lot of stream data is high-dimensional in nature. High-dimensional data is inherently more complex in clustering, classification, and similarity search. Recent research discusses methods for projected clustering over high-dimensional data sets. This method is however difficult to generalize to data streams because of the complexity of the method and the large volume of the data streams. In this paper, we propose a new, high-dimensional, projected data stream clustering method, called HPStream. The method incorporates a fading cluster structure, and the projection based clustering methodology. It is incrementally updatable and is highly scalable on both the number of dimensions and the size of the data streams, and it achieves better clustering quality in comparison with the previous stream clustering methods. Our performance study with both real and synthetic data sets demonstrates the efficiency and effectiveness of our proposed framework and implementation methods. VLDB Auditing Compliance with a Hippocratic Database. Rakesh Agrawal,Roberto J. Bayardo Jr.,Christos Faloutsos,Jerry Kiernan,Ralf Rantzau,Ramakrishnan Srikant 2004 "We introduce an auditing framework for determining whether a database system is adhering to its data disclosure policies. Users formulate audit expressions to specify the (sensitive) data subject to disclosure review. An audit component accepts audit expressions and returns all queries (deemed ""suspicious"") that accessed the specified data during their execution. The overhead of our approach on query processing is small, involving primarily the logging of each query string along with other minor annotations. Database triggers are used to capture updates in a backlog database. At the time of audit, a static analysis phase selects a subset of logged queries for further analysis. These queries are combined and transformed into an SQL audit query, which when run against the backlog database, identifies the suspicious queries efficiently and precisely. We describe the algorithms and data structures used in a DB2-based implementation of this framework. Experimental results reinforce our design choices and show the practicality of the approach." VLDB Linear Road: A Stream Data Management Benchmark. Arvind Arasu,Mitch Cherniack,Eduardo F. Galvez,David Maier,Anurag Maskey,Esther Ryvkina,Michael Stonebraker,Richard Tibbetts 2004 "This paper specifies the Linear Road Benchmark for Stream Data Management Systems (SDMS). Stream Data Management Systems process streaming data by executing continuous and historical queries while producing query results in real-time. This benchmark makes it possible to compare the performance characteristics of SDMS' relative to each other and to alternative (e.g., Relational Database) systems. Linear Road has been endorsed as an SDMS benchmark by the developers of both the Aurora [1] (out of Brandeis University, Brown University and MIT) and STREAM [8] (out of Stanford University) stream systems. Linear Road simulates a toll system for the motor vehicle expressways of a large metropolitan area. 
The tolling system uses ""variable tolling"" [6, 11, 9]: an increasingly prevalent tolling technique that uses such dynamic factors as traffic congestion and accident proximity to calculate toll charges. Linear Road specifies a variable tolling system for a fictional urban area including such features as accident detection and alerts, traffic congestion measurements, toll calculations and historical queries. After specifying the benchmark, we describe experimental results involving two implementations: one using a commercially available Relational Database and the other using Aurora. Our results show that a dedicated Stream Data Management System can outperform a Relational Database by at least a factor of 5 on streaming data applications." VLDB The Continued Saga of DB-IR Integration. Ricardo A. Baeza-Yates,Mariano P. Consens 2004 The Continued Saga of DB-IR Integration. VLDB WS-CatalogNet: An Infrastructure for Creating, Peering, and Querying e-Catalog Communities. Karim Baïna,Boualem Benatallah,Hye-Young Paik,Farouk Toumani,Christophe Rey,Agnieszka Rutkowska,Bryan Harianto 2004 WS-CatalogNet: An Infrastructure for Creating, Peering, and Querying e-Catalog Communities. VLDB Multi-objective Query Processing for Database Systems. Wolf-Tilo Balke,Ulrich Güntzer 2004 "Query processing in database systems has developed beyond mere exact matching of attribute values. Scoring database objects and retrieving only the top k matches or Pareto-optimal result sets (skyline queries) are already common for a variety of applications. Specialized algorithms using either paradigm can avoid naïve linear database scans and thus improve scalability. However, these paradigms are only two extreme cases of exploring viable compromises for each user's objectives. To find the correct result set for arbitrary cases of multi-objective query processing in databases we will present a novel algorithm for computing sets of objects that are nondominated with respect to a set of monotonic objective functions. Naturally containing top k and skyline retrieval paradigms as special cases, this algorithm maintains scalability also for all cases in between. Moreover, we will show the algorithm's correctness and instance-optimality in terms of necessary object accesses and how the response behavior can be improved by progressively producing result objects as quickly as possible, while the algorithm is still running." VLDB ObjectRank: Authority-Based Keyword Search in Databases. Andrey Balmin,Vagelis Hristidis,Yannis Papakonstantinou 2004 "The ObjectRank system applies authority-based ranking to keyword search in databases modeled as labeled graphs. Conceptually, authority originates at the nodes (objects) containing the keywords and flows to objects according to their semantic connections. Each node is ranked according to its authority with respect to the particular keywords. One can adjust the weight of global importance, the weight of each keyword of the query, the importance of a result actually containing the keywords versus being referenced by nodes containing them, and the volume of authority flow via each type of semantic connection. Novel performance challenges and opportunities are addressed. First, schemas impose constraints on the graph, which are exploited for performance purposes. Second, in order to address the issue of authority ranking with respect to the given keywords (as opposed to Google's global PageRank) we precompute single keyword ObjectRanks and combine them during run time. 
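As a rough illustration of the authority-flow ranking described in the ObjectRank entry above, the following sketch runs a keyword-biased power iteration over a small labeled graph; the per-edge-type transfer weights, damping factor, and graph are invented for the example, and the paper's exact formulas and its combination of precomputed single-keyword ranks may differ.

# Keyword-biased authority flow over a labeled graph (illustrative, not ObjectRank's exact model).
def authority_flow(nodes, edges, base_set, d=0.85, iters=50):
    # edges: (src, dst, edge_type); transfer: fraction of authority carried by
    # each semantic connection type (hypothetical values).
    transfer = {"cites": 0.7, "authored_by": 0.2, "same_conf": 0.1}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    base = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - d) * base[n] for n in nodes}
        for src, dst, etype in edges:
            fanout = sum(1 for s, _, t in edges if s == src and t == etype)
            nxt[dst] += d * rank[src] * transfer.get(etype, 0.0) / fanout
        rank = nxt
    return rank

nodes = ["paper_a", "paper_b", "author_x"]
edges = [("paper_a", "paper_b", "cites"), ("paper_a", "author_x", "authored_by")]
print(authority_flow(nodes, edges, base_set={"paper_a"}))   # authority originates at the keyword node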
We conducted user surveys and a set of performance experiments on multiple real and synthetic datasets, to assess the semantic meaningfulness and performance of ObjectRank." VLDB A Framework for Using Materialized XPath Views in XML Query Processing. Andrey Balmin,Fatma Özcan,Kevin S. Beyer,Roberta Cochrane,Hamid Pirahesh 2004 XML languages, such as XQuery, XSLT and SQL/XML, employ XPath as the search and extraction language. XPath expressions often define complicated navigation, resulting in expensive query processing, especially when executed over large collections of documents. In this paper, we propose a framework for exploiting materialized XPath views to expedite processing of XML queries. We explore a class of materialized XPath views, which may contain XML fragments, typed data values, full paths, node references or any combination thereof. We develop an XPath matching algorithm to determine when such views can be used to answer a user query containing XPath expressions. We use the match information to identify the portion of an XPath expression in the user query which is not covered by the XPath view. Finally, we construct, possibly multiple, compensation expressions which need to be applied to the view to produce the query result. Experimental evaluation, using our prototype implementation, shows that the matching algorithm is very efficient and usually accounts for a small fraction of the total query compilation time. VLDB Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations. Nagender Bandi,Chengyu Sun,Amr El Abbadi,Divyakant Agrawal 2004 Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations. VLDB An Annotation Management System for Relational Databases. Deepavali Bhagwat,Laura Chiticariu,Wang Chiew Tan,Gaurav Vijayvargiya 2004 We present an annotation management system for relational databases. In this system, every piece of data in a relation is assumed to have zero or more annotations associated with it and annotations are propagated along, from the source to the output, as data is being transformed through a query. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with integration of scientific and biological data. We present an extension, pSQL, of a fragment of SQL that has three different types of annotation propagation schemes, each useful for different purposes. The default scheme propagates annotations according to where data is copied from. The default-all scheme propagates annotations according to where data is copied from among all equivalent formulations of a given query. The custom scheme allows a user to specify how annotations should propagate. We present a storage scheme for the annotations and describe algorithms for translating a pSQL query under each propagation scheme into one or more SQL queries that would correctly retrieve the relevant annotations according to the specified propagation scheme. For the default-all scheme, we also show how we generate finitely many queries that can simulate the annotation propagation behavior of the set of all equivalent queries, which is possibly infinite. The algorithms are implemented and the feasibility of the system is demonstrated by a set of experiments that we have conducted. VLDB Technology Challenges in a Data Warehouse. Ramesh Bhashyam 2004 This presentation will discuss several database technology challenges that are faced when building a data warehouse. 
It will touch on the challenges posed by high-capacity drives and the mechanisms in the Teradata DBMS that address them. It will consider the features and capabilities required of a database in a mixed application environment of a warehouse, and some solutions to address that. VLDB Object Fusion in Geographic Information Systems. Catriel Beeri,Yaron Kanza,Eliyahu Safra,Yehoshua Sagiv 2004 Given two geographic databases, a fusion algorithm should produce all pairs of corresponding objects (i.e., objects that represent the same real-world entity). Four fusion algorithms, which only use locations of objects, are described and their performance is measured in terms of recall and precision. These algorithms are designed to work even when locations are imprecise and each database represents only some of the real-world entities. Results of extensive experimentation are presented and discussed. The tests show that the performance depends on the density of the data sources and the degree of overlap among them. All four algorithms are much better than the current state of the art (i.e., the one-sided nearest-neighbor join). One of these four algorithms is best in all cases, at a cost of a small increase in the running time compared to the other algorithms. VLDB Returning Modified Rows - SELECT Statements with Side Effects. Andreas Behm,Serge Rielau,Richard Swagerman 2004 SQL in the IBM® DB2® Universal DatabaseTM for Linux®, UNIX®, and Windows® (DB2 UDB) database management product has been extended to support nested INSERT, UPDATE, and DELETE operations in SELECT statements. This allows database applications to perform additional processing on modified rows. Within a single unit of work, applications can retrieve a result set containing the modified rows from a table or view modified by an SQL data-change operation. This eliminates the need to select the row after an INSERT or UPDATE, or before a DELETE statement. As a result, fewer network round trips, less server CPU time, fewer cursors, and less server memory are required. In addition, deadlocks can be avoided. The proposed approach is integrated with the set semantics of SQL, and does not require any procedural logic or modifications on the underlying relational data model. Pipelining multiple update, insert and delete operations using the same source data provides a very efficient way for multitable data-change statements typically found in ETL (extraction, transformation, load) applications. We demonstrate significant performance benefits with our experiences in the TPC-C benchmark. Experimental results show that the new SQL is more efficient in query execution compared to classic SQL. VLDB Managing Data from High-Throughput Genomic Processing: A Case Study. Toby Bloom,Ted Sharpe 2004 Genomic data has become the canonical example of very large, very complex data sets. As such, there has been significant interest in ways to provide targeted database support to address issues that arise in genomic processing. Whether genomic data is truly a special case, or just another application area exhibiting problems common to other domains, is an as yet unanswered question. In this abstract, we explore the structure and processing requirements of a large-scale genome sequencing center, as a case study of the issues that arise in genomic data management, and as a means to compare those issues with those that arise in other domains. VLDB Computing Frequent Itemsets Inside Oracle 10G.
Wei Li,Ari Mozes 2004 Frequent itemset counting is the first step for most association rule algorithms and some classification algorithms. It is the process of counting the number of occurrences of a set of items that happen across many transactions. The goal is to find those items which occur together most often. Expressing this functionality in RDBMS engines is difficult for two reasons. First, it leads to extremely inefficient execution when using existing RDBMS operations since they are not designed to handle this type of workload. Second, it is difficult to express the special output type of itemsets. In Oracle 10G, we introduce a new SQL table function which encapsulates the work of frequent itemset counting. It accepts the input dataset along with some user-configurable information, and it directly produces the frequent itemset results. We present examples of typical computations with frequent itemset counting inside Oracle 10G. We also describe how Oracle dynamically adapts during frequent itemset execution as a result of changes in the nature of the data as well as changes in the available system resources. VLDB Production Database Systems: Making Them Easy is Hard Work. David Campbell 2004 Enterprise capable database products have evolved into incredibly complex systems, some of which present hundreds of configuration parameters to the system administrator. So, while the processing and storage costs for maintaining large volumes of data have plummeted, the human costs associated with maintaining the data have continued to rise. In this presentation, we discuss the framework and approach used by the team who took Microsoft SQL Server from a state where it had several hundred configuration parameters to a system that can configure itself and respond to changes in workload and environment with little human intervention. VLDB "Integrating Automatic Data Acquisition with Business Processes - Experiences with SAP's Auto-ID Infrastructure." Christof Bornhövd,Tao Lin,Stephan Haller,Joachim Schaper 2004 "Smart item technologies, like RFID and sensor networks, are considered to be the next big step in business process automation [1]. Through automatic and real-time data acquisition, these technologies can benefit a great variety of industries by improving the efficiency of their operations. SAP's Auto-ID infrastructure enables the integration of RFID and sensor technologies with existing business processes. In this paper we give an overview of the existing infrastructure, discuss lessons learned from successful customer pilots, and point out some of the open research issues." VLDB Client-Based Access Control Management for XML documents. Luc Bouganim,François Dang Ngoc,Philippe Pucheral 2004 The erosion of trust put in traditional database servers and in Database Service Providers, the growing interest for different forms of data dissemination and the concern for protecting children from suspicious Internet content are different factors that lead to move the access control from servers to clients. Several encryption schemes can be used to serve this purpose but all suffer from a static way of sharing data. With the emergence of hardware and software security elements on client devices, more dynamic client-based access control schemes can be devised. This paper proposes an efficient client-based evaluator of access control rules for regulating access to XML documents. 
This evaluator benefits from a dedicated index to quickly converge towards the authorized parts of a - potentially streaming - document. Additional security mechanisms guarantee that prohibited data can never be disclosed during the processing and that the input document is protected from any form of tampering. Experiments on synthetic and real datasets demonstrate the effectiveness of the approach. VLDB From XML View Updates to Relational View Updates: old solutions to a new problem. Vanessa P. Braganholo,Susan B. Davidson,Carlos A. Heuser 2004 This paper addresses the question of updating relational databases through XML views. Using query trees to capture the notions of selection, projection, nesting, grouping, and heterogeneous sets found throughout most XML query languages, we show how XML views expressed using query trees can be mapped to a set of corresponding relational views. We then show how updates on the XML view are mapped to updates on the corresponding relational views. Existing work on updating relational views can then be leveraged to determine whether or not the relational views are updatable with respect to the relational updates, and if so, to translate the updates to the underlying relational database. VLDB On The Marriage of Lp-norms and Edit Distance. Lei Chen,Raymond T. Ng 2004 "Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are nonmetric. The first contribution of this paper is the proposal of a new distance function, which we call ERP (""Edit distance with Real Penalty""). Representing a marriage of L1-norm and the edit distance, ERP can support local time shifting, and is a metric. The second contribution of the paper is the development of pruning strategies for large time series databases. Given that ERP is a metric, one way to prune is to apply the triangle inequality. Another way to prune is to develop a lower bound on the ERP distance. We propose such a lower bound, which has the nice computational property that it can be efficiently indexed with a standard B+-tree. Moreover, we show that these two ways of pruning can be used simultaneously for ERP distances. Specifically, the false positives obtained from the B+-tree can be further minimized by applying the triangle inequality. Based on extensive experimentation with existing benchmarks and techniques, we show that this combination delivers superb pruning power and search time performance, and dominates all existing strategies." VLDB Taming XPath Queries by Minimizing Wildcard Steps. Chee Yong Chan,Wenfei Fan,Yiming Zeng 2004 This paper presents a novel and complementary technique to optimize an XPath query by minimizing its wildcard steps. Our approach is based on using a general composite axis called the layer axis to rewrite a sequence of XPath steps (all of which are wildcard steps except for possibly the last) into a single layer-axis step. We describe an efficient implementation of the layer axis and present a novel and efficient rewriting algorithm to minimize both non-branching as well as branching wildcard steps in XPath queries.
We also demonstrate the usefulness of wildcard-step elimination by proposing an optimized evaluation strategy for wildcard-free XPath queries that enables selective loading of only the relevant input XML data for query evaluation. Our experimental results not only validate the scalability and efficiency of our optimized evaluation strategy, but also demonstrate the effectiveness of our rewriting algorithm for minimizing wildcard steps in XPath queries. To the best of our knowledge, this is the first effort that addresses this new optimization problem. VLDB Remembrance of Streams Past: Overload-Sensitive Management of Archived Streams. Sirish Chandrasekaran,Michael J. Franklin 2004 This paper studies Data Stream Management Systems that combine real-time data streams with historical data, and hence access incoming streams and archived data simultaneously. A significant problem for these systems is the I/O cost of fetching historical data which inhibits processing of the live data streams. Our solution is to reduce the I/O cost for accessing the archive by retrieving only a reduced (summarized or sampled) version of the historical data. This paper does not propose new summarization or sampling techniques, but rather a framework in which multiple resolutions of summarization/sampling can be generated efficiently. The query engine can select the appropriate level of summarization to use depending on the resources currently available. The central research problem studied is whether to generate the multiple representations of archived data eagerly upon data-arrival, lazily at query-time, or in a hybrid fashion. Concrete techniques for each approach are presented, which are tied to a specific data reduction technique (random sampling). The tradeoffs among the three approaches are studied both analytically and experimentally. VLDB A Uniform System for Publishing and Maintaining XML Data. Byron Choi,Wenfei Fan,Xibei Jia,Arek Kasprzyk 2004 A Uniform System for Publishing and Maintaining XML Data. VLDB Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data. Reynold Cheng,Yuni Xia,Sunil Prabhakar,Rahul Shah,Jeffrey Scott Vitter 2004 It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likeliness the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications, it is only necessary to know whether the probability exceeds a given threshold - we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries. In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting uncertainty information to an R-tree. 
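To make the query class concrete, here is a toy version of a probabilistic threshold range query under the simplifying assumption that each uncertain value follows a uniform pdf over its interval; the index structures that are the subject of the entry above are not shown, and the data is invented.

# Toy probabilistic threshold range query (uniform pdfs assumed; no index).
def prob_in_range(lo, hi, q_lo, q_hi):
    # Probability that a value uniform on [lo, hi] falls inside [q_lo, q_hi].
    overlap = max(0.0, min(hi, q_hi) - max(lo, q_lo))
    return overlap / (hi - lo) if hi > lo else float(q_lo <= lo <= q_hi)

def threshold_query(items, q_lo, q_hi, threshold):
    # items: {object_id: (lo, hi)}; keep objects whose probability of lying
    # inside the query range exceeds the threshold.
    return [oid for oid, (lo, hi) in items.items()
            if prob_in_range(lo, hi, q_lo, q_hi) > threshold]

readings = {"s1": (20.0, 30.0), "s2": (28.0, 36.0), "s3": (10.0, 12.0)}
print(threshold_query(readings, 25.0, 35.0, 0.5))   # ['s2']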
We establish the difficulty of this problem by mapping one-dimensional intervals to a two-dimensional space, and show that the problem of interval indexing with probabilities is significantly harder than interval indexing, which is considered a well-studied problem. To overcome the limitations of this R-tree based structure, we apply a technique we call variance-based clustering, where data points with similar degrees of uncertainty are clustered together. Our extensive index structure can answer the queries for various kinds of uncertainty pdfs, in an almost optimal sense. We conduct experiments to validate the superior performance of both indexing schemes. VLDB Probabilistic Ranking of Database Query Results. Surajit Chaudhuri,Gautam Das,Vagelis Hristidis,Gerhard Weikum 2004 We investigate the problem of ranking answers to a database query when many tuples are returned. We adapt and apply principles of probabilistic models from Information Retrieval for structured data. Our proposed solution is domain independent. It leverages data and workload statistics and correlations. Our ranking functions can be further customized for different applications. We present results of preliminary experiments which demonstrate the efficiency as well as the quality of our ranking system. VLDB Self-Managing Technology in Database Management Systems. Surajit Chaudhuri,Benoît Dageville,Guy M. Lohman 2004 Self-Managing Technology in Database Management Systems. VLDB Managing RFID Data. Sudarshan S. Chawathe,Venkat Krishnamurthy,Sridhar Ramachandran,Sanjay E. Sarma 2004 Radio-Frequency Identification (RFID) technology enables sensors to efficiently and inexpensively track merchandise and other objects. The vast amount of data resulting from the proliferation of RFID readers and tags poses some interesting challenges for data management. We present a brief introduction to RFID technology and highlight a few of the data management challenges. VLDB HiFi: A Unified Architecture for High Fan-in Systems. Owen Cooper,Anil Edakkunni,Michael J. Franklin,Wei Hong,Shawn R. Jeffery,Sailesh Krishnamurthy,Frederick Reiss,Shariq Rizvi,Eugene Wu 2004 "Advances in data acquisition and sensor technologies are leading towards the development of ""High Fan-in"" architectures: widely distributed systems whose edges consist of numerous receptors such as sensor networks and RFID readers and whose interior nodes consist of traditional host computers organized using the principle of successive aggregation. Such architectures pose significant new data management challenges. The HiFi system, under development at UC Berkeley, is aimed at addressing these challenges. We demonstrate an initial prototype of HiFi that uses data stream query processing to acquire, filter, and aggregate data from multiple devices including sensor motes, RFID readers, and low power gateways organized as a High Fan-in system." VLDB An Automatic Data Grabber for Large Web Sites. Valter Crescenzi,Giansalvatore Mecca,Paolo Merialdo,Paolo Missier 2004 We demonstrate a system to automatically grab data from data intensive web sites. The system first infers a model that describes at the intensional level the web site as a collection of classes; each class represents a set of structurally homogeneous pages, and it is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred, with the help of an external wrapper generator.
The model, together with the library of wrappers, can thus be used to navigate the site and extract the data. VLDB Lifting the Burden of History from Adaptive Query Processing. Amol Deshpande,Joseph M. Hellerstein 2004 Adaptive query processing schemes attempt to re-optimize query plans during the course of query execution. A variety of techniques for adaptive query processing have been proposed, varying in the granularity at which they can make decisions [8]. The eddy [1] is the most aggressive of these techniques, with the flexibility to choose tuple-by-tuple how to order the application of operators. In this paper we identify and address a fundamental limitation of the original eddies proposal: the burden of history in routing. We observe that routing decisions have long-term effects on the state of operators in the query, and can severely constrain the ability of the eddy to adapt over time. We then propose a mechanism we call STAIRs that allows the query engine to manipulate the state stored inside the operators and undo the effects of past routing decisions. We demonstrate that eddies with STAIRs achieve both high adaptivity and good performance in the face of uncertainty, outperforming prior eddy proposals by orders of magnitude. VLDB Efficient Constraint Processing for Highly Personalized Location Based Services. Zhengdao Xu,Hans-Arno Jacobsen 2004 Efficient Constraint Processing for Highly Personalized Location Based Services. VLDB PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. Conor Cunningham,Goetz Graefe,César A. Galindo-Legaria 2004 PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. VLDB The NEXT Logical Framework for XQuery. Alin Deutsch,Yannis Papakonstantinou,Yu Xu 2004 The NEXT Logical Framework for XQuery. VLDB Automatic SQL Tuning in Oracle 10g. Benoît Dageville,Dinesh Das,Karl Dias,Khaled Yagoub,Mohamed Zaït,Mohamed Ziauddin 2004 "SQL tuning is a very critical aspect of database performance tuning. It is an inherently complex activity requiring a high level of expertise in several domains: query optimization, to improve the execution plan selected by the query optimizer; access design, to identify missing access structures; and SQL design, to restructure and simplify the text of a badly written SQL statement. Furthermore, SQL tuning is a time consuming task due to the large volume and evolving nature of the SQL workload and its underlying data. In this paper we present the new Automatic SQL Tuning feature of Oracle 10g. This technology is implemented as a core enhancement of the Oracle query optimizer and offers a comprehensive solution to the SQL tuning challenges mentioned above. Automatic SQL Tuning introduces the concept of SQL profiling to transparently improve execution plans. It also generates SQL tuning recommendations by performing cost-based access path and SQL structure ""what-if"" analyses. This feature is exposed to the user through both graphical and command line interfaces. The Automatic SQL Tuning is an integral part of the Oracle's framework for self-managing databases. The superiority of this new technology is demonstrated by comparing the results of Automatic SQL Tuning to manual tuning using a real customer workload." VLDB Towards an Internet-Scale XML Dissemination Service. Yanlei Diao,Shariq Rizvi,Michael J. Franklin 2004 Publish/subscribe systems have demonstrated the ability to scale to large numbers of users and high data rates when providing content-based data dissemination services on the Internet. 
However, their services are limited by the data semantics and query expressiveness that they support. On the other hand, the recent work on selective dissemination of XML data has made significant progress in moving from XML filtering to the richer functionality of transformation for result customization, but in general has ignored the challenges of deploying such XML-based services on an Internet-scale. In this paper, we address these challenges in the context of incorporating the rich functionality of XML data dissemination in a highly scalable system. We present the architectural design of ONYX, a system based on an overlay network. We identify the salient technical challenges in supporting XML filtering and transformation in this environment and propose techniques for solving them. VLDB Efficient Query Evaluation on Probabilistic Databases. Nilesh N. Dalvi,Dan Suciu 2004 "We describe a system that supports arbitrarily complex SQL queries on probabilistic databases. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is efficient query evaluation, a problem that has not received attention in the past. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm." VLDB Supporting Ontology-Based Semantic matching in RDBMS. Souripriya Das,Eugene Inseok Chong,George Eadon,Jagannathan Srinivasan 2004 Ontologies are increasingly being used to build applications that utilize domain-specific knowledge. This paper addresses the problem of supporting ontology-based semantic matching in RDBMS. Specifically, 1) A set of SQL operators, namely ONT_RELATED, ONT_EXPAND, ONT_DISTANCE, and ONT_PATH, are introduced to perform ontology-based semantic matching, 2) A new indexing scheme ONT_INDEXTYPE is introduced to speed up ontology-based semantic matching operations, and 3) System-defined tables are provided for storing ontologies specified in OWL. Our approach enables users to reference ontology data directly from SQL using the semantic match operators, thereby opening up possibilities of combining with other operations such as joins as well as making the ontology-driven applications easy to develop and efficient. In contrast, other approaches use RDBMS only for storage of ontologies and querying of ontology data is typically done via APIs. This paper presents the ontology-related functionality including inferencing, discusses how it is implemented on top of Oracle RDBMS, and illustrates the usage with several database applications. VLDB Distributed Set Expression Cardinality Estimation. Abhinandan Das,Sumit Ganguly,Minos N. Garofalakis,Rajeev Rastogi 2004 We consider the problem of estimating set-expression cardinality in a distributed streaming environment where rapid update streams originating at remote sites are continually transmitted to a central processing system. At the core of our algorithmic solutions for answering set-expression cardinality queries are two novel techniques for lowering data communication costs without sacrificing answer precision. 
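For the probabilistic-database entry above (Dalvi and Suciu), a small worked example shows how answer probabilities come out for one easy, safe query shape over a tuple-independent database; the query optimizer, the #P-hard cases, and the Monte-Carlo fallback discussed in that entry are not reproduced, and the data is invented.

# Answer probability for q(b) :- R(a), S(a, b) over a tuple-independent probabilistic DB.
R = {("p1",): 0.9, ("p2",): 0.4}                              # P(tuple exists) for R(a)
S = {("p1", "x"): 0.8, ("p2", "x"): 0.5, ("p2", "y"): 0.7}    # P(tuple exists) for S(a, b)

def answer_probability(b):
    # P(b in answer) = 1 - prod over a of (1 - P(R(a)) * P(S(a, b))); valid here
    # because different a-values use disjoint, independent tuples.
    p_miss = 1.0
    for (a,), pr in R.items():
        p_miss *= 1.0 - pr * S.get((a, b), 0.0)
    return 1.0 - p_miss

print(round(answer_probability("x"), 3))   # 1 - (1 - 0.72) * (1 - 0.20) = 0.776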
Our first technique exploits global knowledge of the distribution of certain frequently occurring stream elements to significantly reduce the transmission of element state information to the central site. Our second technical contribution involves a novel way of capturing the semantics of the input set expression in a boolean logic formula, and using models (of the formula) to determine whether an element state change at a remote site can affect the set expression result. Results of our experimental study with real-life as well as synthetic data sets indicate that our distributed set-expression cardinality estimation algorithms achieve substantial reductions in message traffic compared to naive approaches that provide the same accuracy guarantees. VLDB Similarity Search for Web Services. Xin Dong,Alon Y. Halevy,Jayant Madhavan,Ema Nemes,Jun Zhang 2004 Web services are loosely coupled software components, published, located, and invoked across the web. The growing number of web services available within an organization and on the Web raises a new and challenging search problem: locating desired web services. Traditional keyword search is insufficient in this context: the specific types of queries users require are not captured, the very small text fragments in web services are unsuitable for keyword search, and the underlying structure and semantics of the web services are not exploited. We describe the algorithms underlying the Woogle search engine for web services. Woogle supports similarity search for web services, such as finding similar web-service operations and finding operations that compose with a given one. We describe novel techniques to support these types of searches, and an experimental study on a collection of over 1500 web-service operations that shows the high recall and precision of our algorithms. VLDB Containment of Nested XML Queries. Xin Dong,Alon Y. Halevy,Igor Tatarinov 2004 Query containment is the most fundamental relationship between a pair of database queries: a query Q is said to be contained in a query Q′ if the answer for Q is always a subset of the answer for Q′, independent of the current state of the database. Query containment is an important problem in a wide variety of data management applications, including verification of integrity constraints, reasoning about contents of data sources in data integration, semantic caching, verification of knowledge bases, determining queries independent of updates, and most recently, in query reformulation for peer data management systems. Query containment has been studied extensively in the relational context and for XPath queries, but not for XML queries with nesting. We consider the theoretical aspects of the problem of query containment for XML queries with nesting. We begin by considering conjunctive XML queries (c-XQueries), and show that containment is in polynomial time if we restrict the fanout (number of sibling sub-blocks) to be 1. We prove that for arbitrary fanout, containment is coNP-hard already for queries with nesting depth 2, even if the query does not include variables in the return clauses. We then show that for queries with fixed nesting depth, containment is coNP-complete. Next, we establish the computational complexity of query containment for several practical extensions of c-XQueries, including queries with union and arithmetic comparisons, and queries where the XPath expressions may include descendant edges and negation.
Finally, we describe a few heuristics for speeding up query containment checking in practice by exploiting properties of the queries and the underlying schema. VLDB ShreX: Managing XML Documents in Relational Databases. Fang Du,Sihem Amer-Yahia,Juliana Freire 2004 We describe ShreX, a freely-available system for shredding, loading and querying XML documents in relational databases. ShreX supports all mapping strategies proposed in the literature as well as strategies available in commercial RDBMSs. It provides generic (mapping-independent) functions for loading shredded documents into relations and for translating XML queries into SQL. ShreX is portable and can be used with any relational database backend. VLDB Accurate and Efficient Crawling for Relevant Websites. Martin Ester,Hans-Peter Kriegel,Matthias Schubert 2004 Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant webpages, there are various applications which target whole websites instead of single webpages. For example, companies are represented by websites, not by individual webpages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler to employ the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the webpages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single webpages. VLDB Model-Driven Data Acquisition in Sensor Networks. Amol Deshpande,Carlos Guestrin,Samuel Madden,Joseph M. Hellerstein,Wei Hong 2004 "Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that ""the sensornet is a database"" is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a model of that reality is required to complement the readings. In this paper, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice.
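As a toy counterpart to the acquisition-planning problem described in the model-driven data acquisition entry above, the following greedy loop picks sensors to read by weighing modeled uncertainty against acquisition cost; it assumes independent per-sensor variances and is not the paper's algorithm, whose statistical model (with correlations across attributes) is considerably richer.

# Toy greedy acquisition heuristic: read sensors until every attribute's modeled
# variance is below a target, preferring the best variance-reduced-per-cost ratio.
def choose_readings(prior_var, cost, target_var):
    acquired, var = [], dict(prior_var)
    while var and max(var.values()) > target_var:
        candidates = [s for s in var if s not in acquired]
        if not candidates:
            break
        best = max(candidates, key=lambda s: var[s] / cost[s])
        acquired.append(best)
        var[best] = 0.0   # observing a sensor removes its (assumed independent) uncertainty
    return acquired

print(choose_readings({"t1": 4.0, "t2": 0.5, "t3": 9.0},
                      {"t1": 1.0, "t2": 1.0, "t3": 3.0},
                      target_var=1.0))      # ['t1', 't3']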
We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques." VLDB Relational link-based ranking. Floris Geerts,Heikki Mannila,Evimaria Terzi 2004 Link analysis methods show that the interconnections between web pages carry a great deal of valuable information. The link analysis methods are, however, inherently oriented towards analyzing binary relations. We consider the question of generalizing link analysis methods for analyzing relational databases. To this aim, we provide a generalized ranking framework and address its practical implications. More specifically, we associate with each relational database and set of queries a unique weighted directed graph, which we call the database graph. We explore the properties of database graphs. In analogy to link analysis algorithms, which use the Web graph to rank web pages, we use the database graph to rank partial tuples. In this way we can, e.g., extend the PageRank link analysis algorithm to relational databases and give this extension a random querier interpretation. Similarly, we extend the HITS link analysis algorithm to relational databases. We conclude with some preliminary experimental results. VLDB High Performance Index Build Algorithms for Intranet Search Engines. Marcus Fontoura,Eugene J. Shekita,Jason Y. Zien,Sridhar Rajagopalan,Andreas Neumann 2004 There has been a substantial amount of research on high-performance algorithms for constructing an inverted text index. However, constructing the inverted index in an intranet search engine is only the final step in a more complicated index build process. Among other things, this process requires an analysis of all the data being indexed to compute measures like PageRank. The time to perform this global analysis step is significant compared to the time to construct the inverted index, yet it has not received much attention in the research literature. In this paper, we describe how the use of slightly outdated information from global analysis and a fast index construction algorithm based on radix sorting can be combined in a novel way to significantly speed up the index build process without sacrificing search quality. VLDB Queries and Updates in the coDB Peer to Peer Database System. Enrico Franconi,Gabriel M. Kuper,Andrei Lopatenko,Ilya Zaihrayeu 2004 In this short paper we present the coDB P2P DB system. A network of databases, possibly with different schemas, is interconnected by means of GLAV coordination rules, which are inclusions of conjunctive queries, with possibly existential variables in the head; coordination rules may be cyclic. Each node can be queried in its schema for data, which the node can fetch from its neighbours, if a coordination rule is involved. VLDB Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems. Prasanna Ganesan,Mayank Bawa,Hector Garcia-Molina 2004 We consider the problem of horizontally partitioning a dynamic relation across a large number of disks/nodes by the use of range partitioning. Such partitioning is often desirable in large-scale parallel databases, as well as in peer-to-peer (P2P) systems.
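One concrete piece of the index-build pipeline sketched in the intranet search entry above is a radix sort of postings; the fragment below shows a stable least-significant-byte radix sort of (term_id, doc_id) pairs feeding an inverted index, with the byte width and record layout invented for the example (the paper's algorithm and its use of slightly outdated global-analysis data are not reproduced).

# Illustrative LSD radix sort of (term_id, doc_id) postings for an inverted-index build.
def radix_sort_postings(postings, key_bytes=4):
    # Stable sort by term_id, one byte per pass, least significant byte first.
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for p in postings:
            buckets[(p[0] >> shift) & 0xFF].append(p)
        postings = [p for bucket in buckets for p in bucket]
    return postings

def build_inverted_index(postings):
    index = {}
    for term, doc in radix_sort_postings(postings):
        index.setdefault(term, []).append(doc)
    return index

print(build_inverted_index([(7, 2), (3, 1), (7, 1), (3, 9)]))   # {3: [1, 9], 7: [2, 1]}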
As tuples are inserted and deleted, the partitions may need to be adjusted, and data moved, in order to achieve storage balance across the participant disks/nodes. We propose efficient, asymptotically optimal algorithms that ensure storage balance at all times, even against an adversarial insertion and deletion of tuples. We combine the above algorithms with distributed routing structures to architect a P2P system that supports efficient range queries, while simultaneously guaranteeing storage balance. VLDB Write-Optimized B-Trees. Goetz Graefe 2004 "Large writes are beneficial both on individual disks and on disk arrays, e.g., RAID-5. The presented design enables large writes of internal B-tree nodes and leaves. It supports both in-place updates and large append-only (""log-structured"") write operations within the same storage volume, within the same B-tree, and even at the same time. The essence of the proposal is to make page migration inexpensive, to migrate pages while writing them, and to make such migration optional rather than mandatory as in log-structured file systems. The inexpensive page migration also aids traditional defragmentation as well as consolidation of free space needed for future large writes. These advantages are achieved with a very limited modification to conventional B-trees that also simplifies other B-tree operations, e.g., key range locking and compression. Prior proposals and prototypes implemented transacted B-tree on top of log-structured file systems and added transaction support to log-structured file systems. Instead, the presented design adds techniques and performance characteristics of log-structured file systems to traditional B-trees and their standard transaction support, notably without adding a layer of indirection for locating B-tree nodes on disk. The result retains fine-granularity locking, full transactional ACID guarantees, fast search performance, etc. expected of a modern B-tree implementation, yet adds efficient transacted page relocation and large, high-bandwidth writes." VLDB COMPASS: A Concept-based Web Search Engine for HTML, XML, and Deep Web Data. Jens Graupmann,Michael Biwer,Christian Zimmer,Patrick Zimmer,Matthias Bender,Martin Theobald,Gerhard Weikum 2004 COMPASS: A Concept-based Web Search Engine for HTML, XML, and Deep Web Data. VLDB XQuery on SQL Hosts. Torsten Grust,Sherif Sakr,Jens Teubner 2004 "Relational database systems may be turned into efficient XML and XPath processors if the system is provided with a suitable relational tree encoding. This paper extends this relational XML processing stack and shows that an RDBMS can also serve as a highly efficient XQuery runtime environment. Our approach is purely relational: XQuery expressions are compiled into SQL code which operates on the tree encoding. The core of the compilation procedure trades XQuery's notions of variable scopes and nested iteration (FLWOR blocks) for equi-joins. The resulting relational XQuery processor closely adheres to the language semantics, e.g., it obeys node identity as well as document and sequence order, and can support XQuery's full axis feature. The system exhibits quite promising performance figures in experiments. Somewhat unexpectedly, we will also see that the XQuery compiler can make good use of SQL's OLAP functionality." VLDB Merging the Results of Approximate Match Operations. 
Sudipto Guha,Nick Koudas,Amit Marathe,Divesh Srivastava 2004 "Data Cleaning is an important process that has been at the center of research interest in recent years. An important end goal of effective data cleaning is to identify the relational tuple or tuples that are ""most related"" to a given query tuple. Various techniques have been proposed in the literature for efficiently identifying approximate matches to a query string against a single attribute of a relation. In addition to constructing a ranking (i.e., ordering) of these matches, the techniques often associate, with each match, scores that quantify the extent of the match. Since multiple attributes could exist in the query tuple, issuing approximate match operations for each of them separately will effectively create a number of ranked lists of the relation tuples. Merging these lists to identify a final ranking and scoring, and returning the top-K tuples, is a challenging task. In this paper, we adapt the well-known footrule distance (for merging ranked lists) to effectively deal with scores. We study efficient algorithms to merge rankings, and produce the top-K tuples, in a declarative way. Since techniques for approximately matching a query string against a single attribute in a relation are typically best deployed in a database, we introduce and describe two novel algorithms for this problem and we provide SQL specifications for them. Our experimental case study, using real application data along with a realization of our proposed techniques on a commercial data base system, highlights the benefits of the proposed algorithms and attests to the overall effectiveness and practicality of our approach." VLDB XWAVE: Approximate Extended Wavelets for Streaming Data. Sudipto Guha,Chulyun Kim,Kyuseok Shim 2004 XWAVE: Approximate Extended Wavelets for Streaming Data. VLDB REHIST: Relative Error Histogram Construction Algorithms. Sudipto Guha,Kyuseok Shim,Jungchul Woo 2004 Histograms and Wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, such as V-Optimal, optimize absolute error measures for which the error in estimating a true value of 10 by 20 has the same effect of estimating a true value of 1000 by 1010. However, several researchers have recently pointed out the drawbacks of such schemes and proposed wavelet based schemes to minimize relative error measures. None of these schemes provide satisfactory guarantees - and we provide evidence that the difficulty may lie in the choice of wavelets as the representation scheme. In this paper, we consider histogram construction for the known relative error measures. We develop optimal as well as fast approximation algorithms. We provide a comprehensive theoretical analysis and demonstrate the effectiveness of these algorithms in providing significantly more accurate answers through synthetic and real life data sets. VLDB Combating Web Spam with TrustRank. Zoltán Gyöngyi,Hector Garcia-Molina,Jan O. Pedersen 2004 "Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. 
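For the approximate-match merging entry above (Guha, Koudas, Marathe, and Srivastava), the following background sketch computes the plain Spearman footrule distance between two rankings and merges rankings with a simple median-position heuristic; the paper's score-aware adaptation of the footrule and its SQL formulations are not reproduced, and the data is invented.

# Spearman footrule distance and a simple median-rank merge (background only).
from statistics import median

def footrule(rank_a, rank_b):
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(abs(pos_a[x] - pos_b[x]) for x in pos_a)

def median_rank_merge(rankings):
    items = rankings[0]
    pos = {x: [r.index(x) for r in rankings] for x in items}
    return sorted(items, key=lambda x: median(pos[x]))

lists = [["t1", "t2", "t3"], ["t2", "t1", "t3"], ["t1", "t3", "t2"]]
print(footrule(lists[0], lists[1]))    # 2
print(median_rank_merge(lists))        # ['t1', 't2', 't3']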
Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites." VLDB Discovering and Ranking Semantic Associations over a Large RDF Metabase. Christian Halaschek-Wiener,Boanerges Aleman-Meza,Ismailcem Budak Arpinar,Amit P. Sheth 2004 "Information retrieval over semantic metadata has recently received a great amount of interest in both industry and academia. In particular, discovering complex and meaningful relationships among this data is becoming an active research topic. Just as ranking of documents is a critical component of today's search engines, the ranking of relationships will be essential in tomorrow's semantic analytics engines. Building upon our recent work on specifying these semantic relationships, which we refer to as Semantic Associations, we demonstrate a system where these associations are discovered among a large semantic metabase represented in RDF. Additionally we employ ranking techniques to provide users with the most interesting and relevant results." VLDB Structures, Semantics and Statistics. Alon Y. Halevy 2004 At a fundamental level, the key challenge in data integration is to reconcile the semantics of disparate data sets, each expressed with a different database structure. I argue that computing statistics over a large number of structures offers a powerful methodology for producing semantic mappings, the expressions that specify such reconciliation. In essence, the statistics offer hints about the semantics of the symbols in the structures, thereby enabling the detection of semantically similar concepts. The same methodology can be applied to several other data management tasks that involve search in a space of complex structures and in enabling the next-generation on-the-fly data integration systems. VLDB ROX: Relational Over XML. Alan Halverson,Vanja Josifovski,Guy M. Lohman,Hamid Pirahesh,Mathias Mörschel 2004 "An increasing percentage of the data needed by business applications is being generated in XML format. Storing the XML in its native format will facilitate new applications that exchange business objects in XML format and query portions of XML documents using XQuery. This paper explores the feasibility of accessing natively-stored XML data through traditional SQL interfaces, called Relational Over XML (ROX), in order to avoid the costly conversion of legacy applications to XQuery. It describes the forces that are driving the industry to evolve toward the ROX scenario as well as some of the issues raised by ROX. The impact of denormalization of data in XML documents is discussed both from a semantic and performance perspective. We also weigh the implications of ROX for manageability and query optimization. We experimentally compared the performance of a prototype of the ROX scenario to today's SQL engines, and found that good performance can be achieved through a combination of utilizing XML's hierarchical storage to store relations ""pre-joined"" as well as creating indices over the remaining join columns. 
We have developed an experimental framework using DB2 8.1 for Linux, Unix and Windows, and have gathered initial performance results that validate this approach." VLDB STEPS towards Cache-resident Transaction Processing. Stavros Harizopoulos,Anastassia Ailamaki 2004 Online transaction processing (OLTP) is a multibillion dollar industry with high-end database servers employing state-of-the-art processors to maximize performance. Unfortunately, recent studies show that CPUs are far from realizing their maximum intended throughput because of delays in the processor caches. When running OLTP, instruction-related delays in the memory subsystem account for 25 to 40% of the total execution time. In contrast to data, instruction misses cannot be overlapped with out-of-order execution, and instruction caches cannot grow as the slower access time directly affects the processor speed. The challenge is to alleviate the instruction-related delays without increasing the cache size. We propose Steps, a technique that minimizes instruction cache misses in OLTP workloads by multiplexing concurrent transactions and exploiting common code paths. One transaction paves the cache with instructions, while close followers enjoy a nearly miss-free execution. Steps yields up to 96.7% reduction in instruction cache misses for each additional concurrent transaction, and at the same time eliminates up to 64% of mispredicted branches by loading a repeating execution pattern into the CPU. This paper (a) describes the design and implementation of Steps, (b) analyzes Steps using microbenchmarks, and (c) shows Steps performance when running TPC-C on top of the Shore storage manager. VLDB Architectures and Algorithms for Internet-Scale (P2P) Data Management. Joseph M. Hellerstein 2004 Architectures and Algorithms for Internet-Scale (P2P) Data Management. VLDB A Privacy-Preserving Index for Range Queries. Bijit Hore,Sharad Mehrotra,Gene Tsudik 2004 Database outsourcing is an emerging data management paradigm which has the potential to transform the IT operations of corporations. In this paper we address privacy threats in database outsourcing scenarios where trust in the service provider is limited. Specifically, we analyze the data partitioning (bucketization) technique and algorithmically develop this technique to build privacy-preserving indices on sensitive attributes of a relational table. Such indices enable an untrusted server to evaluate obfuscated range queries with minimal information leakage. We analyze the worst-case scenario of inference attacks that can potentially lead to breach of privacy (e.g., estimating the value of a data element within a small error margin) and identify statistical measures of data privacy in the context of these attacks. We also investigate precise privacy guarantees of data partitioning which form the basic building blocks of our index. We then develop a model for the fundamental privacy-utility tradeoff and design a novel algorithm for achieving the desired balance between privacy and utility (accuracy of range query evaluation) of the index. VLDB Algebraic Manipulation of Scientific Datasets. Bill Howe,David Maier 2004 We investigate algebraic processing strategies for large numeric datasets equipped with a possibly irregular grid structure. Such datasets arise, for example, in computational simulations, observation networks, medical imaging, and 2-D and 3-D rendering. 
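To make the bucketization idea from the privacy-preserving index entry above concrete, here is a toy version in which the client keeps the bucket boundaries, the server stores only bucket identifiers with opaque payloads, and a range query is rewritten into a set of bucket identifiers; real encryption, the inference-attack analysis, and the privacy-utility tuning from that entry are omitted, and all names and data are invented.

# Toy bucketized store for privacy-preserving range queries (no real cryptography).
import bisect

BOUNDARIES = [20, 40, 60, 80]           # bucket boundaries, kept on the client side

def bucket_of(value):
    return bisect.bisect_right(BOUNDARIES, value)

def outsource(rows):
    # The server sees only (row_id, bucket_id, opaque_payload).
    return [(rid, bucket_of(v), "enc(%d,%d)" % (rid, v)) for rid, v in rows]

def range_query(server_rows, lo, hi):
    wanted = set(range(bucket_of(lo), bucket_of(hi) + 1))
    candidates = [r for r in server_rows if r[1] in wanted]   # evaluated at the server
    # The client decrypts the candidates and discards false positives (not shown).
    return candidates

server = outsource([(1, 15), (2, 35), (3, 55), (4, 85)])
print(range_query(server, 30, 50))      # rows 2 and 3 come back; row 3 is a false positive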
Existing approaches for manipulating these datasets are incomplete: The performance of SQL queries for manipulating large numeric datasets is not competitive with specialized tools. Database extensions for processing multidimensional discrete data can only model regular, rectilinear grids. Visualization software libraries are designed to process gridded datasets efficiently, but no algebra has been developed to simplify their use and afford optimization. Further, these libraries are data dependent - physical changes to data representation or organization break user programs. In this paper, we present an algebra of grid-fields for manipulating both regular and irregular gridded datasets, algebraic optimization techniques, and an implementation backed by experimental results. We compare our techniques to those of spatial databases and visualization software libraries, using real examples from an Environmental Observation and Forecasting System. We find that our approach can express optimized plans inaccessible to other techniques, resulting in improved performance with reduced programming effort. VLDB CORDS: Automatic Generation of Correlation Statistics in DB2. Ihab F. Ilyas,Volker Markl,Peter J. Haas,Paul G. Brown,Ashraf Aboulnaga 2004 "When query optimizers erroneously assume that database columns are statistically independent, they can underestimate the selectivities of conjunctive predicates by orders of magnitude. Such underestimation often leads to drastically suboptimal query execution plans. We demonstrate cords, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between column pairs. We apply cords to real, synthetic, and TPC-H benchmark data, and show that cords discovers correlations in an efficient and scalable manner. The output of cords can be visualized graphically, making cords a useful mining and analysis tool for database administrators. cords ranks the discovered correlated column pairs and recommends to the optimizer a set of statistics to collect for the ""most important"" of the pairs. Use of these statistics speeds up processing times by orders of magnitude for a wide range of queries." VLDB Maintenance of Spatial Semijoin Queries on Moving Points. Glenn S. Iwerks,Hanan Samet,Kenneth P. Smith 2004 In this paper, we address the maintenance of spatial semijoin queries over continuously moving points, where points are modeled as linear functions of time. This is analogous to the maintenance of a materialized view except, as time advances, the query result may change independently of updates. As in a materialized view, we assume there is no prior knowledge of updates before they occur. We present a new approach, continuous fuzzy sets (CFS), to maintain continuous spatial semijoins efficiently. CFS is compared experimentally to a simple scaling of previous work. The result is significantly better performance of CFS compared to previous work by up to an order of magnitude in some cases. VLDB Query and Update Efficient B+-Tree Based Indexing of Moving Objects. Christian S. Jensen,Dan Lin,Beng Chin Ooi 2004 A number of emerging applications of data management technology involve the monitoring and querying of large quantities of continuous variables, e.g., the positions of mobile service users, termed moving objects. In such applications, large quantities of state samples obtained via sensors are streamed to a database. Indexes for moving objects must support queries efficiently, but must also support frequent updates. 
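The column-pair correlation discovery described in the CORDS entry above can be pictured with a toy chi-squared test over a sample of value pairs; the sampling policy, contingency-table test and data below are illustrative stand-ins, not the tool's actual statistics or thresholds.

    # Illustrative chi-squared dependence test for one column pair,
    # in the spirit of (but not identical to) the CORDS entry above.
    from collections import Counter

    def chi_squared(pairs):
        """pairs: sampled (value_a, value_b) tuples from two columns."""
        n = len(pairs)
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)
        right = Counter(b for _, b in pairs)
        stat = 0.0
        for a in left:
            for b in right:
                expected = left[a] * right[b] / n
                observed = joint.get((a, b), 0)
                stat += (observed - expected) ** 2 / expected
        return stat

    sample = [("CA", "San Jose")] * 40 + [("CA", "Los Angeles")] * 35 + \
             [("NY", "New York")] * 50 + [("NY", "Buffalo")] * 25
    print(chi_squared(sample))          # large value: city and state look correlated
    print(chi_squared([("x", i % 2) for i in range(100)]))  # ~0: independent-looking

A statistic that is large relative to the degrees of freedom suggests the pair is correlated, which is the kind of signal used to recommend joint statistics to the optimizer.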
Indexes based on minimum bounding regions (MBRs) such as the R-tree exhibit high concurrency overheads during node splitting, and each individual update is known to be quite costly. This motivates the design of a solution that enables the B+-tree to manage moving objects. We represent moving-object locations as vectors that are timestamped based on their update time. By applying a novel linearization technique to these values, it is possible to index the resulting values using a single B+-tree that partitions values according to their timestamp and otherwise preserves spatial proximity. We develop algorithms for range and k nearest neighbor queries, as well as continuous queries. The proposal can be grafted into existing database systems cost effectively. An extensive experimental study explores the performance characteristics of the proposal and also shows that it is capable of substantially outperforming the R-tree based TPR-tree for both single and concurrent access scenarios. VLDB GPX: Interactive Mining of Gene Expression Data. Daxin Jiang,Jian Pei,Aidong Zhang 2004 "Discovering co-expressed genes and coherent expression patterns in gene expression data is an important data analysis task in bioinformatics research and biomedical applications. Although various clustering methods have been proposed, two tough challenges still remain on how to integrate the users' domain knowledge and how to handle the high connectivity in the data. Recently, we have systematically studied the problem and proposed an effective approach [3]. In this paper, we describe a demonstration of GPX (for Gene Pattern eXplorer), an integrated environment for interactive exploration of coherent expression patterns and co-expressed genes in gene expression data. GPX integrates several novel techniques, including the coherent pattern index graph, a gene annotation panel, and a graphical interface, to adopt users' domain knowledge and support explorative operations in the clustering procedure. The GPX system as well as its techniques will be showcased, and the progress of GPX will be exemplified using several real-world gene expression data sets." VLDB Compressing Large Boolean Matrices using Reordering Techniques. David S. Johnson,Shankar Krishnan,Jatin Chhugani,Subodh Kumar,Suresh Venkatasubramanian 2004 "Large boolean matrices are a basic representational unit in a variety of applications, with some notable examples being interactive visualization systems, mining large graph structures, and association rule mining. Designing space and time efficient scalable storage and query mechanisms for such large matrices is a challenging problem. We present a lossless compression strategy to store and access such large matrices efficiently on disk. Our approach is based on viewing the columns of the matrix as points in a very high dimensional Hamming space, and then formulating an appropriate optimization problem that reduces to solving an instance of the Traveling Salesman Problem on this space. Finding good solutions to large TSPs in high dimensional Hamming spaces is itself a challenging and little-explored problem -- we cannot readily exploit geometry to avoid the need to examine all N^2 inter-city distances and instances can be too large for standard TSP codes to run in main memory. Our multi-faceted approach adapts classical TSP heuristics by means of instance-partitioning and sampling, and may be of independent interest. 
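The Hamming-space reordering idea in the boolean-matrix compression entry just above can be sketched with a greedy nearest-neighbour tour, a deliberately simple stand-in for the paper's partitioned and sampled TSP heuristics; the matrix and the transition-count proxy for compressibility are invented for the example.

    # Greedy nearest-neighbour ordering of boolean-matrix columns in Hamming space.
    # A toy stand-in for the TSP-based reordering described above.

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    def greedy_order(columns):
        # columns: list of equal-length bit tuples; returns a visiting order.
        remaining = list(range(len(columns)))
        order = [remaining.pop(0)]
        while remaining:
            last = columns[order[-1]]
            nxt = min(remaining, key=lambda j: hamming(last, columns[j]))
            remaining.remove(nxt)
            order.append(nxt)
        return order

    def transitions(matrix, order):
        # Bit flips along each row under a column order: a rough proxy for how
        # well run-length style encodings will do after reordering.
        return sum(sum(row[order[j]] != row[order[j + 1]] for j in range(len(order) - 1))
                   for row in matrix)

    matrix = [
        (1, 0, 1, 0, 1),
        (1, 0, 1, 0, 1),
        (0, 1, 0, 1, 0),
    ]
    cols = [tuple(row[j] for row in matrix) for j in range(len(matrix[0]))]
    order = greedy_order(cols)
    print(order, transitions(matrix, list(range(5))), transitions(matrix, order))
    # e.g. [0, 2, 4, 1, 3] 12 3 -- far fewer bit flips after reordering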
For instances derived from interactive visualization and telephone call data we obtain significant improvement in access time over standard techniques, and for the visualization application we also make significant improvements in compression." VLDB Data Sharing Through Query Translation in Autonomous Sources. Anastasios Kementsietsidis,Marcelo Arenas 2004 We consider the problem of data sharing between autonomous data sources in an environment where constraints cannot be placed on the shared contents of sources. Our solutions rely on the use of mapping tables which define how data from different sources are associated. In this setting, the answer to a local query, that is, a query posed against the schema of a single source, is augmented by retrieving related data from associated sources. This retrieval of data is achieved by translating, through mapping tables, the local query into a set of queries that are executed against the associated sources. We consider both sound translations (which only retrieve correct answers) and complete translations (which retrieve all correct answers, and no incorrect answers) and we present algorithms to compute such translations. Our solutions are implemented and tested experimentally and we describe here our key findings. VLDB Detecting Change in Data Streams. Daniel Kifer,Shai Ben-David,Johannes Gehrke 2004 Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques. VLDB Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. Christoph Koch,Stefanie Scherzinger,Nicole Schweikardt,Bernhard Stegmaier 2004 We introduce an extension of the XQuery language, FluX, that supports event-based query processing and the conscious handling of main memory buffers. Purely event-based queries of this language can be executed on streaming XML data in a very direct way. We then develop an algorithm that allows to efficiently rewrite XQueries into the event-based FluX language. This algorithm uses order constraints from a DTD to schedule event handlers and to thus minimize the amount of buffering required for evaluating a query. We discuss the various technical aspects of query optimization and query evaluation within our framework. This is complemented with an experimental evaluation of our approach. VLDB FluXQuery: An Optimizing XQuery Processor for Streaming XML Data. Christoph Koch,Stefanie Scherzinger,Nicole Schweikardt,Bernhard Stegmaier 2004 FluXQuery: An Optimizing XQuery Processor for Streaming XML Data. VLDB Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases. Mohammad R. Kolahdouzan,Cyrus Shahabi 2004 A frequent type of query in spatial networks (e.g., road networks) is to find the K nearest neighbors (KNN) of a given query object. With these networks, the distances between objects depend on their network connectivity and it is computationally expensive to compute the distances (e.g., shortest paths) between objects. 
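The two-window style of stream change detection described in the entry above (Kifer, Ben-David and Gehrke) can be caricatured as follows; the Kolmogorov-Smirnov distance, window size and alarm threshold are arbitrary illustrative choices rather than the paper's calibrated statistics.

    # Two-window change detection over a numeric stream, in the spirit of the
    # change-detection entry above. Illustrative distance and threshold only.
    import random

    def ks_distance(xs, ys):
        # Maximum gap between the two empirical CDFs.
        grid = sorted(set(xs) | set(ys))
        def cdf(sample, v):
            return sum(1 for s in sample if s <= v) / len(sample)
        return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

    def detect_changes(stream, window=50, threshold=0.4):
        reference, current, alarms = [], [], []
        for i, x in enumerate(stream):
            if len(reference) < window:
                reference.append(x)      # fill the reference window first
                continue
            current.append(x)
            if len(current) > window:
                current.pop(0)
            if len(current) == window and ks_distance(reference, current) > threshold:
                alarms.append(i)         # report a change, then restart both windows
                reference, current = [], []
        return alarms

    random.seed(0)
    stream = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(3, 1) for _ in range(300)]
    print(detect_changes(stream))   # expected: an alarm shortly after the shift at position 300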
In this paper, we propose a novel approach to efficiently and accurately evaluate KNN queries in spatial network databases using the first-order Voronoi diagram. This approach is based on partitioning a large network into small Voronoi regions, and then pre-computing distances both within and across the regions. By localizing the precomputation within the regions, we save on both storage and computation, and by performing across-the-network computation for only the border points of the neighboring regions, we avoid global pre-computation between every node-pair. Our empirical experiments with several real-world data sets show that our proposed solution outperforms approaches that are based on on-line distance computation by up to one order of magnitude, and provides a factor of four improvement in the selectivity of the filter step as compared to the index-based approaches. VLDB Flexible String Matching Against Large Databases in Practice. Nick Koudas,Amit Marathe,Divesh Srivastava 2004 Data Cleaning is an important process that has been at the center of research interest in recent years. Poor data quality is the result of a variety of reasons, including data entry errors and multiple conventions for recording database fields, and has a significant impact on a variety of business issues. Hence, there is a pressing need for technologies that enable flexible (fuzzy) matching of string information in a database. Cosine similarity with tf-idf is a well-established metric for comparing text, and recent proposals have adapted this similarity measure for flexibly matching a query string with values in a single attribute of a relation. In deploying tf-idf based flexible string matching against real AT&T databases, we observed that this technique needed to be enhanced in many ways. First, along the functionality dimension, there was a need to flexibly match along multiple string-valued attributes, and also to take advantage of known semantic equivalences. Second, we identified various performance enhancements to speed up the matching process, potentially trading off a small degree of accuracy for substantial performance gains. In this paper, we report on our techniques and experience in dealing with flexible string matching against real AT&T databases. VLDB Approximate NN queries on Streams with Guaranteed Error/performance Bounds. Nick Koudas,Beng Chin Ooi,Kian-Lee Tan,Rui Zhang 2004 In data stream applications, data arrive continuously and can only be scanned once as the query processor has very limited memory (relative to the size of the stream) to work with. Hence, queries on data streams do not have access to the entire data set and query answers are typically approximate. While there have been many studies on the k Nearest Neighbors (kNN) problem in conventional multi-dimensional databases, the solutions cannot be directly applied to data streams for the above reasons. In this paper, we investigate the kNN problem over data streams. We first introduce the ε-approximate kNN (εkNN) problem that finds the approximate kNN answers of a query point Q such that the absolute error of the k-th nearest neighbor distance is bounded by ε. To support εkNN queries over streams, we propose a technique called DISC (aDaptive Indexing on Streams by space-filling Curves). DISC can adapt to different data distributions to either (a) optimize memory utilization to answer εkNN queries under certain accuracy requirements or (b) achieve the best accuracy under a given memory constraint. 
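The space-filling-curve machinery behind DISC, just introduced, rests on mapping multidimensional points to one-dimensional keys; the Z-order (Morton) interleaving below is one standard such mapping and is shown only as an illustration, not as DISC's adaptive indexing scheme.

    # Z-order (Morton) keys: map 2-D points onto a 1-D curve so that a plain
    # ordered structure (e.g. a B+-tree) can index them. Illustrative sketch only;
    # DISC's adaptive memory management is not reproduced here.

    BITS = 16  # per coordinate; assumes coordinates already quantized to [0, 2^16)

    def interleave(x, y, bits=BITS):
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
            key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
        return key

    points = [(3, 5), (4, 4), (100, 200), (101, 199)]
    keyed = sorted((interleave(x, y), (x, y)) for x, y in points)
    for key, pt in keyed:
        print(f"{key:12d}  {pt}")
    # Nearby points tend to receive nearby keys, so an approximate kNN search can
    # examine a contiguous run of keys around the query's key and refine locally.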
At the same time, DISC provides efficient updates and query processing, which are important requirements in data stream applications. Extensive experiments were conducted using both synthetic and real data sets, and the results confirm the effectiveness and efficiency of DISC. VLDB CHICAGO: A Test and Evaluation Environment for Coarse-Grained Optimization. Tobias Kraft,Holger Schwarz 2004 Relational OLAP tools and other database applications generate sequences of SQL statements that are sent to the database server as the result of a single information request issued by a user. Coarse-Grained Optimization is a practical approach for the optimization of such statement sequences based on rewrite rules. In this demonstration we present the CHICAGO test and evaluation environment that allows one to assess the effectiveness of rewrite rules and control strategies. It includes a lightweight heuristic optimizer that modifies a given statement sequence using a small and variable set of rewrite rules. VLDB The Case for Precision Sharing. Sailesh Krishnamurthy,Michael J. Franklin,Joseph M. Hellerstein,Garrett Jacobson 2004 "Sharing has emerged as a key idea of static and adaptive stream query processing systems. Inherent in these systems is a tension between sharing common work and avoiding unnecessary work. Increased sharing has generally led to more unnecessary work. Our approach of precision sharing aims to share aggressively without unnecessary work. We show why ""adaptive"" tuple lineage is more generally applicable and use it for precisely shared static dataflows. We also show how ""static"" ordering constraints can be used for precision sharing in adaptive systems. Finally, we report an experimental study of precision sharing." VLDB Efficient XML-to-SQL Query Translation: Where to Add the Intelligence? Rajasekar Krishnamurthy,Raghav Kaushik,Jeffrey F. Naughton 2004 We consider the efficiency of queries generated by XML-to-SQL translation. We first show that published XML-to-SQL query translation algorithms are suboptimal in that they often translate simple path expressions into complex SQL queries even when much simpler equivalent SQL queries exist. There are two logical ways to deal with this problem. One could generate suboptimal SQL queries using a fairly naive translation algorithm, and then attempt to optimize the resulting SQL; or one could use a more intelligent translation algorithm with the hopes of generating efficient SQL directly. We show that optimizing the SQL after it is generated is problematic, becoming intractable even in simple scenarios; by contrast, designing a translation algorithm that exploits information readily available at translation time is a promising alternative. To support this claim, we present a translation algorithm that exploits translation time information to generate efficient SQL for path expression queries over tree schemas. VLDB Query Rewrite for XML in Oracle XML DB. Muralidhar Krishnaprasad,Zhen Hua Liu,Anand Manikutty,James W. Warner,Vikas Arora,Susan Kotsovolos 2004 "Oracle XML DB integrates XML storage and querying using the Oracle relational and object relational framework. It has the capability to physically store XML documents by shredding them as relational or object relational data, and creating logical XML documents using SQL/XML publishing functions. However, querying XML in a relational or object relational database poses several challenges. 
The biggest challenge is to efficiently process queries against XML in a database whose fundamental storage is table-based and whose fundamental query engine is tuple-oriented. In this paper, we present the 'XML Query Rewrite' technique used in Oracle XML DB. This technique integrates querying XML using XPath embedded inside SQL operators and SQL/XML publishing functions with the object relational and relational algebra. A common set of algebraic rules is used to reduce both XML and object queries into their relational equivalent. This enables a large class of XML queries over XML type tables and views to be transformed into their semantically equivalent relational or object relational queries. These queries are then amenable to classical relational optimisations yielding XML query performance comparable to relational. Furthermore, this rewrite technique lays out a foundation to enable rewrite of XQuery [1] over XML." VLDB On Testing Satisfiability of Tree Pattern Queries. Laks V. S. Lakshmanan,Ganesh Ramesh,Hui Wang,Zheng (Jessica) Zhao 2004 XPath and XQuery (which includes XPath as a sublanguage) are the major query languages for XML. An important issue arising in efficient evaluation of queries expressed in these languages is satisfiability, i.e., whether there exists a database, consistent with the schema if one is available, on which the query has a non-empty answer. Our experience shows satisfiability check can effect substantial savings in query evaluation. We systematically study satisfiability of tree pattern queries (which capture a useful fragment of XPath) together with additional constraints, with or without a schema. We identify cases in which this problem can be solved in polynomial time and develop novel efficient algorithms for this purpose. We also show that in several cases, the problem is NP-complete. We ran a comprehensive set of experiments to verify the utility of satisfiability check as a preprocessing step in query processing. Our results show that this check takes a negligible fraction of the time needed for processing the query while often yielding substantial savings. VLDB Query Languages and Data Models for Database Sequences and Data Streams. Yan-Nei Law,Haixun Wang,Carlo Zaniolo 2004 We study the fundamental limitations of relational algebra (RA) and SQL in supporting sequence and stream queries, and present effective query language and data model enrichments to deal with them. We begin by observing the well-known limitations of SQL in application domains which are important for data streams, such as sequence queries and data mining. Then we present a formal proof that, for continuous queries on data streams, SQL suffers from additional expressive power problems. We begin by focusing on the notion of nonblocking (NB) queries that are the only continuous queries that can be supported on data streams. We characterize the notion of nonblocking queries by showing that they are equivalent to monotonic queries. Therefore the notion of NB-completeness for RA can be formalized as its ability to express all monotonic queries expressible in RA using only the monotonic operators of RA. We show that RA is not NB-complete, and SQL is not more powerful than RA for monotonic queries. To solve these problems, we propose extensions that allow SQL to support all the monotonic queries expressible by a Turing machine using only monotonic operators. 
We show that these extensions are (i) user-defined aggregates (UDAs) natively coded in SQL (rather than in an external language), and (ii) a generalization of the union operator to support the merging of multiple streams according to their timestamps. These query language extensions require matching extensions to the basic relational data model to support sequences explicitly ordered by timestamps. Along with the formulation of very powerful queries, the proposed extensions entail more efficient expressions for many simple queries. In particular, we show that nonblocking queries are simple to characterize according to their syntactic structure. VLDB Limiting Disclosure in Hippocratic Databases. Kristen LeFevre,Rakesh Agrawal,Vuk Ercegovac,Raghu Ramakrishnan,Yirong Xu,David J. DeWitt 2004 We present a practical and efficient approach to incorporating privacy policy enforcement into an existing application and database environment, and we explore some of the semantic tradeoffs introduced by enforcing these privacy policy rules at cell-level granularity. Through a comprehensive set of performance experiments, we show that the cost of privacy enforcement is small, and scalable to large databases. VLDB High-Dimensional OLAP: A Minimal Cubing Approach. Xiaolei Li,Jiawei Han,Hector Gonzalez 2004 Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses. However, there exist data sets in applications like bioinformatics, statistics, and text processing that are characterized by high dimensionality, e.g., over 100 dimensions, and moderate size, e.g., around 10^6 tuples. No feasible data cube can be constructed with such data sets. In this paper we will address the problem of developing an efficient algorithm to perform OLAP on such data sets. Experience tells us that although data analysis tasks may involve a high dimensional space, most OLAP operations are performed only on a small number of dimensions at a time. Based on this observation, we propose a novel method that computes a thin layer of the data cube together with associated value-list indices. This layer, while being manageable in size, will be capable of supporting flexible and fast OLAP operations in the original high dimensional space. Through experiments we will show that the method has I/O costs that scale nicely with dimensionality. Furthermore, the costs are comparable to those of accessing an existing data cube when full materialization is possible. VLDB Schema-Free XQuery. Yunyao Li,Cong Yu,H. V. Jagadish 2004 "The widespread adoption of XML holds out the promise that document structure can be exploited to specify precise database queries. However, the user may have only a limited knowledge of the XML structure, and hence may be unable to produce a correct XQuery, especially in the context of a heterogeneous information collection. The default is to use keyword-based search and we are all too familiar with how difficult it is to obtain precise answers by these means. We seek to address these problems by introducing the notion of Meaningful Lowest Common Ancestor Structure (MLCAS) for finding related nodes within an XML document. By automatically computing MLCAS and expanding ambiguous tag names, we add new functionality to XQuery and enable users to take full advantage of XQuery in querying XML data precisely and efficiently without requiring (perfect) knowledge of the document structure. 
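The user-defined aggregates (UDAs) listed as extension (i) above, from the stream query language entry, follow an initialize/iterate/terminate pattern; the Python emulation below only illustrates why a nonblocking aggregate can run over an unbounded stream, and is not the SQL-native syntax the paper proposes.

    # Rough emulation of the INITIALIZE / ITERATE / TERMINATE structure of a
    # user-defined aggregate (UDA). A nonblocking UDA emits results while it
    # iterates, which is what makes it usable over an unbounded stream.
    # Illustrative only; the paper defines these in SQL, not Python.

    class RunningAvg:
        def initialize(self):
            self.count, self.total = 0, 0.0

        def iterate(self, value):
            self.count += 1
            self.total += value
            # Nonblocking: return a result for every tuple instead of waiting
            # for the (possibly never-arriving) end of the stream.
            return self.total / self.count

        def terminate(self):
            # A blocking aggregate would emit only here.
            return self.total / self.count if self.count else None

    uda = RunningAvg()
    uda.initialize()
    stream = [10, 20, 60, 10]
    print([uda.iterate(v) for v in stream])   # [10.0, 15.0, 30.0, 25.0]
    print(uda.terminate())                    # 25.0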
Such a Schema-Free XQuery is potentially of value not just to casual users with partial knowledge of schema, but also to experts working in a data integration or data evolution context. In such a context, a schema-free query, once written, can be applied universally to multiple data sources that supply similar content under different schemas, and applied ""forever"" as these schemas evolve. Our experimental evaluation found that it was possible to express a wide variety of queries in a schema-free manner and have them return correct results over a broad diversity of schemas. Furthermore, the evaluation of a schema-free query is not expensive using a novel stack-based algorithm we develop for computing MLCAS: from 1 to 4 times the execution time of an equivalent schema-aware query." VLDB Automating the design of multi-dimensional clustering tables in relational databases. Sam Lightstone,Bishwaranjan Bhattacharjee 2004 Automating the design of multi-dimensional clustering tables in relational databases. VLDB VizTree: a Tool for Visually Mining and Monitoring Massive Time Series Databases. Jessica Lin,Eamonn J. Keogh,Stefano Lonardi,Jeffrey P. Lankford,Donna M. Nystrom 2004 Moments before the launch of every space vehicle, engineering discipline specialists must make a critical go/no-go decision. The cost of a false positive, allowing a launch in spite of a fault, or a false negative, stopping a potentially successful launch, can be measured in the tens of millions of dollars, not including the cost in morale and other more intangible detriments. The Aerospace Corporation is responsible for providing engineering assessments critical to the go/no-go decision for every Department of Defense (DoD) launch vehicle. These assessments are made by constantly monitoring streaming telemetry data in the hours before launch. For this demonstration, we will introduce VizTree, a novel time-series visualization tool to aid the Aerospace analysts who must make these engineering assessments. VizTree was developed at the University of California, Riverside and is unique in that the same tool is used for mining archival data and monitoring incoming live telemetry. Unlike other time series visualization tools, VizTree can scale to very large databases, giving it the potential to be a generally useful data mining and database tool. VLDB LH*RS: A Highly Available Distributed Data Storage. Witold Litwin,Rim Moussa,Thomas J. E. Schwarz 2004 The ideal storage system is always available and incrementally expandable. Existing storage systems fall far from this ideal. Affordable computers and high-speed networks allow us to investigate storage architectures closer to the ideal. Our demo, present a prototype implementation of LH*RS: a highly available scalable and distributed data structure. VLDB The Design of GridDB: A Data-Centric Overlay for the Scientific Grid. David T. Liu,Michael J. Franklin 2004 The Design of GridDB: A Data-Centric Overlay for the Scientific Grid. VLDB A-ToPSS: A Publish/Subscribe System Supporting Imperfect Information Processing. Haifeng Liu,Hans-Arno Jacobsen 2004 A-ToPSS: A Publish/Subscribe System Supporting Imperfect Information Processing. VLDB Enhancing P2P File-Sharing with an Internet-Scale Query Processor. Boon Thau Loo,Joseph M. Hellerstein,Ryan Huebsch,Scott Shenker,Ion Stoica 2004 "In this paper, we address the problem of designing a scalable, accurate query processor for peer-to-peer filesharing and similar distributed keyword search systems. 
Using a globally-distributed monitoring infrastructure, we perform an extensive study of the Gnutella filesharing network, characterizing its topology, data and query workloads. We observe that Gnutella's query processing approach performs well for popular content, but quite poorly for rare items with few replicas. We then consider an alternate approach based on Distributed Hash Tables (DHTs). We describe our implementation of PIERSearch, a DHT-based system, and propose a hybrid system where Gnutella is used to locate popular items, and PIERSearch for handling rare items. We develop an analytical model of the two approaches, and use it in concert with our Gnutella traces to study the trade-off between query recall and system overhead of the hybrid system. We evaluate a variety of localized schemes for identifying items that are rare and worth handling via the DHT. Lastly, we show in a live deployment on fifty nodes on two continents that it nicely complements Gnutella in its ability to handle rare items." VLDB Sybase IQ Multiplex - Designed For Analytics. Roger MacNicol,Blaine French 2004 The internal design of database systems has traditionally given primacy to the needs of transactional data. A radical re-evaluation of the internal design giving primacy to the needs of complex analytics shows clear benefits in large databases for both single servers and multinode shared-disk grid computing. This design supports the trend to keep more years of more finely grained data online by ameliorating the data explosion problem. VLDB Cache-Conscious Radix-Decluster Projections. Stefan Manegold,Peter A. Boncz,Niels Nes 2004 Cache-Conscious Radix-Decluster Projections. VLDB An Injection of Tree Awareness: Adding Staircase Join to PostgreSQL. Sabine Mayer,Torsten Grust,Maurice van Keulen,Jens Teubner 2004 An Injection of Tree Awareness: Adding Staircase Join to PostgreSQL. VLDB Indexing Temporal XML Documents. Alberto O. Mendelzon,Flavio Rizzolo,Alejandro A. Vaisman 2004 Different models have been proposed recently for representing temporal data in XML documents, tracking historical information, and recovering the state of the document as of any given time. We address the problem of indexing temporal XML documents. In particular we show that by indexing continuous paths, i.e., paths that are valid continuously during a certain interval in a temporal XML graph, we can dramatically increase query performance. We describe in detail the indexing scheme, denoted TempIndex, and compare its performance against both a system based on a nontemporal path index, and one based on DOM. VLDB PLACE: A Query Processor for Handling Real-time Spatio-temporal Data Streams. Mohamed F. Mokbel,Xiaopeng Xiong,Walid G. Aref,Susanne E. Hambrusch,Sunil Prabhakar,Moustafa A. Hammad 2004 The emergence of location-aware services calls for new real-time spatio-temporal query processing algorithms that deal with large numbers of mobile objects and queries. In this demo, we present PLACE (Pervasive Location-Aware Computing Environments), a scalable location-aware database server developed at Purdue University. The PLACE server addresses scalability by adopting an incremental evaluation mechanism for answering concurrently executing continuous spatio-temporal queries. The PLACE server supports a wide variety of stationary and moving continuous spatio-temporal queries through a set of pipelined spatio-temporal operators. The large numbers of moving objects generate real-time spatio-temporal data streams. 
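As a generic illustration of the incremental evaluation style that the PLACE entry above describes (maintain each continuous query's answer and report only the changes as location updates stream in), consider the toy server below; the brute-force matching, the delta format and all names are invented for the example and do not reflect PLACE's pipelined operators.

    # Generic illustration of incrementally maintained continuous range queries
    # over moving-object location updates. Not PLACE's implementation.

    class ContinuousRangeQueries:
        def __init__(self):
            self.queries = {}    # qid -> (xmin, ymin, xmax, ymax)
            self.answers = {}    # qid -> set of object ids currently inside
            self.locations = {}  # oid -> (x, y)

        def register(self, qid, rect):
            self.queries[qid] = rect
            self.answers[qid] = {oid for oid, p in self.locations.items()
                                 if self._inside(p, rect)}

        def update(self, oid, x, y):
            """Apply one location update and return only the answer changes."""
            self.locations[oid] = (x, y)
            deltas = []
            for qid, rect in self.queries.items():
                inside = self._inside((x, y), rect)
                was_inside = oid in self.answers[qid]
                if inside and not was_inside:
                    self.answers[qid].add(oid)
                    deltas.append((qid, "+", oid))
                elif was_inside and not inside:
                    self.answers[qid].remove(oid)
                    deltas.append((qid, "-", oid))
            return deltas

        @staticmethod
        def _inside(p, rect):
            x, y = p
            xmin, ymin, xmax, ymax = rect
            return xmin <= x <= xmax and ymin <= y <= ymax

    server = ContinuousRangeQueries()
    server.register("q1", (0, 0, 10, 10))
    print(server.update("car7", 5, 5))    # [('q1', '+', 'car7')]
    print(server.update("car7", 6, 6))    # []  (no change to report)
    print(server.update("car7", 50, 50))  # [('q1', '-', 'car7')]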
VLDB BioPatentMiner: An Information Retrieval System for BioMedical Patents. Sougata Mukherjea,Bhuvan Bamba 2004 Before undertaking new biomedical research, identifying concepts that have already been patented is essential. Traditional keyword based search on patent databases may not be sufficient to retrieve all the relevant information, especially for the biomedical domain. More sophisticated retrieval techniques are required. This paper presents BioPatentMiner, a system that facilitates information retrieval from biomedical patents. It integrates information from the patents with knowledge from biomedical ontologies to create a Semantic Web. Besides keyword search and queries linking the properties specified by one or more RDF triples, the system can discover Semantic Associations between the resources. The system also determines the importance of the resources to rank the results of a search and prevent information overload while determining the Semantic Associations. VLDB Database Challenges in the Integration of Biomedical Data Sets. Rakesh Nagarajan,Mushtaq Ahmed,Aditya Phatak 2004 The clinical and basic science research domains present exciting and difficult data integration issues. Solving these problems is crucial as current research efforts in the field of biomedicine heavily depend upon integrated storage, querying, analysis, and visualization of clinicopathology information, genomic annotation, and large scale functional genomic research data sets. Such large scale experimental analyses are essential to decipher the pathophysiological processes occurring in most human diseases so that they may be effectively treated. In this paper, we discuss the challenges of integration of multiple biomedical data sets not only at the university level but also at the national level and present the data warehousing based solution we have employed at Washington University School of Medicine. We also describe the tools we have developed to store, query, analyze, and visualize these data sets together. VLDB A Combined Framework for Grouping and Order Optimization. Thomas Neumann,Guido Moerkotte 2004 Since the introduction of cost-based query optimization by Selinger et al. in their seminal paper, the performance-critical role of interesting orders has been recognized. Some algebraic operators change interesting orders (e.g. sort and select), while others exploit them (e.g. merge join). Likewise, Wang and Cherniack (VLDB 2003) showed that existing groupings should be exploited to avoid redundant grouping operations. Ideally, the reasoning about interesting orderings and groupings should be integrated into one framework. So far, no complete, correct, and efficient algorithm for ordering and grouping inference has been proposed. We fill this gap by proposing a general two-phase approach that efficiently integrates the reasoning about orderings and groupings. Our experimental results show that with a modest increase of the time and space requirements of the preprocessing phase both orderings and groupings can be handled at the same time. More importantly, there is no additional cost for the second phase during which the plan generator changes and exploits orderings and groupings by adding operators to subplans. VLDB "Trends in Data Warehousing: A Practitioner's View." "William O'Connell" 2004 This talk will present emerging data warehousing reference architectures, and focus on trends and directions that are shaping these enterprise installations. 
Implications of both new and old technology will be highlighted. Seamless integration across the stack is also pivotal to success, which has significant implications for areas such as metadata. VLDB "Where is Business Intelligence taking today's Database Systems?" "William O'Connell,Andrew Witkowski,Ramesh Bhashyam,Surajit Chaudhuri" 2004 "Where is Business Intelligence taking today's Database Systems?" VLDB Indexing XML Data Stored in a Relational Database. Shankar Pal,Istvan Cseri,Gideon Schaller,Oliver Seeliger,Leo Giakoumakis,Vasili Zolotov 2004 As XML usage grows for both data-centric and document-centric applications, introducing native support for XML data in relational databases brings significant benefits. It provides a more mature platform for the XML data model and serves as the basis for interoperability between relational and XML data. Whereas query processing on XML data shredded into one or more relational tables is well understood, it provides limited support for the XML data model. XML data can be persisted as a byte sequence (BLOB) in columns of tables to support the XML model more faithfully. This introduces new challenges for query processing such as the ability to index the XML blob for good query performance. This paper reports novel techniques for indexing XML data in the upcoming version of Microsoft® SQL Server™, and how it ties into the relational framework for query processing. VLDB Indexing Large Human-Motion Databases. Eamonn J. Keogh,Themis Palpanas,Victor B. Zordan,Dimitrios Gunopulos,Marc Cardle 2004 Data-driven animation has become the industry standard for computer games and many animated movies and special effects. In particular, motion capture data recorded from live actors is the most promising approach offered thus far for animating realistic human characters. However, the manipulation of such data for general use and re-use is not yet a solved problem. Many of the existing techniques dealing with editing motion rely on indexing for annotation, segmentation, and re-ordering of the data. Euclidean distance is inappropriate for solving these indexing problems because of the inherent variability found in human motion. The limitations of Euclidean distance stem from the fact that it is very sensitive to distortions in the time axis. A partial solution to this problem, Dynamic Time Warping (DTW), aligns the time axis before calculating the Euclidean distance. However, DTW can only address the problem of local scaling. As we demonstrate in this paper, global or uniform scaling is just as important in the indexing of human motion. We propose a novel technique to speed up similarity search under uniform scaling, based on bounding envelopes. Our technique is intuitive and simple to implement. We describe algorithms that make use of this technique, we perform an experimental analysis with real datasets, and we evaluate it in the context of a motion capture processing system. The results demonstrate the utility of our approach, and show that we can achieve orders of magnitude of speedup over the brute force approach, the only alternative solution currently available. VLDB WIC: A General-Purpose Algorithm for Monitoring Web Information Sources. Sandeep Pandey,Kedar Dhamdhere,Christopher Olston 2004 "The Web is becoming a universal information dissemination medium, due to a number of factors including its support for content dynamicity. 
A growing number of Web information providers post near real-time updates in domains such as auctions, stock markets, bulletin boards, news, weather, roadway conditions, sports scores, etc. External parties often wish to capture this information for a wide variety of purposes ranging from online data mining to automated synthesis of information from multiple sources. There has been a great deal of work on the design of systems that can process streams of data from Web sources, but little attention has been paid to how to produce these data streams, given that Web pages generally require ""pull-based"" access. In this paper we introduce a new general-purpose algorithm for monitoring Web information sources, effectively converting pull-based sources into push-based ones. Our algorithm can be used in conjunction with continuous query systems that assume information is fed into the query engine in a push-based fashion. Ideally, a Web monitoring algorithm for this purpose should achieve two objectives: (1) timeliness and (2) completeness of information captured. However, we demonstrate both analytically and empirically using real-world data that these objectives are fundamentally at odds. When resources available for Web monitoring are limited, and the number of sources to monitor is large, it may be necessary to sacrifice some timeliness to achieve better completeness, or vice versa. To take this fact into account, our algorithm is highly parameterized and targets an application-specified balance between timeliness and completeness. In this paper we formalize the problem of optimizing for a flexible combination of timeliness and completeness, and prove that our parameterized algorithm is a 2- approximation in all cases, and in certain cases is optimal." VLDB Generating Thousand Benchmark Queries in Seconds. Meikel Poess,John M. Stephens 2004 Generating Thousand Benchmark Queries in Seconds. VLDB Progressive Optimization in Action. Vijayshankar Raman,Volker Markl,David E. Simmen,Guy M. Lohman,Hamid Pirahesh 2004 "Progressive Optimization (POP) is a technique to make query plans robust, and minimize need for DBA intervention, by repeatedly re-optimizing a query during runtime if the cardinalities estimated during optimization prove to be significantly incorrect. POP works by carefully calculating validity ranges for each plan operator under which the overall plan can be optimal. POP then instruments the query plan with checkpoints that validate at runtime that cardinalities do lie within validity ranges, and re-optimizes the query otherwise. In this demonstration we showcase POP implemented for a research prototype version of IBM's DB2 DBMS, using a mix of real-world and synthetic benchmark databases and workloads. For selected queries of the workload we display the query plans with validity ranges as well as the placement of the various kinds of CHECK operators using the DB2 graphical plan explain tool. We also execute the queries, showing how and where re-optimization is triggered through the CHECK operators, the new plan generated upon re-optimization, and the extent to which previously computed intermediate results are reused." VLDB A Multi-Purpose Implementation of Mandatory Access Control in Relational Database Management Systems. Walid Rjaibi,Paul Bird 2004 Mandatory Access Control (MAC) implementations in Relational Database Management Systems (RDBMS) have focused solely on Multilevel Security (MLS). 
MLS has posed a number of challenging problems to the database research community, and there has been an abundance of research work to address those problems. Unfortunately, the use of MLS RDBMS has been restricted to a few government organizations where MLS is of paramount importance such as the intelligence community and the Department of Defense. The implication of this is that the investment of building an MLS RDBMS cannot be leveraged to serve the needs of application domains where there is a desire to control access to objects based on the label associated with that object and the label associated with the subject accessing that object, but where the label access rules and the label structure do not necessarily match the MLS two security rules and the MLS label structure. This paper introduces a flexible and generic implementation of MAC in RDBMS that can be used to address the requirements from a variety of application domains, as well as to allow an RDBMS to efficiently take part in an end-to-end MAC enterprise solution. The paper also discusses the extensions made to the SQL compiler component of an RDBMS to incorporate the label access rules in the access plan it generates for an SQL query, and to prevent unauthorized leakage of data that could occur as a result of traditional optimization techniques performed by SQL compilers. VLDB Security of Shared Data in Large Systems: State of the Art and Research Directions. Arnon Rosenthal,Marianne Winslett 2004 "The goals of this tutorial are to enlighten the VLDB research community about the state of the art in data security, especially for enterprise or larger systems, and to engage the community's interest in improving the state of the art. The tutorial includes numerous suggested topics for research and development projects in data security." VLDB Symmetric Relations and Cardinality-Bounded Multisets in Database Systems. Kenneth A. Ross,Julia Stoyanovich 2004 In a binary symmetric relationship, A is related to B if and only if B is related to A. Symmetric relationships between k participating entities can be represented as multisets of cardinality k. Cardinality-bounded multisets are natural in several real-world applications. Conventional representations in relational databases suffer from several consistency and performance problems. We argue that the database system itself should provide native support for cardinality-bounded multisets. We provide techniques to be implemented by the database engine that avoid the drawbacks, and allow a schema designer to simply declare a table to be symmetric in certain attributes. We describe a compact data structure, and update methods for the structure. We describe an algebraic symmetric closure operator, and show how it can be moved around in a query plan during query optimization in order to improve performance. We describe indexing methods that allow efficient lookups on the symmetric columns. We show how to perform database normalization in the presence of symmetric relations. We provide techniques for inferring that a view is symmetric. We also describe a syntactic SQL extension that allows the succinct formulation of queries over symmetric relations. VLDB CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. Elke A. Rundensteiner,Luping Ding,Timothy M. Sutherland,Yali Zhu,Bradford Pielech,Nishant Mehta 2004 CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Green Query Optimization: Taming Query Optimization Overheads through Plan Recycling. 
Parag Sarda,Jayant R. Haritsa 2004 PLASTIC [1] is a recently-proposed tool to help query optimizers significantly amortize optimization overheads through a technique of plan recycling. The tool groups similar queries into clusters and uses the optimizer-generated plan for the cluster representative to execute all future queries assigned to the cluster. An earlier demo [2] presented a basic prototype implementation of PLASTIC. We have now significantly extended the scope, usability, and efficiency of PLASTIC, by incorporating a variety of new features, including an enhanced query feature vector, variable-sized clustering and a decision-tree-based query classifier. The demo of the upgraded PLASTIC tool is shown on commercial database platforms (IBM DB2 and Oracle 9i). VLDB QStream: Deterministic Querying of Data Streams. Sven Schmidt,Henrike Berthold,Wolfgang Lehner 2004 Current developments in processing data streams are based on the best-effort principle and are therefore not adequate for many application areas. When sensor data is gathered by interface hardware and is used for triggering data-dependent actions, the data has to be queried and processed not only in an efficient but also in a deterministic way. Our streaming system prototype embodies novel data processing techniques. It is based on an operator component model and runs on top of a real-time capable environment. This enables us to provide real Quality-of-Service for data stream queries. VLDB Clotho: Decoupling memory page layout from storage organization. Minglong Shao,Jiri Schindler,Steven W. Schlosser,Anastassia Ailamaki,Gregory R. Ganger 2004 As database application performance depends on the utilization of the memory hierarchy, smart data placement plays a central role in increasing locality and in improving memory utilization. Existing techniques, however, do not optimize accesses to all levels of the memory hierarchy and for all the different workloads, because each storage level uses different technology (cache, memory, disks) and each application accesses data using different patterns. Clotho is a new buffer pool and storage management architecture that decouples in-memory page layout from data organization on non-volatile storage devices to enable independent data layout design at each level of the storage hierarchy. Clotho can maximize cache and memory utilization by (a) transparently using appropriate data layouts in memory and non-volatile storage, and (b) dynamically synthesizing data pages to follow application access patterns at each level as needed. Clotho creates in-memory pages individually tailored for compound and dynamically changing workloads, and enables efficient use of different storage technologies (e.g., disk arrays or MEMS-based storage devices). This paper describes the Clotho design and prototype implementation and evaluates its performance under a variety of workloads using both disk arrays and simulated MEMS-based storage devices. VLDB AIDA: an Adaptive Immersive Data Analyzer. Mehdi Sharifzadeh,Cyrus Shahabi,Bahareh Navai,Farid Parvini,Albert A. Rizzo 2004 "In this demonstration, we show various querying capabilities of an application called AIDA. AIDA was developed to help study attention disorders in kids. In a different study [1], we collected several immersive sensory data streams from kids monitored in an immersive application called the virtual classroom. This dataset, termed immersidata, is used to analyze the behavior of kids in the virtual classroom environment. 
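The plan-recycling workflow in the PLASTIC entry above (map a query to a feature vector, find a nearby cluster, and reuse the plan optimized for that cluster's representative) can be caricatured as follows; the feature vector, distance threshold and optimize() stub are illustrative stand-ins, not the tool's enhanced features or decision-tree classifier.

    # Caricature of plan recycling: queries whose feature vectors fall close to an
    # existing cluster reuse that cluster's cached plan instead of being optimized
    # again. Features, threshold and the optimize() stub are illustrative only.
    import math

    def features(query):
        # Hypothetical feature vector: (number of joined tables, number of predicates).
        return (len(query["tables"]), len(query["predicates"]))

    class PlanCache:
        def __init__(self, threshold=1.0):
            self.clusters = []           # list of (representative_features, plan)
            self.threshold = threshold

        def get_plan(self, query, optimize):
            f = features(query)
            for rep, plan in self.clusters:
                if math.dist(f, rep) <= self.threshold:
                    return plan, "recycled"
            plan = optimize(query)       # expensive call, done once per cluster
            self.clusters.append((f, plan))
            return plan, "optimized"

    def optimize(query):
        return f"plan-for-{len(query['tables'])}-way-join"

    cache = PlanCache()
    q1 = {"tables": ["orders", "customers"], "predicates": ["o.ck = c.ck"]}
    q2 = {"tables": ["orders", "suppliers"], "predicates": ["o.sk = s.sk"]}
    print(cache.get_plan(q1, optimize))  # ('plan-for-2-way-join', 'optimized')
    print(cache.get_plan(q2, optimize))  # ('plan-for-2-way-join', 'recycled')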
AIDA's database stores all the geometry of the objects in the virtual classroom environment and their spatio-temporal behavior. In addition, it stores all the immersidata collected from the kids experimenting with the application. AIDA's graphical user interface then supports various spatio-temporal queries on these datasets. Moreover, AIDA replays the immersidata streams as if they were being collected in real time and supports various continuous queries on them. This demonstration is a proof-of-concept prototype of the design and development of a domain-specific query and analysis application over users' interaction data with immersive environments." VLDB Efficiency-Quality Tradeoffs for Vector Score Aggregation. Pavan Kumar C. Singitham,Mahathi S. Mahabhashyam,Prabhakar Raghavan 2004 Finding the l nearest neighbors to a query in a vector space is an important primitive in text and image retrieval. Here we study an extension of this problem with applications to XML and image retrieval: we have multiple vector spaces, and the query places a weight on each space. Match scores from the spaces are weighted by these weights to determine the overall match between each record and the query; this is a case of score aggregation. We study approximation algorithms that use a small fraction of the computation of exhaustive search through all records, while returning nearly the best matches. We focus on the tradeoff between the computation and the quality of the results. We develop two approaches to retrieval from such multiple vector spaces. The first is inspired by resource allocation. The second, inspired by computational geometry, combines the multiple vector spaces together with all possible query weights into a single larger space. While mathematically elegant, this abstraction is intractable for implementation. We therefore devise an approximation of this combined space. Experiments show that all our approaches (to varying extents) enable retrieval quality comparable to exhaustive search, while avoiding its heavy computational cost. VLDB Resilient Rights Protection for Sensor Streams. Radu Sion,Mikhail J. Atallah,Sunil Prabhakar 2004 "Today's world of increasingly dynamic computing environments naturally results in more and more data being available as fast streams. Applications such as stock market analysis, environmental sensing, web clicks and intrusion detection are just a few of the examples where valuable data is streamed. Often, streaming information is offered on the basis of a non-exclusive, single-use customer license. One major concern, especially given the digital nature of the valuable stream, is the ability to easily record and potentially ""re-play"" parts of it in the future. If there is value associated with such future re-plays, it could constitute enough incentive for a malicious customer (Mallory) to duplicate segments of such recorded data, subsequently re-selling them for profit. Being able to protect against such infringements becomes a necessity. In this paper we introduce the issue of rights protection for discrete streaming data through watermarking. This is a novel problem with many associated challenges, including operating in a finite-window, single-pass, (possibly) high-speed streaming model, surviving natural domain-specific transforms and attacks (e.g., extreme sparse sampling and summarizations), while at the same time keeping data alterations within allowable bounds. 
We propose a solution and analyze its resilience to various types of attacks as well as some of the important expected domain-specific transforms, such as sampling and summarization. We implement a proof of concept software (wms.*) and perform experiments on real sensor data from the NASA Infrared Telescope Facility at the University of Hawaii, to assess encoding resilience levels in practice. Our solution proves to be well suited for this new domain. For example, we can recover an over 97% confidence watermark from a highly down-sampled (e.g. less than 8%) stream or survive stream summarization (e.g. 20%) and random alteration attacks with very high confidence levels, often above 99%." VLDB The Complexity of Fully Materialized Coalesced Cubes. Yannis Sismanis,Nick Roussopoulos 2004 The Complexity of Fully Materialized Coalesced Cubes. VLDB Trust-Serv: A Lightweight Trust Negotiation Service. Halvard Skogsrud,Boualem Benatallah,Fabio Casati,Manh Q. Dinh 2004 Trust-Serv: A Lightweight Trust Negotiation Service. VLDB Tamper Detection in Audit Logs. Richard T. Snodgrass,Shilong (Stanley) Yao,Christian S. Collberg 2004 "Audit logs are considered good practice for business systems, and are required by federal regulations for secure systems, drug approval data, medical information disclosure, financial records, and electronic voting. Given the central role of audit logs, it is critical that they are correct and inalterable. It is not sufficient to say, ""our data is correct, because we store all interactions in a separate audit log."" The integrity of the audit log itself must also be guaranteed. This paper proposes mechanisms within a database management system (DBMS), based on cryptographically strong one-way hash functions, that prevent an intruder, including an auditor or an employee or even an unknown bug within the DBMS itself, from silently corrupting the audit log. We propose that the DBMS store additional information in the database to enable a separate audit log validator to examine the database along with this extra information and state conclusively whether the audit log has been compromised. We show with an implementation on a high-performance storage engine that the overhead for auditing is low and that the validator can efficiently and correctly determine if the audit log has been compromised." VLDB Memory-Limited Execution of Windowed Stream Joins. Utkarsh Srivastava,Jennifer Widom 2004 We address the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result tuples as possible. An alternative scenario is to provide a random sample of the join result, e.g., if the output of the join is being aggregated. We show formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams. Previous work has addressed only the maximum-subset problem, and has implicitly used a frequency-based model of stream arrival. We address the sampling problem for this model. More importantly, we point out a broad class of applications for which an age-based model of stream arrival is more appropriate, and we address both approximation scenarios under this new model. 
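The tamper-detection entry above (Snodgrass, Yao and Collberg) rests on cryptographically strong one-way hashing of the audit trail; the sketch below shows the basic hash-chain idea with SHA-256 as an assumed stand-in, and omits the paper's DBMS integration, notarization and separate validator infrastructure.

    # Minimal hash-chained audit log. Each entry's digest covers the previous
    # digest, so an after-the-fact modification is detectable by re-walking the
    # chain. Illustrative sketch only.
    import hashlib

    def link(prev_digest, record):
        return hashlib.sha256(prev_digest + record.encode()).hexdigest().encode()

    def append(log, record):
        prev = log[-1][1] if log else b"genesis"
        log.append((record, link(prev, record)))

    def validate(log):
        prev = b"genesis"
        for i, (record, digest) in enumerate(log):
            if link(prev, record) != digest:
                return f"tampering detected at entry {i}"
            prev = digest
        return "audit log intact"

    log = []
    append(log, "2004-06-01 UPDATE salary SET amount=5000 WHERE id=7")
    append(log, "2004-06-02 DELETE FROM orders WHERE id=42")
    print(validate(log))                 # audit log intact
    log[0] = ("2004-06-01 UPDATE salary SET amount=9000 WHERE id=7", log[0][1])
    print(validate(log))                 # tampering detected at entry 0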
Finally, for the case of multiple joins being executed with an overall memory constraint, we provide an algorithm for memory allocation across the joins that optimizes a combined measure of approximation in all scenarios considered. All of our algorithms are implemented and experimental results demonstrate their effectiveness. VLDB The Bloomba Personal Content Database. Raymie Stata,Patrick Hunt,Thiruvalluvan M. G. 2004 The Bloomba Personal Content Database. VLDB Semantic Query Optimization in an Automata-Algebra Combined XQuery Engine over XML Streams. Hong Su,Elke A. Rundensteiner,Murali Mani 2004 Semantic Query Optimization in an Automata-Algebra Combined XQuery Engine over XML Streams. VLDB Answering XPath Queries over Networks by Sending Minimal Views. Keishi Tajima,Yoshiki Fukui 2004 When a client submits a set of XPath queries to a XML database on a network, the set of answer sets sent back by the database may include redundancy in two ways: some elements may appear in more than one answer set, and some elements in some answer sets may be subelements of other elements in other (or the same) answer sets. Even when a client submits a single query, the answer can be self-redundant because some elements may be subelements of other elements in that answer. Therefore, sending those answers as they are is not optimal with respect to communication costs. In this paper, we propose a method of minimizing communication costs in XPath processing over networks. Given a single or a set of queries, we compute a minimal-size view set that can answer all the original queries. The database sends this view set to the client, and the client produces answers from it. We show algorithms for computing such a minimal view set for given queries. This view set is optimal; it only includes elements that appear in some of the final answers, and each element appears only once. VLDB Reverse kNN Search in Arbitrary Dimensionality. Yufei Tao,Dimitris Papadias,Xiang Lian 2004 Given a point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: (i) they do not support arbitrary values of k (ii) they cannot deal efficiently with database updates, (iii) they are applicable only to 2D data (but not to higher dimensionality), and (iv) they retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact processing of RkNN with arbitrary values of k on dynamic multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any pre-computation. In addition to their flexibility, we experimentally verify that the proposed algorithms outperform the existing ones even in their restricted focus. VLDB Practical Suffix Tree Construction. Sandeep Tata,Richard A. Hankins,Jignesh M. Patel 2004 "Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. 
First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms." VLDB SVT: Schema Validation Tool for Microsoft SQL-Server. Ernest Teniente,Carles Farré,Toni Urpí,Carlos Beltrán,David Gañán 2004 We present SVT, a tool for validating database schemas in SQL Server. This is done by means of testing desirable properties that a database schema should satisfy. To our knowledge, no commercial relational DBMS yet provides a tool able to perform this kind of validation. VLDB Top-k Query Evaluation with Probabilistic Guarantees. Martin Theobald,Gerhard Weikum,Ralf Schenkel 2004 "Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin's threshold algorithm (TA). Since the user's goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximate variants of TA to reduce run-time costs. This paper introduces a family of approximate top-k algorithms based on probabilistic arguments. When scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. The precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection." VLDB AWESOME - A Data Warehouse-based System for Adaptive Website Recommendations. Andreas Thor,Erhard Rahm 2004 Recommendations are crucial for the success of large websites. While there are many ways to determine recommendations, the relative quality of these recommenders depends on many factors and is largely unknown. We propose a new classification of recommenders and comparatively evaluate their relative quality for a sample website. The evaluation is performed with AWESOME (Adaptive website recommendations), a new data warehouse-based recommendation system capturing and evaluating user feedback on presented recommendations. Moreover, we show how AWESOME performs an automatic and adaptive closed-loop website optimization by dynamically selecting the most promising recommenders based on continuously measured recommendation feedback. We propose and evaluate several alternatives for dynamic recommender selection including a powerful machine learning approach. VLDB Biological Data Management: Research, Practice and Opportunities. Thodoros Topaloglou,Susan B. Davidson,H. V. Jagadish,Victor M. Markowitz,Evan W. Steeg,Mike Tyers 2004 Biological Data Management: Research, Practice and Opportunities. VLDB BilVideo Video Database Management System. Özgür Ulusoy,Ugur Güdükbay,Mehmet Emin Dönderler,Ediz Saykol,Cemil Alper 2004 A prototype video database management system, which we call BilVideo, is presented.
BilVideo provides an integrated support for queries on spatio-temporal, semantic and low-level features (color, shape, and texture) on video data. BilVideo does not target a specific application, and thus, it can be used to support any application with video data. An example application, news archives search system, is presented with some sample queries. VLDB Computing PageRank in a Distributed Internet Search Engine System. Yuan Wang,David J. DeWitt 2004 Computing PageRank in a Distributed Internet Search Engine System. VLDB Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. Wei Wang,Haifeng Jiang,Hongjun Lu,Jeffrey Xu Yu 2004 Cost-based XML query optimization calls for accurate estimation of the selectivity of path expressions. Some other interactive and internet applications can also benefit from such estimations. While there are a number of estimation techniques proposed in the literature, almost none of them has any guarantee on the estimation accuracy within a given space limit. In addition, most of them assume that the XML data are more or less static, i.e., with few updates. In this paper, we present a framework for XML path selectivity estimation in a dynamic context. Specifically, we propose a novel data structure, bloom histogram, to approximate XML path frequency distribution within a small space budget and to estimate the path selectivity accurately with the bloom histogram. We obtain the upper bound of its estimation error and discuss the trade-offs between the accuracy and the space limit. To support updates of bloom histograms efficiently when underlying XML data change, a dynamic summary layer is used to keep exact or more detailed XML path information. We demonstrate through our extensive experiments that the new solution can achieve significantly higher accuracy with an even smaller space than the previous methods in both static and dynamic environments. VLDB Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Jiying Wang,Ji-Rong Wen,Frederick H. Lochovsky,Wei-Ying Ma 2004 In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall. VLDB On the performance of bitmap indices for high cardinality attributes. Kesheng Wu,Ekow J. Otoo,Arie Shoshani 2004 It is well established that bitmap indices are efficient for read-only attributes with low attribute cardinalities. 
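A minimal Python sketch of the basic structure, one bit vector per distinct attribute value, makes the low-cardinality case concrete (illustrative code only, not the compressed indices analyzed in the paper):

class BitmapIndex:
    """Uncompressed bitmap index: one bit vector (stored as an int) per distinct value."""

    def __init__(self, column):
        self.n = len(column)
        self.bitmaps = {}
        for row, v in enumerate(column):
            self.bitmaps[v] = self.bitmaps.get(v, 0) | (1 << row)  # set bit for this row id

    def rows_equal(self, v):
        """Row ids satisfying attribute = v."""
        bm = self.bitmaps.get(v, 0)
        return [r for r in range(self.n) if (bm >> r) & 1]

    def rows_in(self, values):
        """OR the per-value bitmaps to answer a small IN-list predicate."""
        bm = 0
        for v in values:
            bm |= self.bitmaps.get(v, 0)
        return [r for r in range(self.n) if (bm >> r) & 1]

idx = BitmapIndex(["a", "b", "a", "c", "b", "a"])   # attribute of cardinality 3
assert idx.rows_equal("a") == [0, 2, 5]
assert idx.rows_in(["b", "c"]) == [1, 3, 4]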
For an attribute with a high cardinality, the size of the bitmap index can be very large. To overcome this size problem, specialized compression schemes are used. Even though there are empirical evidences that some of these compression schemes work well, there has not been any systematic analysis of their effectiveness. In this paper, we systematically analyze the two most efficient bitmap compression techniques, the Byte-aligned Bitmap Code (BBC) and the Word-Aligned Hybrid (WAH) code. Our analyses show that both compression schemes can be optimal. We propose a novel strategy to select the appropriate algorithms so that this optimality is achieved in practice. In addition, our analyses and tests show that the compressed indices are relatively small compared with commonly used indices such as B-trees. Given these facts, we conclude that bitmap index is efficient on attributes of low cardinalities as well as on those of high cardinalities. VLDB Gorder: An Efficient Method for KNN Join Processing. Chenyi Xia,Hongjun Lu,Beng Chin Ooi,Jin Hu 2004 An important but very expensive primitive operation of high-dimensional databases is the K-Nearest Neighbor (KNN) similarity join. The operation combines each point of one dataset with its KNNs in the other dataset and it provides more meaningful query results than the range similarity join. Such an operation is useful for data mining and similarity search. In this paper, we propose a novel KNN-join algorithm, called the Gorder (or the G-ordering KNN) join method. Gorder is a block nested loop join method that exploits sorting, join scheduling and distance computation filtering and reduction to reduce both I/O and CPU costs. It sorts input datasets into the G-order and applied the scheduled block nested loop join on the G-ordered data. The distance computation reduction is employed to further reduce CPU cost. It is simple and yet efficient, and handles high-dimensional data efficiently. Extensive experiments on both synthetic cluster and real life datasets were conducted, and the results illustrate that Gorder is an efficient KNN-join method and outperforms existing methods by a wide margin. VLDB Semantic Mining and Analysis of Gene Expression Data. Xin Xu,Gao Cong,Beng Chin Ooi,Kian-Lee Tan,Anthony K. H. Tung 2004 Association rules can reveal biological relevant relationship between genes and environments / categories. However, most existing association rule mining algorithms are rendered impractical on gene expression data, which typically contains thousands or tens of thousands of columns (gene expression levels), but only tens of rows (samples). The main problem is that these algorithms have an exponential dependence on the number of columns. Another shortcoming is evident that too many associations are generated from such kind of data. To this end, we have developed a novel depth-first row-wise algorithm FARMER [2] that is specially designed to efficiently discover and cluster association rules into interesting rule groups (IRGs) that satisfy user-specified minimum support, confidence and chi-square value thresholds on biological datasets as opposed to finding association rules individually. Based on FARMER, we have developed a prototype system that integrates semantic mining and visual analysis of IRGs mined from gene expression data. VLDB Databases in a Wireless World. David Yach 2004 "The traditional view of distributed databases is based on a number of database servers with regular communication. 
Today information is stored not only in these central databases, but on a myriad of computers and computer-based devices in addition to the central storage. These range from desktop and laptop computers to PDAs and wireless devices such as cellular phones and BlackBerrys. The combination of large centralized databases with a large number and variety of associated edge databases effectively forms a large distributed database, but one where many of the traditional rules and assumptions for distributed databases are no longer true. This keynote will discuss some of the new and challenging attributes of this new environment, particularly focusing on the challenges of wireless and occasionally connected devices. It will look at the new constraints, how these impact the traditional distributed database model, the techniques and heuristics being used to work within these constraints, and identify the potential areas where future research might help tackle these difficult issues." VLDB Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang,Chen Li 2004 "Recent applications are seeing an increasing need for published XML documents to meet precise security requirements. In this paper, we consider data-publishing applications where the publisher specifies what information is sensitive and should be protected. We show that if a partial document is published carelessly, users can use common knowledge (e.g., ""all patients in the same ward have the same disease"") to infer more data, which can cause leakage of sensitive information. The goal is to protect such information in the presence of data inference with common knowledge. We consider common knowledge represented as semantic XML constraints. We formulate the process by which users can infer data using three types of common XML constraints. Interestingly, no matter what sequences users follow to infer data, there is a unique, maximal document that contains all possible inferred documents. We develop algorithms for finding a partial document of a given XML document without causing information leakage, while allowing publishing as much data as possible. Our experiments on real data sets show the effect of inference on data security, and how the proposed techniques can prevent such leakage from happening." VLDB False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Jeffrey Xu Yu,Zhihong Chong,Hongjun Lu,Aoying Zhou 2004 The problem of finding frequent items has been recently studied over high speed data streams. However, mining frequent itemsets from transactional data streams has not been well addressed yet in terms of its bounds of memory consumption. The main difficulty is due to the nature of the exponential explosion of itemsets. Given a domain of I unique items, the possible number of itemsets can be up to 2^I - 1. When the length of data streams approaches a very large number N, the possibility of an itemset being frequent becomes larger and more difficult to track with limited memory. However, the real killer of effective frequent itemset mining is that most existing algorithms are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s-ε to be counted as frequent.
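The ε-based, false-positive style of counting referred to here can be made concrete with a small sketch of single-item counting in the spirit of the well-known Lossy Counting algorithm; this is background illustration only, not the false-negative algorithms the paper goes on to propose, and the function name and parameters are chosen for the example.

import math

def frequent_items_lossy(stream, s, eps):
    """epsilon-approximate frequent-item counting (false-positive oriented).

    Every item with true frequency >= s*N is reported, and some items with
    frequency in [(s - eps)*N, s*N) may be reported as well, which is exactly
    the kind of false positive described in the surrounding text.
    """
    width = math.ceil(1 / eps)                 # bucket width
    counts, deltas = {}, {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:                     # prune low counts at bucket boundaries
            for x in list(counts):
                if counts[x] + deltas[x] <= bucket:
                    del counts[x], deltas[x]
    return {x for x, c in counts.items() if c >= (s - eps) * n}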
Such false-positive items increase the number of false-positive frequent itemsets exponentially, which may make the problem computationally intractable with bounded memory consumption. In this paper, we developed algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with a bound of memory consumption. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that desired recall rate of frequent itemsets can be guaranteed. We developed algorithms based on Chernoff bound. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms. VLDB Stochastic Consistency, and Scalable Pull-Based Caching for Erratic Data Sources. Shanzhong Zhu,Chinya V. Ravishankar 2004 Stochastic Consistency, and Scalable Pull-Based Caching for Erratic Data Sources. VLDB DB2 Design Advisor: Integrated Automatic Physical Database Design. Daniel C. Zilio,Jun Rao,Sam Lightstone,Guy M. Lohman,Adam J. Storm,Christian Garcia-Arellano,Scott Fadden 2004 "The DB2 Design Advisor in IBM® DB2® Universal DatabaseTM (DB2 UDB) Version 8.2 for Linux®, UNIX® and Windows® is a tool that, for a given workload, automatically recommends physical design features that are any subset of indexes, materialized query tables (also called materialized views), shared-nothing database partitionings, and multidimensional clustering of tables. Our work is the very first industrial-strength tool that covers the design of as many as four different features, a significant advance to existing tools, which support no more than just indexes and materialized views. Building such a tool is challenging, because of not only the large search space introduced by the interactions among features, but also the extensibility needed by the tool to support additional features in the future. We adopt a novel ""hybrid"" approach in the Design Advisor that allows us to take important interdependencies into account as well as to encapsulate design features as separate components to lower the reengineering cost. The Design Advisor also features a built-in module that automatically reduces the given workload, and therefore provides great scalability for the tool. Our experimental results demonstrate that our tool can quickly provide good physical design recommendations that satisfy users' requirements." VLDB HOS-Miner: A System for Detecting Outlying Subspaces of High-dimensional Data. Ji Zhang,Meng Lou,Tok Wang Ling,Hai H. Wang 2004 We identify a new and interesting high-dimensional outlier detection problem in this paper, that is, detecting the subspaces in which given data points are outliers. We call the subspaces in which a data point is an outlier as its Outlying Subspaces. In this paper, we will propose the prototype of a dynamic subspace search system, called HOS-Miner (HOS stands for High-dimensional Outlying Subspaces), that utilizes a sample-based learning process to effectively identify the outlying subspaces of a given point. SIGMOD Record Simulation data as data streams. Ghaleb Abdulla,Terence Critchlow,William Arrighi 2004 Computational or scientific simulations are increasingly being applied to solve a variety of scientific problems. 
Domains such as astrophysics, engineering, chemistry, biology, and environmental studies are benefiting from this important capability. Simulations, however, produce enormous amounts of data that need to be analyzed and understood. In this overview paper, we describe scientific simulation data, its characteristics, and the way scientists generate and use the data. We then compare and contrast simulation data to data streams. Finally, we describe our approach to analyzing simulation data, present the AQSim (Ad-hoc Queries for Simulation data) system, and discuss some of the challenges that result from handling this kind of data. SIGMOD Record Book Review: Database Tuning Principles, Experiments, and Troubleshooting Techniques - by Dennis Shasha and Philippe Bonnet . Nancy Hartline Bercich 2004 Book Review: Database Tuning Principles, Experiments, and Troubleshooting Techniques - by Dennis Shasha and Philippe Bonnet . SIGMOD Record Book Review Column. Karl Aberer 2004 Book Review Column. SIGMOD Record Book Review Column. Karl Aberer 2004 This is the last issue of the book review column that will appear under my responsibility. I would like to thank here all authors of book reviews for their very interesting contributions over the last four years. I also hope the readers of SIGMOD RECORD found the articles in this column of interest and that they motivated in some cases to have a more detailed look at one of the reviewed books. For me it was surely an interesting experience seeing which new books have arrived over the last four years and also how they are appreciated by the community. SIGMOD Record Industrial-Strength Schema Matching. Philip A. Bernstein,Sergey Melnik,Michalis Petropoulos,Christoph Quix 2004 Schema matching identifies elements of two given schemas that correspond to each other. Although there are many algorithms for schema matching, little has been written about building a system that can be used in practice. We describe our initial experience building such a system, a customizable schema matcher called Protoplasm. SIGMOD Record Kanata: Adaptation and Evolution in Data Sharing Systems. Periklis Andritsos,Ariel Fuxman,Anastasios Kementsietsidis,Renée J. Miller,Yannis Velegrakis 2004 "In Toronto's Kanata project, we are investigating the integration and exchange of data and metadata in dynamic, autonomous environments. Our focus is on the development and maintenance of semantic mappings that permit runtime sharing of information." SIGMOD Record BioFast: Challenges in Exploring Linked Life Science Sources. Jens Bleiholder,Zoé Lacroix,Hyma Murthy,Felix Naumann,Louiqa Raschid,Maria-Esther Vidal 2004 BioFast: Challenges in Exploring Linked Life Science Sources. SIGMOD Record Managing and Analyzing Carbohydrate Data. Kiyoko F. Aoki,Nobuhisa Ueda,Atsuko Yamaguchi,Tatsuya Akutsu,Minoru Kanehisa,Hiroshi Mamitsuka 2004 One of the most vital molecules in multicellular organisms is the carbohydrate, as it is structurally important in the construction of such organisms. In fact, all cells in nature carry carbohydrate sugar chains, or glycans, that help modulate various cell-cell events for the development of the organism. Unfortunately, informatics research on glycans has been slow in comparison to DNA and proteins, largely due to difficulties in the biological analysis of glycan structures. Our work consists of data engineering approaches in order to glean some understanding of the current glycan data that is publicly available. 
In particular, by modeling glycans as labeled unordered trees, we have implemented a tree-matching algorithm for measuring tree similarity. Our algorithm utilizes proven efficient methodologies in computer science that has been extended and developed for glycan data. Moreover, since glycans are recognized by various agents in multicellular organisms, in order to capture the patterns that might be recognized, we needed to somehow capture the dependencies that seem to range beyond the directly connected nodes in a tree. Therefore, by defining glycans as labeled ordered trees, we were able to develop a new probabilistic tree model such that sibling patterns across a tree could be mined. We provide promising results from our methodologies that could prove useful for the future of glycome informatics. SIGMOD Record A Denotational Semantics for Continuous Queries over Streams and Relations. Arvind Arasu,Jennifer Widom 2004 Continuous queries over data streams are an important new class of queries motivated by a number of applications [BBD+02, Geh03, GO03], and several languages for continuous queries have been proposed recently [ABW03, CC+02, CC+03, WZL03]. To date the semantics of these languages have been specified fairly informally, sometimes solely through illustrative examples. SIGMOD Record A context-aware methodology for very small data base design. Cristiana Bolchini,Fabio A. Schreiber,Letizia Tanca 2004 The design of a Data Base to be resident on portable devices and embedded processors for professional systems requires considering both the device memory peculiarities and the mobility aspects, which are an essential feature of the embedded applications. Moreover, these devices are often part of a larger Information System, comprising fixed and mobile resources. We propose a complete methodology for designing Very Small Data Bases, from the identification of the device resident portions down to the choice of the physical data structure, optimizing the cost and power consumption of the Flash memory, which - in the greatest generality - constitutes the permanent storage of the device. SIGMOD Record Using Reasoning to Guide Annotation with Gene Ontology Terms in GOAT. Michael Bada,Daniele Turi,Robin McEntire,Robert Stevens 2004 High-quality annotation of biological data is central to bioinformatics. Annotation using terms from ontologies provides reliable computational access to data. The Gene Ontology (GO), a structured controlled vocabulary of nearly 17,000 terms, is becoming the de facto standard for describing the functionality of gene products. Many prominent biomedical databases use GO as a source of terms for functional annotation of their gene-product entries to promote consistent querying and interoperability. However, current annotation editors do not constrain the choice of GO terms users may enter for a given gene product, potentially resulting in an inconsistent or even nonsensical description. Furthermore, the process of annotation is largely an unguided one in which the user must wade through large GO subtrees in search of terms. Relying upon a reasoner loaded with a DAML+OIL version of GO and an instance store of mined GO-term-to-GO-term associations, GOAT aims to aid the user in the annotation of gene products with GO terms by displaying those field values that are most likely to be appropriate based on previously entered terms. This can result in a reduction in biologically inconsistent combinations of GO terms and a less tedious annotation process on the part of the user. 
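One simple way to realize the idea of suggesting terms from previously entered ones is a co-occurrence ranking over past annotations; the sketch below assumes a plain in-memory list of prior annotation sets and stands in for, and is much weaker than, GOAT's reasoner over GO and its mined term-to-term associations. All identifiers in the toy example are made up.

from collections import Counter
from itertools import combinations

def mine_cooccurrence(annotations):
    """Count how often pairs of terms were used together in past annotations."""
    pair_counts = Counter()
    for terms in annotations:
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return pair_counts

def suggest_terms(entered, pair_counts, k=5):
    """Rank candidate terms by their co-occurrence with already-entered terms."""
    scores = Counter()
    for t in entered:
        for (a, b), c in pair_counts.items():
            if a == t and b not in entered:
                scores[b] += c
    return [term for term, _ in scores.most_common(k)]

history = [{"GO:0005634", "GO:0003677"}, {"GO:0005634", "GO:0006355"},
           {"GO:0003677", "GO:0006355"}]
pc = mine_cooccurrence(history)
print(suggest_terms({"GO:0005634"}, pc))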
SIGMOD Record Entity-Relationship modeling revisited. Antonio Badia 2004 In this position paper, we argue that modern applications require databases to capture and enforce more domain semantics than traditional applications. We also argue that the best way to incorporate additional semantics into database systems is by capturing the added information in conceptual models and then using it for database design. In this light, we revisit Entity-Relationship models and investigate ways in which such models could be extended to play a role in the process. Inspired by a paper by Rafael Camps Pare ([2]), we suggest avenues of research on the issue. SIGMOD Record Mobile Databases: a Selection of Open Issues and Research Directions. Guy Bernard,Jalel Ben-Othman,Luc Bouganim,Gérôme Canals,Sophie Chabridon,Bruno Defude,Jean Ferrié,Stéphane Gançarski,Rachid Guerraoui,Pascal Molli,Philippe Pucheral,Claudia Roncancio,Patricia Serrano-Alvarado,Patrick Valduriez 2004 This paper reports on the main results of a specific action on mobile databases conducted by CNRS in France from October 2001 to December 2002. The objective of this action was to review the state of progress in mobile databases and identify major research directions for the French database community. Rather than provide a survey of all important issues in mobile databases, this paper gives an outline of the directions in which the action participants are now engaged, namely: copy synchronization in disconnected computing, mobile transactions, databases embedded in ultra-light devices, data confidentiality, P2P dissemination models and middleware adaptability. SIGMOD Record Reminiscences on Influential Papers - Kenneth A. Ross. Nicolas Bruno,Samuel Madden,Wei Wang 2004 Reminiscences on Influential Papers - Kenneth A. Ross. SIGMOD Record Structured Databases on the Web: Observations and Implications. Kevin Chen-Chuan Chang,Bin He,Chengkai Li,Mitesh Patel,Zhen Zhang 2004 "The Web has been rapidly ""deepened"" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this ""deep Web"" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our ""macro"" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our ""micro"" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How ""hidden"" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions." SIGMOD Record NESTREAM: Querying Nested Streams. Damianos Chatziantoniou,Achilleas Anagnostopoulos 2004 "This article identifies an interesting class of applications where stream sessions may be organized in a hierarchical fashion - i.e. sessions may contain sub-sessions. For example, log streams from call centers belong to different call sessions and call sessions consist of services' sub-sessions.
We may want to monitor statistics and perform accounting at any level on this hierarchy, relative to any other higher level (e.g. monitoring the average service session per call vs. the average service session for the entire system). We argue that data streams of this kind have rich procedural semantics - i.e. behavior - and therefore a semantically rich model should be used: a session may be defined by opening and closing conditions, may have data and methods and may consist of sub-sessions. We propose a simple conceptual model based on the notion of ""session"" - similar to a class in an object-oriented environment -- having lifetime semantics. Queries on top of this schema can be formulated via HSA (hierarchical stream aggregate) expressions. We describe an algorithm dictating how stream data flow down session hierarchies and discuss potential evaluation and optimization techniques for HSAs. Finally we introduce NESTREAM, a prototype implementation for these ideas and give some preliminary experimental results." SIGMOD Record Automatic Composite Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification. Liangyou Chen,Hasan M. Jamil,Nan Wang 2004 Biological data analyses usually require complex manipulations involving tool applications, multiple web site navigation, result selection and filtering, and iteration over the internet. Most biological data are generated from structured databases and by applications and presented to the users embedded within repeated structures, or tables, in HTML documents. In this paper we outline a novel technique for the identification of table structures in HTML documents. This identification technique is then used to automatically generate composite wrappers for applications requiring distributed resources. We demonstrate that our method is robust enough to discover standard as well as non-standard table structures in HTML documents. Thus our technique outperforms contemporary techniques used in systems such as XWrap and AutoWrapper. We discuss our technique in the context of our PickUp system that exploits the theoretical developments presented in this paper and emerges as an elegant automatic wrapper generation system. SIGMOD Record Piers: An Efficient Model for Similarity Search in DNA Sequence Databases. Xia Cao,Shuai Cheng Li,Beng Chin Ooi,Anthony K. H. Tung 2004 "Growing interest in genomic research has resulted in the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments in the databases called 'piers' need to be accessed during searches, as opposed to other approaches which require a full scan of the biological sequence database. To further improve the search efficiency, the piers are stored in a specially designed hash table which helps to avoid expensive alignment operations. The hash table is small enough to reside in main memory, hence avoiding I/O in the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity." SIGMOD Record An initial study of overheads of eddies. Amol Deshpande 2004 "An eddy [2] is a highly adaptive query processing operator that continuously reoptimizes a query in response to changing runtime conditions.
It does this by treating query processing as routing of tuples through operators and making per-tuple routing decisions. The benefits of such adaptivity can be significant, especially in highly dynamic environments such as data streams, sensor query processing, web querying, etc. Various parties have asserted that the cost of making per-tuple routing decisions is prohibitive. We have implemented eddies in the PostgreSQL open source database system [1] in the context of the TelegraphCQ project. In this paper, we present an ""apples-to-apples"" comparison of PostgreSQL query processing overhead with and without eddies. Our results show that with some minor tuning, the overhead of the eddy mechanism is negligible." SIGMOD Record Semantic Integration Workshop at the 2nd International Semantic Web Conference (ISWC-2003). AnHai Doan,Alon Y. Halevy,Natalya Fridman Noy 2004 Semantic Integration Workshop at the 2nd International Semantic Web Conference (ISWC-2003). SIGMOD Record Introduction to the Special Issue on Semantic Integration. AnHai Doan,Natalya Fridman Noy,Alon Y. Halevy 2004 Semantic heterogeneity is one of the key challenges in integrating and sharing data across disparate sources, data exchange and migration, data warehousing, model management, the Semantic Web and peer-to-peer databases. Semantic heterogeneity can arise at the schema level and at the data level. At the schema level, sources can differ in relations, attribute and tag names, data normalization, levels of detail, and the coverage of a particular domain. The problem of reconciling schema-level heterogeneity is often referred to as schema matching or schema mapping. At the data level, we find different representations of the same real-world entities (e.g., people, companies, publications, etc.). Reconciling data-level heterogeneity is referred to as data deduplication, record linkage, and entity/object matching. To exacerbate the heterogeneity challenges, schema elements of one source can be represented as data in another. This special issue presents a set of articles that describe recent work on semantic heterogeneity at the schema level. SIGMOD Record Semantically Enriched Web Services for the Travel Industry. Asuman Dogac,Yildiray Kabak,Gokce Laleci,Siyamed S. Sinir,Ali Yildiz,Serkan Kirbas,Yavuz Gurcan 2004 Today, the travel information services are dominantly provided by Global Distribution Systems (GDS). The Global Distribution Systems provide access to real time availability and price information for flights, hotels and car rental companies. However GDSs have legacy architectures with private networks, specialized hardware, limited speed and search capabilities. Furthermore, being legacy systems, it is very difficult to interoperate them with other systems and data sources. For these reasons, Web service technology is an ideal fit for travel information systems. However to be able to exploit Web services to their full potential, it is necessary to introduce semantics. Without describing the semantics of Web services we are looking for, it is difficult to find them in an automated way and if we cannot describe the service we have, the probability that people will find it in an automated way is low. Furthermore, to make the semantics machine processable and interoperable, we need to describe domain knowledge through standard ontology languages. In this paper, we describe how to deploy semantically enriched travel Web services and how to exploit semantics through Web service registries. 
We also address the need to use the semantics in discovering both Web services and Web service registries through peer-to-peer technology. SIGMOD Record An Early Look at XQuery API for Java (XQJ). Andrew Eisenberg,Jim Melton 2004 An Early Look at XQuery API for Java (XQJ). SIGMOD Record Advancements in SQL/XML. Andrew Eisenberg,Jim Melton 2004 Since we last wrote about SQL/XML in [2], the first edition of that new part of the SQL standard has been officially published as an international standard [1], commonly called SQL/XML:2003. At the time of that earlier column, SQL/XML was just entering its first official ballot, meaning that (possibly significant) changes to the text were expected in response to ballot comments submitted by the various participants in the SQL standardization process. SIGMOD Record A Holistic Paradigm for Large Scale Schema Matching. Bin He,Kevin Chen-Chuan Chang 2004 "Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondences in isolation. In contrast, we propose a new matching paradigm, holistic schema matching, to match many schemas at the same time and find all matchings at once. By handling a set of schemas together, we can explore their context information that reflects the semantic correspondences among attributes. Such information is not available when schemas are matched only in pairs. As the realizations of holistic schema matching, we develop two alternative approaches: global evaluation and local evaluation. Global evaluation exhaustively assesses all possible ""models,"" where a model expresses all attribute matchings. In particular, we propose the MGS framework for such global evaluation, building upon the hypothesis of the existence of a hidden schema model that probabilistically generates the schemas we observed. On the other hand, local evaluation independently assesses every single matching to incrementally construct such a model. In particular, we develop the DCM framework for local evaluation, building upon the observation that co-occurrence patterns across schemas often reveal the complex relationships of attributes. We apply our approaches to match query interfaces on the deep Web. The result shows the effectiveness of both the MGS and DCM approaches, which together demonstrate the promise of holistic schema matching." SIGMOD Record SQL: 2003 has been published. Andrew Eisenberg,Jim Melton,Krishna G. Kulkarni,Jan-Eike Michels,Fred Zemke 2004 SQL: 2003 has been published. SIGMOD Record Automatic Direct and Indirect Schema Mapping: Experiences and Lessons Learned. David W. Embley,Li Xu,Yihong Ding 2004 Schema mapping produces a semantic correspondence between two schemas. Automating schema mapping is challenging. The existence of 1:n (or n:1) and n:m mapping cardinalities makes the problem even harder. Recently, we have studied automated schema mapping techniques (using data frames and domain ontology snippets) that not only address the traditional 1:1 mapping problem, but also the harder 1:n and n:m mapping problems. Experimental results show that the approach can achieve excellent precision and recall. In this paper, we share our experiences and lessons we have learned during our schema mapping studies. SIGMOD Record "Report on the 9th ACM Symposium on Access Control Models and Technologies (SACMMAT'04)." 
Elena Ferrari 2004 "SACMAT'04 was held on June 2-4, 2004, at Yorktown Heights, New York, USA, and was hosted by IBM T.J. Watson Research Center. The symposium, which was colocated with the IEEE International Workshop on Policies for Distributed Systems and Networks (POLICY 2004), continues its tradition of being the premier forum for presentation of research results and experience reports on leading edge issues of access control and related technologies, including models, systems, applications, and theory. SACMAT gives researchers and practitioners a unique opportunity to share their perspectives with others interested in the various aspects of access control. The call for papers attracted 65 submissions from all over the world. The program committee selected 18 papers for presentation. The 18 papers, organized in seven sessions, were presented over two and a half days. The selected papers cover a wide range of topics, ranging from next-generation access control models to security analysis, role administration, policy specification and implementation, and access control for distributed environments and XML data. All the accepted papers were included in a volume published by ACM, and the best papers from the symposium have been invited for possible publication in ACM Transactions on Information and System Security (TISSEC). Besides the technical sessions, this year's program included a keynote speech on Access Control for Databases, and a panel on Security for Grid-based Computing Systems." SIGMOD Record A Read-Only Transaction Anomaly Under Snapshot Isolation. "Alan Fekete,Elizabeth J. O'Neil,Patrick E. O'Neil" 2004 Snapshot Isolation (SI) is a multi-version concurrency control algorithm introduced in [BBGMOO95] and later implemented by Oracle. SI avoids many concurrency errors, and it never delays read-only transactions. However, it does not guarantee serializability. It has been widely assumed that, under SI, read-only transactions always execute serializably provided the concurrent update transactions are serializable. The reason for this is that all SI reads return values from a single instant of time when all committed transactions have completed their writes and no writes of non-committed transactions are visible. This seems to imply that read-only transactions will not read anomalous results so long as the update transactions with which they execute do not write such results. In the current note, however, we exhibit an example contradicting these assumptions: it is possible for an SI history to be non-serializable while the sub-history containing all update transactions is serializable. SIGMOD Record Replica allocation for correlated data items in ad hoc sensor networks. Takahiro Hara,Norishige Murakami,Shojiro Nishio 2004 To improve data accessibility in ad hoc networks, in our previous work we proposed three methods of replicating data items by considering the data access frequencies from mobile nodes to each data item and the network topology. In this paper, we extend our previously proposed methods to consider the correlation among data items. Under these extended methods, the data priority of each data item is defined based on the correlation among data items, and data items are replicated at mobile nodes according to the data priority. We employ simulations to show that the extended methods are more efficient than the original ones. SIGMOD Record Team communications among autonomous sensor swarms.
Mario Gerla,Yunjung Yi 2004 In this paper, we consider team (swarm) of unmanned vehicles (UVs) equipped with various sensors (videos, chemicals, etc). Those swarms need efficient communication to feed sensed data, communicate data to other swarms, to navigate and, more generally, to carry out complex mission autonomously. We focus on a particular aspect of mission oriented communications, namely, team multicast. In team multicast, the multicast group does not consist of individual members, rather, of teams. In our case, the teams may consist of special UVs that have been established to launch a search and rescue mission. Simulation results illustrate the performance benefits of the team multicast solution as compared with more traditional multicast approaches. SIGMOD Record "Report on the Dagstuhl Seminar: ""data quality on the Web""." Michael Gertz,M. Tamer Özsu,Gunter Saake,Kai-Uwe Sattler 2004 "Report on the Dagstuhl Seminar: ""data quality on the Web""." SIGMOD Record UbiData: Requirements and Architecture for Ubiquitous Data Access. Abdelsalam Helal,Joachim Hammer 2004 Mobile users today demand ubiquitous access to their data from any mobile device and under variable connection quality. We refer to this requirement as any-time, any-where data access whose realization requires much more support for asynchronous and disconnected operation than is currently available from existing research prototypes or commercial products. Furthermore, the proliferation of mobile devices and applications, forges the additional requirement of device- and application-transparent data access. Support for such any-device, any-application computing paradigm requires the ability to store and manage data in a generic representation and to transform it for usage in different applications, which may also be running on different platforms. In this article, we give an overview of the UbiData architecture and prototype system and show how it addresses these challenging requirements. We also summarize our ongoing and future efforts. SIGMOD Record Optimization of Data Stream Processing. Janusz R. Getta,Ehsan Vossough 2004 Efficient processing of unlimited and continuously expanding sequences of data items is one of the key factors in the implementations of Data Stream Management Systems (DSMS). Analysis of stream processing at the dataflow level reveals execution plans which are not visible at a logical level. This work introduces a new model of data stream processing and discusses a number of optimization techniques applicable to this model and its implementation. The optimization techniques include applications of containers with intermediate results, analysis of data processing rates, and efficient synchronization of elementary operations on data streams. The paper also describes the translation of logical level expressions on data streams into the sets of dataflow level expressions, syntax based optimization of dataflow expression, and scheduling of concurrent computations of the dataflow expressions. SIGMOD Record Evaluating lock-based protocols for cooperation on XML documents. Sven Helmer,Carl-Christian Kanne,Guido Moerkotte 2004 We discuss four different core protocols for synchronizing access to and modifications of XML document collections. These core protocols synchronize structure traversals and modifications. They are meant to be integrated into a native XML base management System (XBMS) and are based on two phase locking. 
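As background for the two-phase locking basis of these protocols, here is a minimal shared/exclusive lock-table sketch enforcing the growing/shrinking rule. It is illustrative only: it is not one of the four core protocols, and it omits lock waiting, deadlock handling, and the DTD-based refinements discussed later; all class and method names are assumptions for the example.

from enum import Enum

class Mode(Enum):
    S = "shared"
    X = "exclusive"

class LockTable:
    """Minimal shared/exclusive lock table keyed by document node."""

    def __init__(self):
        self.locks = {}                       # node -> (Mode, set of transaction ids)

    def grant(self, tid, node, mode):
        held = self.locks.get(node)
        if held is None:
            self.locks[node] = (mode, {tid})
            return True
        held_mode, holders = held
        if holders == {tid}:                  # re-request or upgrade by the sole holder
            new_mode = Mode.X if Mode.X in (held_mode, mode) else Mode.S
            self.locks[node] = (new_mode, holders)
            return True
        if mode is Mode.S and held_mode is Mode.S:
            holders.add(tid)                  # shared locks are compatible
            return True
        return False                          # conflict: caller must wait or abort

    def release_all(self, tid):
        for node in list(self.locks):
            mode, holders = self.locks[node]
            holders.discard(tid)
            if not holders:
                del self.locks[node]

class Transaction:
    """Two-phase discipline: every lock acquisition precedes every release."""

    def __init__(self, tid, table):
        self.tid, self.table, self.shrinking = tid, table, False

    def lock(self, node, mode):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after release phase began")
        return self.table.grant(self.tid, node, mode)

    def commit(self):
        self.shrinking = True                 # enter shrinking phase, release everything
        self.table.release_all(self.tid)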
We also demonstrate the different degrees of cooperation that are possible with these protocols by various experimental results. Furthermore, we also discuss extensions of these core protocols to full-fledged protocols. Further, we show how to achieve a higher degree of concurrency by exploiting the semantics expressed in Document Type Definitions (DTDs). SIGMOD Record BIO-AJAX: An Extensible Framework for Biological Data Cleaning. Katherine G. Herbert,Narain H. Gehani,William H. Piel,Jason Tsong-Li Wang,Cathy H. Wu 2004 As databases become more pervasive through the biological sciences, various data quality issues regarding data legacy, data uniformity and data duplication arise. Due to the nature of this data, each of these problems is non-trivial. For biological data to be corrected and standardized, new methods and frameworks must be developed. This paper proposes one such framework, called BIO-AJAX, which uses principles from data cleaning to improve data quality in biological information systems, specifically in TreeBASE. SIGMOD Record Integration of Biological Sources: Current Systems and Challenges Ahead. Thomas Hernandez,Subbarao Kambhampati 2004 This paper surveys the area of biological and genomic sources integration, which has recently become a major focus of the data integration research field. The challenges that an integration system for biological sources must face are due to several factors such as the variety and amount of data available, the representational heterogeneity of the data in the different sources, and the autonomy and differing capabilities of the sources. This survey describes the main integration approaches that have been adopted. They include warehouse integration, mediator-based integration, and navigational integration. Then we look at the four major existing integration systems that have been developed for the biological domain: SRS, BioKleisli, TAMBIS, and DiscoveryLink. After analyzing these systems and mentioning a few others, we identify the pros and cons of the current approaches and systems and discuss what an integration system for biologists ought to be. SIGMOD Record Report on the first Twente Data Management Workshop on XML Databases and Information Retrieval. Djoerd Hiemstra,Vojkan Mihajlovic 2004 The Database Group of the University of Twente initiated a new series of workshops called Twente Data Management workshops (TDM), starting with one on XML Databases and Information Retrieval which took place on 21 June 2004 at the University of Twente. We have set ourselves two goals for the workshop series: i) To provide a forum to share original ideas as well as research results on data management problems; ii) To bring together researchers from the database community and researchers from related research fields SIGMOD Record Issues in Mechanical Engineering Design Management. Dave Hislop,Zoé Lacroix,Gerald Moeller 2004 The Virtual Parts Engineering Research Center (VPERC), funded by the Army Research Office, focuses on building frameworks, tools, and technologies for making engineered systems sustainable and maintainable thanks to a virtual engineering environment intended to transform the engineering process, thus supporting extremely fast turnaround times for urgent part supply needs. One of its key research thrusts is data management targeted for the design process. 
The invitational VPERC workshop held at Arizona State University in June 2003 involved 38 participants from academia, governmental institutions and industry who discussed legacy systems engineering. This paper presents the data management needs to support mechanical engineering design as they were discussed at the meeting. SIGMOD Record Logic-based Web Information Extraction. Georg Gottlob,Christoph Koch 2004 The Web wrapping problem, i.e., the problem of extracting structured information from HTML documents, is one of great practical importance. The often observed information overload that users of the Web experience witnesses the lack of intelligent and encompassing Web services that provide high-quality collected and value-added information. The Web wrapping problem has been addressed by a significant amount of research work. Previous work can be classified into two categories, depending on whether the HTML input is regarded as a sequential character string (e.g., [34, 27, 24, 30, 23]) or a pre-parsed document tree (for instance, [35, 25, 22, 29, 3, 2, 26]). The latter category of work thus assumes that systems may make use of an existing HTML parser as a front end. SIGMOD Record Introducing an Annotated Bibliography on Temporal and Evolution Aspects in the World Wide Web. Fabio Grandi 2004 Time is a pervasive dimension of reality as everything evolves as time elapses. Information systems and applications at least mirror, and often have to capture, the time-varying and evolutionary nature of the phenomena they model and the activities they support. This aspect has been acknowledged and long studied in the field of temporal databases but it truly applies also to the World Wide Web, although it has seemingly not been considered a primary issue yet. However, several papers addressing, in an explicit or implicit way, the representation and management of time and change in the World Wide Web appeared recently and, on some aspects, showed a clear upward trend in recent months, witnessing a sustained and/or growing interest. SIGMOD Record Reminiscences on Influential Papers - Kenneth A. Ross. Zachary G. Ives,Bertram Ludäscher,Ioana Manolescu 2004 Reminiscences on Influential Papers - Kenneth A. Ross. SIGMOD Record Life Science Research and Data Management. Amarnath Gupta 2004 Life Science Research and Data Management. SIGMOD Record Database Management for Life Sciences Research. H. V. Jagadish,Frank Olken 2004 The life sciences provide a rich application domain for data management research, with a broad diversity of problems that can make a significant difference to progress in life sciences research. This article is an extract from the Report of the NSF Workshop on Data Management for Molecular and Cell Biology, edited by H. V. Jagadish and Frank Olken. The workshop was held at the National Library of Medicine, Bethesda, MD, Feb. 2-3, 2003. SIGMOD Record The GenAlg Project. Joachim Hammer,Markus Schneider 2004 The GenAlg Project. SIGMOD Record Report on the Workshop on Metadata Management in Grid and Peer-to-Peer Systems, London, December 16, 2003.
Kevin Keenoy,Alexandra Poulovassilis,Vassilis Christophides,George Loizou,Giorgos Kokkinidis,George Samaras,Nicolas Spyratos 2004 A workshop on Metadata Management in Grid and Peer-to-Peer Systems was held in the Senate House of the University of London on December 16, 2003. The workshop was organised by the SeLeNe (Self e-Learning Networks) IST project as part of its dissemination activities. The goal of the workshop was to identify recent technological achievements and open challenges regarding metadata management in novel applications requiring peer-to-peer information management in a distributed or Grid setting. The target audience for this event was researchers from the Grid, peer-to-peer and e-learning communities, as well as from other application areas requiring Grid and/or peer-to-peer support. The event attracted 43 participants from 8 different European countries, and we believe that it was an important step in coordinating research activities in these inter-related areas. The presentations at the workshop fell into one of four sessions, each of which we report on below. SIGMOD Record Emergence and Convergence: Qualitative Novelty and the Unity of Knowledge By Mario Bunge. Haim Kilov 2004 Emergence and Convergence: Qualitative Novelty and the Unity of Knowledge By Mario Bunge. SIGMOD Record An Evaluation of XML Indexes for Structural Join. Hanyu Li,Mong-Li Lee,Wynne Hsu,Chao Chen 2004 XML queries differ from relational queries in that the former are expressed as path expressions. The efficient handling of structural relationships has become a key factor in XML query processing. Many index-based solutions have been proposed for efficient structural join in XML queries. This work explores the state-of-the-art indexes, namely, B+-tree, XB-tree and XR-tree, and analyzes how well they support XML structural joins. Experimental results indicate that all three indexes yield comparable performance for non-recursive XML data, while the XB-tree outperforms the rest for highly recursive XML data. SIGMOD Record Semantic Integration: A Survey Of Ontology-Based Approaches. Natalya Fridman Noy 2004 Semantic integration is an active area of research in several disciplines, such as databases, information integration, and ontologies. This paper provides a brief survey of the approaches to semantic integration developed by researchers in the ontology community. We focus on the approaches that differentiate the ontology research from other related areas. The goal of the paper is to provide a reader who may not be very familiar with ontology research with an introduction to major themes in this research and with pointers to different research projects. We discuss techniques for finding correspondences between ontologies, declarative ways of representing these correspondences, and use of these correspondences in various semantic-integration tasks. SIGMOD Record Toward an ontology-enhanced information filtering agent. Kwang Mong Sim 2004 "Whereas search engines assist users in locating initial information sources, often an overwhelmingly large number of URLs is returned, and the task of browsing websites rests heavily on users. The contribution of this work is developing an information filtering agent (IFA) that assists users in identifying out-of-context web pages and rating the relevance of web pages.
An IFA determines the relevance of web pages by adopting three heuristics: (i) detecting evidence phrases (EP) constructed from WORDNET's ontology, (ii) counting the frequencies of EP and (iii) considering the nearness among keywords. Favorable experimental results show that the IFA's ratings of web pages are generally close to human ratings in many instances. The strengths and weaknesses of the IFA are also discussed." SIGMOD Record "Report on the 3rd Web Dynamics Workshop, at WWW'2004." Mark Levene,Alexandra Poulovassilis 2004 "The web is highly dynamic in both the content and quantity of the information that it encompasses. In order to fully exploit its enormous potential as a global repository of information, we need to understand how its size, topology, and content are evolving. This then allows the development of new techniques for locating and retrieving information that are better able to adapt and scale to its change and growth. The web's users are highly diverse and can access it from a variety of devices and interfaces, at different places and times, and for varying purposes. Thus, new techniques are being developed for personalising the presentation and content of web-based information depending on how it is being accessed and on the individual user's requirements and preferences. New applications in areas such as e-business, sensor networks, and mobile and ubiquitous computing need to be able to detect and react quickly to events and changes in web-based information. Traditional approaches using query-based `pull' of information to find out if events or changes of interest have occurred may not be able to scale to the quantity and frequency of events and changes being generated, and new `push'-based techniques are being deployed in which information producers automatically notify consumers when events or changes of interest to them occur. Semantic Web and Web Service technologies are being developed and adopted, with the aim of providing standard ways for web-based applications to share and personalise information." SIGMOD Record "Editor's Notes." Ling Liu 2004 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2004 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2004 "Editor's Notes." SIGMOD Record "Editor's Notes." Ling Liu 2004 "Editor's Notes." SIGMOD Record An Optimal Algorithm for Querying Tree Structures and its Applications in Bioinformatics. Hsiao-Fei Liu,Ya-Hui Chang,Kun-Mao Chao 2004 "Trees and graphs are widely used to model biological databases. Providing efficient algorithms to support tree-based or graph-based querying is therefore an important issue. In this paper, we propose an optimal algorithm which can answer the following question: ""Where do the root-to-leaf paths of a rooted labeled tree Q occur in another rooted labeled tree T?"" in time O(m + Occ), where m is the size of Q and Occ is the output size. We also show that the problem of querying a general graph is NP-complete and not approximable within n^k for any k < 1, where n is the number of nodes in the queried graph, unless P = NP." SIGMOD Record Report on the 9th International Workshop on Data Base Programming Languages. Georg Lausen,Dan Suciu 2004 Report on the 9th International Workshop on Data Base Programming Languages. SIGMOD Record QUASAR: quality aware sensing architecture. Iosif Lazaridis,Qi Han,Xingbo Yu,Sharad Mehrotra,Nalini Venkatasubramanian,Dmitri V.
Kalashnikov,Weiwen Yang 2004 "Sensor devices promise to revolutionize our interaction with the physical world by allowing continuous monitoring and reaction to natural and artificial processes at an unprecedented level of spatial and temporal resolution. As sensors become smaller, cheaper and more configurable, systems incorporating large numbers of them become feasible. Besides the technological aspects of sensor design, a critical factor enabling future sensor-driven applications will be the availability of an integrated infrastructure taking care of the onus of data management. Ideally, accessing sensor data should be no more difficult or inconvenient than using simple SQL. In this paper we investigate some of the issues that such an infrastructure must address. Unlike conventional distributed database systems, a sensor data architecture must handle extremely high data generation rates from a large number of small autonomous components. And, unlike the emerging paradigm of data streams, it is infeasible to think that all this data can be streamed into the query processing site, due to severe bandwidth and energy constraints of battery-operated wireless sensors. Thus, sensing data architectures must become quality-aware, regulating the quality of data at all levels of the distributed system, and supporting user applications' quality requirements in the most efficient manner possible." SIGMOD Record "Report on the First ""XQuery Implementation, Experience, and Perspectives"" Workshop (XIME-P)." Ioana Manolescu,Yannis Papakonstantinou 2004 The XQuery Implementation, Experience and Perspectives (XIME-P) workshop was organized by Ioana Manolescu and Yannis Papakonstantinou in cooperation with the ACM SIGMOD Conference, and was held in Maison de la Chimie, in Paris, France, on June 17 and 18, 2004. This report summarizes the goals and topics of the workshop, presents the major workshop highlights and the main issues discussed during the workshop. SIGMOD Record Reconsidering Multi-Dimensional schemas. Tim Martyn 2004 "This paper challenges the currently popular ""Data Warehouse is a Special Animal"" philosophy and advocates that practitioners adopt a more conservative ""Data Warehouse=Database"" philosophy. The primary focus is the relevancy of Multi-Dimensional logical schemas. After enumerating the advantages of such schemas, a number of caveats to the presumed advantages are identified. The paper concludes with guidelines and commentary on implications for data warehouse design methodologies." SIGMOD Record Yahiko Kambayashi (February 15, 1943 - February 6, 2004) - A tribute and personal memoirs. Yoshifumi Masunaga,Katsumi Tanaka 2004 Yahiko Kambayashi (February 15, 1943 - February 6, 2004) - A tribute and personal memoirs. SIGMOD Record CiVeDi: A Customized Virtual Environment for Database Interaction. Pietro Mazzoleni,Elisa Bertino,Elena Ferrari,Stefano Valtolina 2004 This paper presents CiVeDi, a scalable system providing a flexible and customizable virtual environment for displaying multimedia contents. Using CiVeDi, both the final users and the exhibition curators can personalize the content of the visit as well as the visit appearance and its duration. The proposed solution aims to be used transparently over different media objects either stored into a database or dynamically collected from online digital libraries. SIGMOD Record EDBT04 Workshop on Database Technologies for Handling XML Information on the Web. Marco Mesiti,Barbara Catania,Giovanna Guerrini,Akmal B.
Chaudhri 2004 EDBT04 Workshop on Database Technologies for Handling XML Information on the Web. SIGMOD Record Multiplex, Fusionplex, and Autoplex - Three Generations of Information Integration. Amihai Motro,Jacob Berlin,Philipp Anokhin 2004 "We describe three generations of information integration systems developed at George Mason University. All three systems adopt a virtual database design: a global integration schema, a mapping between this schema and the schemas of the participating information sources, and automatic interpretation of global queries. The focus of Multiplex is rapid integration of very large, evolving, and heterogeneous collections of information sources. Fusionplex strengthens these capabilities with powerful tools for resolving data inconsistencies. Finally, Autoplex takes a more proactive approach to integration, by ""recruiting"" contributions to the global integration schema from available information sources. Using machine learning techniques it confronts a major cost of integration, that of mapping new sources into the global schema." SIGMOD Record Book Review: Managing Gigabytes: Compressing and Indexing documents and images - By Ian H. Witten, Alistair Moffat, and Timothy C. Bell (Second Edition). S. V. Nagaraj 2004 Book Review: Managing Gigabytes: Compressing and Indexing documents and images - By Ian H. Witten, Alistair Moffat, and Timothy C. Bell (Second Edition). SIGMOD Record Report from the First Workshop on Geo Sensor Networks. Silvia Nittel,Anthony Stefanidis,Isabel F. Cruz,Max J. Egenhofer,Dina Q. Goldin,A. Howard,Alexandros Labrinidis,Samuel Madden,Agnès Voisard,Michael F. Worboys 2004 Report from the First Workshop on Geo Sensor Networks. SIGMOD Record "Chair's Message." M. Tamer Özsu 2004 "Chair's Message." SIGMOD Record "Chair's Message" M. Tamer Özsu 2004 "Chair's Message" SIGMOD Record "Chair's Message." M. Tamer Özsu 2004 "Chair's Message." SIGMOD Record Statistical grid-based clustering over data streams. Nam Hun Park,Won Suk Lee 2004 A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This paper proposes a statistical grid-based approach to clustering data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. Eventually, a dense region of each initial cell is recursively partitioned until it becomes the smallest cell called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. In order to minimize the number of cells, a sparse intermediate or unit cell is pruned if its support becomes much less than a minimum support. Furthermore, in order to confine the usage of memory space, the size of a unit cell is dynamically minimized such that the result of clustering becomes as accurate as possible. The proposed algorithm is analyzed by a series of experiments to identify its various characteristics. SIGMOD Record Matching Large XML Schemas. 
Erhard Rahm,Hong Hai Do,Sabine Massmann 2004 Current schema matching approaches still have to improve for very large and complex schemas. Such schemas are increasingly written in the standard language W3C XML Schema, especially in E-business applications. The high expressive power and versatility of this schema language, in particular its type system and support for distributed schemas and namespaces, introduce new issues. In this paper, we study some of the important problems in matching such large XML schemas. We propose a fragment-oriented match approach to decompose a large match problem into several smaller ones and to reuse previous match results at the level of schema fragments. SIGMOD Record Spatial, Temporal and Spatio-Temporal Databases - Hot Issues and Directions for PhD Research. John F. Roddick,Max J. Egenhofer,Erik G. Hoel,Dimitris Papadias,Betty Salzberg 2004 Spatial, Temporal and Spatio-Temporal Databases - Hot Issues and Directions for PhD Research. SIGMOD Record From semantic integration to semantics management: case studies and a way forward. Arnon Rosenthal,Leonard J. Seligman,Scott Renner 2004 "For meaningful information exchange or integration, providers and consumers need compatible semantics between source and target systems. It is widely recognized that achieving this semantic integration is very costly. Nearly all the published research concerns how system integrators can discover and exploit semantic knowledge in order to better share data among the systems they already have. This research is very important, but to make the greatest impact, we must go beyond after-the-fact semantic integration among existing systems, to actively guiding semantic choices in new ontologies and systems - e.g., what concepts should be used as descriptive vocabularies for existing data, or as definitions for newly built systems. The goal is to ease data sharing for both new and old systems, to ensure that needed data is actually collected, and to maximize over time the business value of an enterprise's information systems." SIGMOD Record Reminiscences on Influential Papers. Kenneth A. Ross,Peter A. Boncz,Ihab F. Ilyas,Volker Markl,Vasilis Vassalos 2004 Reminiscences on Influential Papers. SIGMOD Record XPath query containment. Thomas Schwentick 2004 XPath query containment. SIGMOD Record Science of design for information systems: report of the NSF workshop, Seattle, 2003. 2004 The Workshop on Science of Design for Information Systems (SDIS2003) was held in Seattle, September 16 and 17, 2003. It was funded through a grant from the National Science Foundation, with the goal of assessing the state-of-the-art in information systems design and suggesting promising directions for future research and development in this critical area. This short report is intended to provide an overview of the workshop report for the SIGMOD audience. In summary, we believe that there is a need to develop a new set of methodologies for information system design that cover advanced aspects of such systems. In particular, we are interested in techniques that offer guidance in the design of information systems that integrate data from multiple sources, handle dynamic aspects of the system (e.g., rapidly changing data, tracking provenance, version management), include aspects influenced by data location (e.g., cached objects and queries, peer-to-peer data sharing), model process-oriented issues (e.g., workflows, web-services), and account for the security and privacy of the data.
The interested reader can find a full version of this report at the workshop website, www.cs.wisc.edu/sdis03. SIGMOD Record A Distributed Database for BioMolecular Images. Ambuj K. Singh,B. S. Manjunath,Robert F. Murphy 2004 A Distributed Database for BioMolecular Images. SIGMOD Record TODS Special Issues. Richard T. Snodgrass 2004 TODS Special Issues. SIGMOD Record Developments at ACM TODS. Richard T. Snodgrass 2004 Developments at ACM TODS. SIGMOD Record Changes to the ACM TODS Editorial Board. Richard T. Snodgrass 2004 Changes to the ACM TODS Editorial Board. SIGMOD Record Branding Yourself. Richard T. Snodgrass,Merrie Brucks 2004 "Out here in the western US, a ""brand"" is a distinctive symbol placed on livestock (cattle, horses) to indicate their ranch of origin. Brands originated in the 1800's to deter thieves and to place lost animals. The brand is applied by heating an iron template to red-hot and then burning the brand into the hide of the animal. In that context, the title of this essay may be quite disturbing." SIGMOD Record Report on the 7th EDBT Summer School: XML and Databases. Riccardo Torlone,Paolo Atzeni 2004 Report on the 7th EDBT Summer School: XML and Databases. SIGMOD Record A secure hierarchical model for sensor network. Malik Ayed Tubaishat,Jian Yin,Biswajit Panja,Sanjay Kumar Madria 2004 In a distributed sensor network, a large number of sensors are deployed which communicate among themselves to self-organize into a wireless ad hoc network. We propose an energy-efficient level-based hierarchical system. We compromise between the energy consumption and the shortest path route by utilizing the number of neighbors (NBR) of a sensor and its level in the hierarchical clustering. In addition, we design a Secure Routing Protocol for Sensor Networks (SRPSN) to safeguard the data packet passing on the sensor networks under different types of attacks. We build the secure route from the source node to the sink node. The sink node is guaranteed to receive correct information using our SRPSN. We also propose a group key management scheme, which contains group communication policies, group membership requirements and an algorithm for generating a distributed group key for secure communication. SIGMOD Record Ontologies and Semantics for Seamless Connectivity. Michael Uschold,Michael Grüninger 2004 The goal of having networks of seamlessly connected people, software agents and IT systems remains elusive. Early integration efforts focused on connectivity at the physical and syntactic layers. Great strides were made; there are many commercial tools available, for example to assist with enterprise application integration. It is now recognized that physical and syntactic connectivity is not adequate. A variety of research systems have been developed addressing some of the semantic issues. In this paper, we argue that ontologies in particular and semantics-based technologies in general will play a key role in achieving seamless connectivity. We give a detailed introduction to ontologies, summarize the current state of the art for applying ontologies to achieve semantic connectivity and highlight some key challenges. SIGMOD Record State-of-the-art in privacy preserving data mining. Vassilios S. Verykios,Elisa Bertino,Igor Nai Fovino,Loredana Parasiliti Provenza,Yücel Saygin,Yannis Theodoridis 2004 We provide here an overview of the new and rapidly emerging research area of privacy preserving data mining.
We also propose a classification hierarchy that sets the basis for analyzing the work which has been performed in this context. A detailed review of the work accomplished in this area is also given, along with the coordinates of each work to the classification hierarchy. A brief evaluation is performed, and some initial conclusions are made. SIGMOD Record Land below a DBMS. Kaladhar Voruganti,Jai Menon,Sandeep Gopisetty 2004 Land below a DBMS. SIGMOD Record Robust key establishment in sensor networks. Yongge Wang 2004 Secure communication guaranteeing reliability, authenticity, and privacy in sensor networks with active adversaries is a challenging research problem since asymmetric key cryptosystems are not suitable for sensor nodes with limited computation and communication capabilities. In most proposed secure communication protocols, sensor nodes need to contact the base station to get a session key first if two sensor nodes want to establish a secure communication channel (e.g., SPINS). In several environments, this may be impractical. In this paper, we study key agreement protocols for which two sensor nodes (who do not necessarily have a shared key from the key predistribution phase) could establish a secure communication channel against active adversaries (e.g., denial of service attacks) without the involvement of the base station. SIGMOD Record Report on the 2004 SIGMOD Conference. Gerhard Weikum,Patrick Valduriez 2004 Report on the 2004 SIGMOD Conference. SIGMOD Record Interview with Jeffrey Naughton. Marianne Winslett 2004 Interview with Jeffrey Naughton. SIGMOD Record Interview with Phil Bernstein. Marianne Winslett 2004 Interview with Phil Bernstein. SIGMOD Record Interview: C. Mohan Speaks Out. Marianne Winslett,C. Mohan 2004 Interview: C. Mohan Speaks Out. SIGMOD Record Web Services - By G. Alonso, F. Casati, H. Kuno, V. Machiraju. Dirk Wodtke 2004 Web Services - By G. Alonso, F. Casati, H. Kuno, V. Machiraju. ICDE ModelGen: Model Independent Schema Translation. Paolo Atzeni,Paolo Cappellari,Philip A. Bernstein 2005 A customizable and extensible tool is proposed to implement ModelGen, the model management operator that translates a schema from one model to another. A wide family of models is handled, by using a metamodel in which models can be succinctly and precisely described. The approach is novel because the tool exposes the dictionary that stores models, schemas, and the rules used to implement translations. In this way, the transformations can be customized and the tool can be easily extended. ICDE Adaptive Caching for Continuous Queries. Shivnath Babu,Kamesh Munagala,Jennifer Widom,Rajeev Motwani 2005 We address the problem of executing continuous multiway join queries in unpredictable and volatile environments. Our query class captures windowed join queries in data stream systems as well as conventional maintenance of materialized join views. Our adaptive approach handles streams of updates whose rates and data characteristics may change over time, as well as changes in system conditions such as memory availability. In this paper we focus specifically on the problem of adaptive placement and removal of caches to optimize join performance. Our approach automatically considers conventional tree-shaped join plans with materialized subresults at every intermediate node, subresult-free MJoins, and the entire spectrum between them. 
We provide algorithms for selecting caches, monitoring their costs and benefits in current conditions, allocating memory to caches, and adapting as conditions change. All of our algorithms are implemented in the STREAM prototype Data Stream Management System and a thorough experimental evaluation is included. ICDE Progressive Distributed Top k Retrieval in Peer-to-Peer Networks. Wolf-Tilo Balke,Wolfgang Nejdl,Wolf Siberski,Uwe Thaden 2005 Query processing in traditional information management systems has moved from an exact match model to more flexible paradigms allowing cooperative retrieval by aggregating the database objects' degree of match for each different query predicate and returning the best matching objects only. In peer-to-peer systems such strategies are even more important, given the potentially large number of peers, which may contribute to the results. Yet current peer-to-peer research has barely started to investigate such approaches. In this paper we discuss the benefits of best match/top-k queries in the context of distributed peer-to-peer information infrastructures and show how to extend the limited query processing in current peer-to-peer networks by allowing the distributed processing of top-k queries, while maintaining a minimum of data traffic. Relying on a super-peer backbone organized in the HyperCuP topology, we show how to use local indexes for optimizing the necessary query routing and how to process intermediate results in inner network nodes at the earliest possible point in time, cutting down the necessary data traffic within the network. Our algorithm is based on dynamically collected query statistics only; no continuous index update processes are necessary, allowing it to scale easily to large numbers of peers, as well as dynamic additions/deletions of peers. We show that our approach always delivers correct result sets and is optimal in terms of necessary object accesses and data traffic. Finally, we present simulation results for both static and dynamic network environments. ICDE Constructing and Querying Peer-to-Peer Warehouses of XML Resources. Serge Abiteboul,Ioana Manolescu,Nicoleta Preda 2005 We present KADOP, a distributed infrastructure for warehousing XML resources in a peer-to-peer framework. KADOP allows users to build a shared, distributed repository of resources such as XML documents, semantic information about such documents, Web services, and collections of such items. KADOP leverages several existing technologies and models: it uses distributed hash tables as a peer communication layer, and ActiveXML as a model for constructing and querying the resources in the peer network. ICDE Web Services and Service-Oriented Architectures. Gustavo Alonso,Fabio Casati 2005 Web Services and Service-Oriented Architectures. ICDE A Comparative Evaluation of Transparent Scaling Techniques for Dynamic Content Servers. Cristiana Amza,Alan L. Cox,Willy Zwaenepoel 2005 We study several transparent techniques for scaling dynamic content web sites, and we evaluate their relative impact when used in combination. Full transparency implies strong data consistency as perceived by the user, no modifications to existing dynamic content site tiers and no additional programming effort from the user or site administrator upon deployment. We study strategies for scheduling and load balancing queries on a cluster of replicated database back-ends. We also investigate transparent query caching as a means of enhancing database replication.
Our work shows that, on an experimental platform with up to 8 database replicas, the various techniques work in synergy to improve overall scaling for the e-commerce TPC-W benchmark. We rank the techniques necessary for high performance in order of impact as follows. Key among the strategies are scheduling strategies, such as conflict-aware scheduling, that minimize consistency maintenance overheads. The choice of load balancing strategy is less important. Transparent query result caching increases performance significantly at any given cluster size for a mostly-read workload. Its benefits are limited for write-intensive workloads, where content-aware scheduling is the only scaling option. ICDE Extending Relational Database Systems to Automatically Enforce Privacy Policies. Rakesh Agrawal,Paul Bird,Tyrone Grandison,Jerry Kiernan,Scott Logan,Walid Rjaibi 2005 Databases are at the core of successful businesses. Due to the voluminous stores of personal data being held by companies today, preserving privacy has become a crucial requirement for operating a business. This paper proposes how current relational database management systems can be transformed into their privacy-preserving equivalents. Specifically, we present language constructs and an implementation design for fine-grained access control to achieve this goal. ICDE A Framework for High-Accuracy Privacy-Preserving Mining. Shipra Agrawal,Jayant R. Haritsa 2005 To preserve client privacy in the data mining process, a variety of techniques based on random perturbation of individual data records have been proposed recently. In this paper, we present FRAPP, a generalized matrix-theoretic framework of random perturbation, which facilitates a systematic approach to the design of perturbation mechanisms for privacy-preserving mining. Specifically, FRAPP is used to demonstrate that (a) the prior techniques differ only in their choices for the perturbation matrix elements, and (b) a symmetric perturbation matrix with minimal condition number can be identified, maximizing the accuracy even under strict privacy guarantees. We also propose a novel perturbation mechanism wherein the matrix elements are themselves characterized as random variables, and demonstrate that this feature provides significant improvements in privacy at only a marginal cost in accuracy. The quantitative utility of FRAPP, which applies to random-perturbation-based privacy-preserving mining in general, is evaluated specifically with regard to frequent-itemset mining on a variety of real datasets. Our experimental results indicate that, for a given privacy requirement, substantially lower errors are incurred, with respect to both itemset identity and itemset support, as compared to the prior techniques. ICDE Database Architectures for New Hardware. Anastassia Ailamaki 2005 "Thirty years ago, DBMSs stored data on disks and cached recently used data in main memory buffer pools, while designers worried about improving I/O performance and maximizing main memory utilization. Today, however, databases live in multi-level memory hierarchies that include disks, main memories, and several levels of processor caches. Four (often correlated) factors have shifted the performance bottleneck of data-intensive commercial workloads from I/O to the processor and memory subsystem. First, storage systems are becoming faster and more intelligent (now disks come complete with their own processors and caches).
Second, modern database storage managers aggressively improve locality through clustering, hide I/O latencies using prefetching, and parallelize disk accesses using data striping. Third, main memories have become much larger and often hold the application's working set. Finally, the increasing memory/processor speed gap has heightened the importance of processor caches to database performance. This tutorial will first survey the computer architecture and database literature on understanding and evaluating database application performance on modern hardware. We will present approaches and methodologies used to produce time breakdowns when executing database workloads on modern processors. We will contrast traditional methods that use system simulation with the more realistic, yet challenging use of hardware event counters. Then, we will survey techniques proposed in the literature to alleviate the problem and their evaluation. We will emphasize the importance and explain the challenges when determining the optimal data placement on all levels of the memory hierarchy, and contrast this to other approaches such as prefetching data and instructions. Finally, we will discuss open problems and future directions: Is it only the memory subsystem database software architects should worry about? How important are other decisions processors make to database workload behavior? Given the emerging multi-threaded, multi-processor computers with modular, deep cache hierarchies, how feasible is it to create database systems that will adapt to their environment and will automatically take full advantage of the underlying hierarchy?" ICDE Index Support for Frequent Itemset Mining in a Relational DBMS. Elena Baralis,Tania Cerquitelli,Silvia Chiusano 2005 Many efforts have been devoted to coupling data mining activities with relational DBMSs, but a true integration into the relational DBMS kernel has rarely been achieved. This paper presents a novel indexing technique, which represents transactions in a succinct form, appropriate for tightly integrating frequent itemset mining in a relational DBMS. The data representation is complete, i.e., no support threshold is enforced, in order to allow reusing the index for mining itemsets with any support threshold. Furthermore, an appropriate structure of the stored information has been devised, in order to allow selective access to the index blocks necessary for the current extraction phase. The index has been implemented in the PostgreSQL open source DBMS and exploits its physical level access methods. Experiments have been run for various datasets, characterized by different data distributions. The execution time of the frequent itemset extraction task exploiting the index is always comparable with and sometimes faster than a C++ implementation of the FP-growth algorithm accessing data stored in a flat file. ICDE Distributed/Heterogeneous Query Processing in Microsoft SQL Server. José A. Blakeley,Conor Cunningham,Nigel Ellis,Balaji Rathakrishnan,Ming-Chuan Wu 2005 This paper presents an architecture overview of the distributed, heterogeneous query processor (DHQP) in the Microsoft SQL Server database system to enable queries over a large collection of diverse data sources. The paper highlights three salient aspects of the architecture. First, the system introduces well-defined abstractions such as connections, commands, and rowsets that enable sources to plug into the system. These abstractions are formalized by the OLE DB data access interfaces.
The generality of OLE DB and its broad industry adoption enable our system to reach a very large collection of diverse data sources ranging from personal productivity tools, to database management systems, to file system data. Second, the DHQP is built into the relational optimizer and execution engine of the system. This enables DH queries and updates to benefit from the cost-based algebraic transformations and execution strategies available in the system. Finally, the architecture is inherently extensible to support new data sources as they emerge, and serves as a key extensibility point for the relational engine to add new features such as full-text search and distributed partitioned views. ICDE Data Privacy through Optimal k-Anonymization. Roberto J. Bayardo Jr.,Rakesh Agrawal 2005 Data Privacy through Optimal k-Anonymization. ICDE Fuzzy Spatial Objects: An Algebra Implementation in SECONDO. Thomas Behr,Ralf Hartmut Güting 2005 This paper describes a data model for fuzzy spatial objects implemented as an algebra module in SECONDO. Furthermore, the graphical representation of such objects is discussed. ICDE Practical Data Management Techniques for Vehicle Tracking Data. Sotiris Brakatsoulas,Dieter Pfoser,Nectaria Tryfona 2005 A novel data source for assessing traffic conditions is floating car data (FCD) in the form of vehicle tracking data, or, in database terms, trajectory data. This work proposes practical data management techniques including data pre-processing, data modeling and indexing to support the analysis and the data mining of vehicle tracking data. ICDE Full-fledged Algebraic XPath Processing in Natix. Matthias Brantner,Sven Helmer,Carl-Christian Kanne,Guido Moerkotte 2005 We present the first complete translation of XPath into an algebra, paving the way for a comprehensive, state-of-the-art XPath (and later on, XQuery) compiler based on algebraic optimization techniques. Our translation includes all XPath features such as nested expressions, position-based predicates and node-set functions. The translated algebraic expressions can be executed using the proven, scalable, iterator-based approach, as we demonstrate in the form of a corresponding physical algebra in our native XML DBMS Natix. A first glance at performance results shows that even without further optimization of the expressions, we provide a competitive evaluation technique for XPath queries. ICDE Privacy and Ownership Preserving of Outsourced Medical Data. Elisa Bertino,Beng Chin Ooi,Yanjiang Yang,Robert H. Deng 2005 The demand for the secondary use of medical data is increasing steadily to allow for the provision of better quality health care. Two important issues pertaining to this sharing of data have to be addressed: one is the privacy protection for individuals referred to in the data; the other is copyright protection over the data. In this paper, we present a unified framework that seamlessly combines techniques of binning and digital watermarking to attain the dual goals of privacy and copyright protection. Our binning method is built upon an earlier approach of generalization and suppression by allowing a broader concept of generalization. To ensure data usefulness, we propose constraining binning by usage metrics that define maximal allowable information loss, and the metrics can be enforced off-line. Our watermarking algorithm watermarks the binned data in a hierarchical manner by leveraging on the very nature of the data.
The method is resilient to the generalization attack that is specific to the binned data, as well as other attacks intended to destroy the inserted mark. We prove that watermarking cannot adversely interfere with binning, and we have implemented the framework. Experiments were conducted, and the results show the robustness of the proposed framework. ICDE Schema Matching using Duplicates. Alexander Bilke,Felix Naumann 2005 Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach. ICDE The XML Stream Query Processor SPEX. François Bry,Fatih Coskun,Serap Durmaz,Tim Furche,Dan Olteanu,Markus Spannagel 2005 The XML Stream Query Processor SPEX. ICDE A Unified Framework for Monitoring Data Streams in Real Time. Ahmet Bulut,Ambuj K. Singh 2005 "Online monitoring of data streams poses a challenge in many data-centric applications, such as telecommunications networks, traffic management, trend-related analysis, web-click streams, intrusion detection, and sensor networks. Mining techniques employed in these applications have to be efficient in terms of space usage and per-item processing time while providing a high quality of answers to (1) aggregate monitoring queries, such as finding surprising levels of a data stream, detecting bursts, and to (2) similarity queries, such as detecting correlations and finding interesting patterns. The most important aspect of these tasks is their need for flexible query lengths, i.e., it is difficult to set the appropriate lengths a priori. For example, bursts of events can occur at variable temporal modalities from hours to days to weeks. Correlated trends can occur at various temporal scales. The system has to discover ""interesting"" behavior online and monitor over flexible window sizes. In this paper, we propose a multi-resolution indexing scheme, which handles variable length queries efficiently. We demonstrate the effectiveness of our framework over existing techniques through an extensive set of experiments." ICDE Vectorizing and Querying Large XML Repositories. Peter Buneman,Byron Choi,Wenfei Fan,Robert Hutchison,Robert Mann,Stratis Viglas 2005 Vertical partitioning is a well-known technique for optimizing query performance in relational databases. An extreme form of this technique, which we call vectorization, is to store each column separately. We use a generalization of vectorization as the basis for a native XML store. The idea is to decompose an XML document into a set of vectors that contain the data values and a compressed skeleton that describes the structure.
In order to query this representation and produce results in the same vectorized format, we consider a practical fragment of XQuery and introduce the notion of query graphs and a novel graph reduction algorithm that allows us to leverage relational optimization techniques as well as to reduce the unnecessary loading of data vectors and decompression of skeletons. A preliminary experimental study based on some scientific and synthetic XML data repositories in the order of gigabytes supports the claim that these techniques are scalable and have the potential to provide performance comparable with established relational database technology. ICDE On the Signature Trees and Balanced Signature Trees. Yangjun Chen 2005 Advanced database application areas, such as computer aided design, office automation, digital libraries, data-mining as well as hypertext and multimedia systems need to handle complex data structures with set-valued attributes, which can be represented as bit strings, called signatures. A set of signatures can be stored in a file, called a signature file. In this paper, we propose a new method to organize a signature file into a tree structure, called a signature tree, to speed up the signature file scanning and query evaluation. ICDE PnP: Parallel And External Memory Iceberg Cubes. Ying Chen,Frank K. H. A. Dehne,Todd Eavis,Andrew Rau-Chaplin 2005 PnP: Parallel And External Memory Iceberg Cubes. ICDE ViteX: A Streaming XPath Processing System. Yi Chen,Susan B. Davidson,Yifeng Zheng 2005 We present ViteX, an XPath processing system on XML streams with polynomial time complexity. ViteX uses a polynomial-space data structure to encode an exponential number of pattern matches (in the query size) which are required to process queries correctly during a single sequential scan of XML. Then ViteX computes query solutions by probing the data structure in a lazy fashion without enumerating pattern matches. ICDE Assuring Security Properties in Third-party Architectures. Barbara Carminati,Elena Ferrari,Elisa Bertino 2005 Assuring Security Properties in Third-party Architectures. ICDE Efficient Algorithms for Pattern Matching on Directed Acyclic Graphs. Li Chen,Amarnath Gupta,M. Erdem Kurul 2005 Recently graph data models have become increasingly popular in many scientific fields. Efficient query processing over such data is critical. Existing works often rely on index structures that store pre-computed transitive relations to achieve efficient graph matching. In this paper, we present a family of stack-based algorithms to handle path and twig pattern queries for directed acyclic graphs (DAGs) in particular. With the worst-case space cost linearly bounded by the number of edges in the graph, our algorithms achieve a quadratic runtime complexity in the average size of the query variable bindings. This is optimal among the navigation-based graph matching algorithms. ICDE Panel on Business Process Intelligence. Malú Castellanos,Fabio Casati 2005 Panel on Business Process Intelligence. ICDE iBOM: A Platform for Intelligent Business Operation Management. Malú Castellanos,Fabio Casati,Ming-Chien Shan,Umeshwar Dayal 2005 As IT systems become more and more complex and as business operations become increasingly automated, there is a growing need from business managers to have better control on business operations and on how these are aligned with business goals. 
This paper describes iBOM, a platform for business operation management developed by HP that allows users to i) analyze operations from a business perspective and manage them based on business goals; ii) define business metrics, perform intelligent analysis on them to understand causes of undesired metric values, and predict future values; iii) optimize operations to improve business metrics. A key aspect is that all this functionality is readily available almost at the click of the mouse. The description of the work proceeds from some specific requirements to the solution developed to address them. We also show that the platform is indeed general, as demonstrated by subsequent deployments in domains other than finance. ICDE Paradigm Shift to New DBMS Architectures: Research Issues and Market Needs. Sang Kyun Cha,Anastassia Ailamaki,Yoshinori Hara,Vishal Sikka 2005 Paradigm Shift to New DBMS Architectures: Research Issues and Market Needs. ICDE Efficient Processing of Skyline Queries with Partially-Ordered Domains. Chee Yong Chan,Pin-Kwang Eng,Kian-Lee Tan 2005 Efficient Processing of Skyline Queries with Partially-Ordered Domains. ICDE GPIVOT: Efficient Incremental Maintenance of Complex ROLAP Views. Songting Chen,Elke A. Rundensteiner 2005 Data warehousing and on-line analytical processing (OLAP) are essential for decision support applications. Common OLAP operations include for example drill down, roll up, pivot and unpivot. Typically, such queries are fairly complex and are often executed over huge volumes of data. The solution in practice is to use materialized views to reduce the query cost. Utilizing materialized views that incorporate not just traditional simple SELECT-PROJECT-JOIN operators but also complex OLAP operators such as pivot and unpivot is crucial to improving OLAP query performance, but is as of now an unexplored topic. In this work, we demonstrate that the efficient maintenance of views with pivot and unpivot operators requires the definition of more generalized operators, which we call GPIVOT and GUNPIVOT. We propose rewriting rules, combination rules and propagation rules for such operators. We also design a novel view maintenance framework for applying these rules to obtain an efficient maintenance plan. Our query transformation rules are thus dual purpose, serving both view maintenance and query optimization. This paves the way for the inclusion of the GPIVOT and GUNPIVOT into any DBMS engine. ICDE An Enhanced Query Model for Soccer Video Retrieval Using Temporal Relationships. Shu-Ching Chen,Mei-Ling Shyu,Na Zhao 2005 An Enhanced Query Model for Soccer Video Retrieval Using Temporal Relationships. ICDE Change Tolerant Indexing for Constantly Evolving Data. Reynold Cheng,Yuni Xia,Sunil Prabhakar,Rahul Shah 2005 Index structures are designed to optimize search performance, while at the same time supporting efficient data updates. Although not explicit, existing index structures are typically based upon the assumption that the rate of updates will be small compared to the rate of querying. This assumption is not valid in streaming data environments such as sensor and moving object databases, where updates are received incessantly. In fact, for many applications, the rate of updates may well exceed the rate of querying. In such environments, index structures suffer from poor performance due to the large overhead of keeping the index updated with the latest data. Recent efforts at indexing moving object data assume objects move in a restrictive manner (e.g.
in straight lines with constant velocity). In this paper, we propose an index structure explicitly designed to perform well for both querying and updating. We assume a more relaxed model of object movement. In particular, we observe that objects often stay in a region (e.g., a building) for an extended amount of time, and exploit this phenomenon to optimize an index for both updates and queries. The paper is developed with the example of R-trees, but the ideas can be extended to other index structures as well. We present the design of the Change Tolerant R-tree, and an experimental evaluation. ICDE Architecture and Performance of Application Networking in Pervasive Content Delivery. Mu Su,Chi-Hung Chi 2005 This paper proposes the Application Networking (App.Net) architecture, which enables a Web server to deploy an intermediate response with associated service logic to edge proxies in the form of a workflow. The recipient proxy is allowed to instantiate the workflow using local services or by downloading mobile applications from remote sites. The final response is the output from the workflow execution fed with the intermediate result. In this paper, we define a workflow modulation method to represent service logic and manipulate it in uniform operations. An App.Net caching method is designed to cache intermediate results as well as final response presentations. Based on a cost model measuring bandwidth usage, we designed workflow placement algorithms to deploy intermediate response objects to the proxy in optimal or greedy ways. Finally, we developed an App.Net prototype and implemented a wide range of applications that exist on the Web. Our simulation results show that using workflow and dynamic application deployment, App.Net can achieve better performance than conventional methods. ICDE Design, Implementation, and Evaluation of a Repairable Database Management System. Tzi-cker Chiueh,Dhruv Pilania 2005 "Although conventional database management systems are designed to tolerate hardware and to a lesser extent even software errors, they cannot protect themselves against syntactically correct and semantically damaging transactions, which could arise because of malicious attacks or honest mistakes. The lack of fast post-intrusion or post-error damage repair in modern DBMSs results in a longer Mean Time to Repair (MTTR) and sometimes permanent data loss that could have been avoided by more intelligent repair mechanisms. In this paper, we describe the design and implementation of Phoenix - a system that significantly improves the efficiency and precision of a database damage repair process after an intrusion or operator error and thus, increases the overall database system availability. The two key ideas underlying Phoenix are (1) maintaining persistent inter-transaction dependency information at run time to allow selective undo of database transactions that are considered ""infected"" by the intrusion or error in question and (2) exploiting information present in standard database logs for fast selective undo. Performance measurements on a fully operational Phoenix prototype, which is based on the PostgreSQL DBMS, demonstrate that Phoenix incurs a response time and a throughput penalty of less than 5% and 8%, respectively, under the TPC-C benchmark, but it can speed up the post-intrusion database repair process by at least an order of magnitude when compared with a manual repair process." ICDE Effective Computation of Biased Quantiles over Data Streams. Graham Cormode,Flip Korn,S.
Muthukrishnan,Divesh Srivastava 2005 "Skew is prevalent in many data sources such as IP traffic streams. To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two problems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the ""high-biased"" and the ""targeted"" quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures. Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams." ICDE QoSMig: Adaptive Rate-Controlled Migration of Bulk Data in Storage Systems. Koustuv Dasgupta,Sugata Ghosal,Rohit Jain,Upendra Sharma,Akshat Verma 2005 Logical reorganization of data and requirements of differentiated QoS in information systems necessitate bulk data migration by the underlying storage layer. Such data migration needs to ensure that regular client I/Os are not impacted significantly while migration is in progress. We formalize the data migration problem in a unified admission control framework that captures both the performance requirements of client I/Os and the constraints associated with migration. We propose an adaptive rate-control based data migration methodology, QoSMig, that achieves the optimal client performance in a differentiated QoS setting, while ensuring that the specified migration constraints are met. QoSMig uses both long term averages and short term forecasts of client traffic to compute a migration schedule. We present an architecture based on Service Level Enforcement Discipline for Storage (SLEDS) that supports QoSMig. Our trace-driven experimental study demonstrates that QoSMig provides significantly better I/O performance as compared to existing migration methodologies. ICDE Exploiting Correlated Attributes in Acquisitional Query Processing. Amol Deshpande,Carlos Guestrin,Wei Hong,Samuel Madden 2005 Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high per-attribute acquisition cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce low-cost attributes whose observation will allow the query processor to better estimate the selectivity of these expensive predicates. 
In particular, we show how to build conditional plans that branch into one or more sub-plans, each with a different ordering for the expensive query predicates, based on the runtime observation of low-cost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate low-cost attributes as an optimization problem. We describe an exponential time algorithm for finding such optimal plans, and describe a polynomial-time heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several real-world sensor-network data sets, showing several-times performance increases for a variety of queries versus traditional optimization techniques. ICDE MoDB: Database System for Synthesizing Human Motion. Timothy Edmunds,S. Muthukrishnan,Subarna Sadhukhan,Shinjiro Sueda 2005 "Enacting and capturing real motion for all potential scenarios is prohibitively expensive; hence, there is a great demand to synthetically generate realistic human motion. However, it is a central challenge in character animation to synthetically generate a large sequence of smooth human motion. We present a novel, database-centric solution to address this challenge. We demonstrate a method of generating long sequences of motion by performing various similarity-based ""joins"" on a database of captured motion sequences. This demo illustrates our system (MoDB) and showcases the process of encoding captured motion into relational data and generating realistic motion by concatenating sub-sequences of the captured data according to feasibility metrics. The demo features an interactive character that moves towards user-specified targets; the character's motion is generated by relying on the real time performance of the database for indexing and selection of feasible sub-sequences." ICDE Batched Processing for Information Filters. Peter M. Fischer,Donald Kossmann 2005 This paper describes batching, a novel technique to improve the throughput of an information filter (e.g. message broker or publish & subscribe system). Rather than processing each message individually, incoming messages are reordered, grouped and a whole group of similar messages is processed. This paper presents alternative strategies to do batching. Extensive performance experiments are conducted on those strategies in order to compare their tradeoffs. ICDE Top-Down Specialization for Information and Privacy Preservation. Benjamin C. M. Fung,Ke Wang,Philip S. Yu 2005 Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that the quality of classification can be preserved even for highly restrictive privacy requirements.
This work has great applicability to both public and private sectors that share information for mutual benefits and productivity. ICDE Text Classification without Labeled Negative Documents. Gabriel Pui Cheong Fung,Jeffrey Xu Yu,Hongjun Lu,Philip S. Yu 2005 This paper presents a new solution for the problem of building a text classifier with a small set of labeled positive documents (P) and a large set of unlabeled documents (U). Here, the unlabeled documents are mixed with both positive and negative documents. In other words, no document is labeled as negative. This makes the task of building a reliable text classifier challenging. In general, the existing approaches for solving this kind of problem use a two-step approach: i) extract the negative documents (N) from U; and ii) build a classifier based on P and N. However, none of the reported studies tries to further extract any positive documents (P') from U. Intuitively, extracting P' from U will increase the reliability of the classifier. However, extracting P' from U is difficult. A document in U that possesses some of the features exhibited in P does not necessarily mean that it is a positive document, and vice versa. Extracting positive documents is very sensitive, because the extracted positive samples may become noise. The very large size of U and the very high diversity exhibited there also contribute to the difficulty of extracting any positive documents. In this paper, we propose a partition-based heuristic which aims at extracting both the positive and negative documents in U. Extensive experiments based on three benchmarks are conducted. The favorable results indicate that our proposed heuristic outperforms all of the existing approaches significantly, especially in the case where the size of P is extremely small. ICDE Adlib: A Self-Tuning Index for Dynamic P2P Systems. Prasanna Ganesan,Qixiang Sun,Hector Garcia-Molina 2005 Adlib: A Self-Tuning Index for Dynamic P2P Systems. ICDE Robust Identification of Fuzzy Duplicates. Surajit Chaudhuri,Venkatesh Ganti,Rajeev Motwani 2005 Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm. ICDE Clustering Aggregation. Aristides Gionis,Heikki Mannila,Panayiotis Tsaparas 2005 We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute can be viewed as a clustering of the input rows where rows are grouped together if they take the same value on that attribute. Clustering aggregation can also be used as a metaclustering method to improve the robustness of clustering by combining the output of multiple algorithms.
Furthermore, the problem formulation does not require a priori information about the number of clusters; it is naturally determined by the optimization function. In this article, we give a formal statement of the clustering aggregation problem, and we propose a number of algorithms. Our algorithms make use of the connection between clustering aggregation and the problem of correlation clustering. Although the problems we consider are NP-hard, for several of our methods, we provide theoretical guarantees on the quality of the solutions. Our work provides the best deterministic approximation algorithm for the variation of the correlation clustering problem we consider. We also show how sampling can be used to scale the algorithms for large datasets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions. ICDE Bloom Filter-based XML Packets Filtering for Millions of Path Queries. Xueqing Gong,Ying Yan,Weining Qian,Aoying Zhou 2005 Bloom Filter-based XML Packets Filtering for Millions of Path Queries. ICDE Spatio-Temporal Databases in Practice: Directly Supporting Previously Developed Land Data Using Tripod. Tony Griffiths,Alvaro A. A. Fernandes,Norman W. Paton,Seung-Hyun Jeong,Nassima Djafri,Keith T. Mason 2005 Spatio-Temporal Databases in Practice: Directly Supporting Previously Developed Land Data Using Tripod. ICDE Energy-efficient Data Organization and Query Processing in Sensor Networks. Ramakrishna Gummadi,Xin Li,Ramesh Govindan,Cyrus Shahabi,Wei Hong 2005 Recent sensor networks research has produced a class of data storage and query processing techniques called Data-Centric Storage that leverages locality-preserving distributed indexes to efficiently answer multi-dimensional range and range-aggregate queries. These distributed indexes offer a rich design space of a) logical decompositions of sensor relation schema into indexes, as well as b) physical mappings of these indexes onto sensors. In this paper, we explore this space for energy-efficient data organizations (logical and physical mappings of tuples and attributes to sensor nodes) and devise purely local query optimization techniques for processing queries that span such decomposed relations. ICDE Efficient Inverted Lists and Query Algorithms for Structured Value Ranking in Update-Intensive Relational Databases. Lin Guo,Jayavel Shanmugasundaram,Kevin S. Beyer,Eugene J. Shekita 2005 We propose a new ranking paradigm for relational databases called Structured Value Ranking (SVR). SVR uses structured data values to score (rank) the results of keyword search queries over text columns. Our main contribution is a new family of inverted list indices and associated query algorithms that can support SVR efficiently in update-intensive databases, where the structured data values (and hence the scores of documents) change frequently. Our experimental results on real and synthetic data sets using BerkeleyDB show that we can support SVR efficiently in relational databases. ICDE SECONDO: An Extensible DBMS Platform for Research Prototyping and Teaching. Ralf Hartmut Güting,Victor Teixeira de Almeida,Dirk Ansorge,Thomas Behr,Zhiming Ding,Thomas Höse,Frank Hoffmann,Markus Spiekermann,Ulrich Telle 2005 SECONDO: An Extensible DBMS Platform for Research Prototyping and Teaching. ICDE THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches. 
Joachim Hammer,Michael Stonebraker,Oguzhan Topsakal 2005 THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches. ICDE Cache-Conscious Automata for XML Filtering. Bingsheng He,Qiong Luo,Byron Choi 2005 Hardware cache behavior is an important factor in the performance of memory-resident, data-intensive systems such as XML filtering engines. A key data structure in several recent XML filters is the automaton, which is used to represent the long-running XML queries in the main memory. In this paper, we study the cache performance of automaton-based XML filtering through analytical modeling and system measurement. Furthermore, we propose a cache-conscious automaton organization technique, called the hot buffer, to improve the locality of automaton state transitions. Our results show that (1) our cache performance model for XML filtering automata is highly accurate and (2) the hot buffer improves the cache performance as well as the overall performance of automaton-based XML filtering. ICDE Asymmetric Batch Incremental View Maintenance. Hao He,Junyi Xie,Jun Yang,Hai Yu 2005 "Incremental view maintenance has found a growing number of applications recently, including data warehousing, continuous query processing, publish/subscribe systems, etc. Batch processing of base table modifications, when applicable, can be much more efficient than processing individual modifications one at a time. In this paper, we tackle the problem of finding the most efficient batch incremental maintenance strategy under a refresh response time constraint; that is, at any point in time, the system, upon request, must be able to bring the view up to date within a specified amount of time. The traditional approach is to process all batched modifications relevant to the view whenever the constraint is violated. However, we observe that there often exists natural asymmetry among different components of the maintenance cost; for example,modifications on one base table might be cheaper to process than those on another base table because of some index. We exploit such asymmetries using an unconventional strategy that selectively processes modifications on some base tables while keeping batching others. We present a series of analytical results leading to the development of practical algorithms that approximate an ""oracle algorithm"" with perfect knowledge of the future. With experiments on a TPC-R database, we demonstrate that our strategy offers substantial performance gains over traditional deferred view maintenance techniques." ICDE Towards Building a MetaQuerier: Extracting and Matching Web Query Interfaces. Bin He,Zhen Zhang,Kevin Chen-Chuan Chang 2005 Towards Building a MetaQuerier: Extracting and Matching Web Query Interfaces. ICDE Fast Approximate Similarity Search in Extremely High-Dimensional Data Sets. Michael E. Houle,Jun Sakuma 2005 This paper introduces a practical index for approximate similarity queries of large multi-dimensional data sets: the spatial approximation sample hierarchy (SASH). A SASH is a multi-level structure of random samples, recursively constructed by building a SASH on a large randomly selected sample of data objects, and then connecting each remaining object to several of their approximate nearest neighbors from within the sample. Queries are processed by first locating approximate neighbors within the sample, and then using the pre-established connections to discover neighbors within the remainder of the data set. 
The SASH index relies on a pairwise distance measure, but otherwise makes no assumptions regarding the representation of the data. Experimental results are provided for query-by-example operations on protein sequence, image, and text data sets, including one consisting of more than 1 million vectors spanning more than 1.1 million terms, far in excess of what spatial search indices can handle efficiently. For sets of this size, the SASH can return a large proportion of the true neighbors roughly 2 orders of magnitude faster than sequential search. ICDE Querying and Visualizing Gridded Datasets for e-Science. Bill Howe,David Maier 2005 Querying and Visualizing Gridded Datasets for e-Science. ICDE Rank-Aware Query Processing and Optimization. Ihab F. Ilyas,Walid G. Aref 2005 Rank-Aware Query Processing and Optimization. ICDE RDF Aggregate Queries and Views. Edward Hung,Yu Deng,V. S. Subrahmanian 2005 Resource Description Framework (RDF) is a rapidly expanding web standard. RDF databases attempt to track the massive amounts of web data and services available. In this paper, we study the problem of aggregate queries. We develop an algorithm to compute answers to aggregate queries over RDF databases and algorithms to maintain views involving those aggregates. Though RDF data can be stored in a standard relational DBMS (and hence we can execute standard relational aggregate queries and view maintenance methods on them), we show experimentally that our algorithms that operate directly on the RDF representation exhibit significantly superior performance. ICDE Proactive Caching for Spatial Queries in Mobile Environments. Haibo Hu,Jianliang Xu,Wing Sing Wong,Baihua Zheng,Dik Lun Lee,Wang-Chien Lee 2005 Semantic caching enables mobile clients to answer spatial queries locally by storing the query descriptions together with the results. However, it supports only a limited number of query types, and sharing results among these types is difficult. To address these issues, we propose a proactive caching model which caches the result objects as well as the index that supports these objects as the results. The cached index enables the objects to be reused for all common types of queries. We also propose an adaptive scheme to cache such an index, which further optimizes the query response time for the best user experience. Simulation results show that proactive caching achieves a significant performance gain over page caching and semantic caching in mobile environments where wireless bandwidth and battery are precious resources. ICDE High-Availability Algorithms for Distributed Stream Processing. Jeong-Hyon Hwang,Magdalena Balazinska,Alex Rasin,Ugur Çetintemel,Michael Stonebraker,Stanley B. Zdonik 2005 Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker recovery guarantees. In this paper, we study various recovery guarantees and pertinent recovery techniques that can meet the correctness and performance requirements of stream-processing applications. We discuss the design and algorithmic challenges associated with the proposed recovery techniques and describe how each can provide different guarantees with proper combinations of redundant processing, checkpointing, and remote logging.
Using analysis and simulations, we quantify the cost of our recovery guarantees and examine the performance and applicability of the recovery techniques. We also analyze how the knowledge of query network properties can help decrease the cost of high availability. ICDE Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach. Seung-won Hwang,Kevin Chen-Chuan Chang 2005 Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach. ICDE Modeling and Managing Content Changes in Text Databases. Panagiotis G. Ipeirotis,Alexandros Ntoulas,Junghoo Cho,Luis Gravano 2005 "Large amounts of (often valuable) information are stored in web-accessible text databases. ""Metasearchers"" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use ""survival analysis"" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases." ICDE TRMeister: A DBMS with High-Performance Full-Text Search Functions. Tetsuya Ikeda,Hiroko Mano,Hideo Itoh,Hiroshi Takegawa,Takuya Hiraoka,Shiroh Horibe,Yasushi Ogawa 2005 TRMeister is a DBMS with high-performance full-text search functions. With TRMeister, high-speed full-text search, including high-precision ranking search in addition to Boolean search, is possible. Further, in addition to search, high-speed insert and delete are possible, allowing full-text search to be used in the same way as other types of database search in which data can be searched right after it is inserted. This makes it easy to combine normal attribute search with full-text search and thus easily create text search applications. ICDE DCbot: Exploring the Web as Value-Added Service for Location-Based Applications. Mihály Jakob,Matthias Großmann,Nicola Hönle,Daniela Nicklas 2005 DCbot: Exploring the Web as Value-Added Service for Location-Based Applications. ICDE Network-Based Problem Detection for Distributed Systems. Hisashi Kashima,Tadashi Tsumura,Tsuyoshi Idé,Takahide Nogayama,Ryo Hirade,Hiroaki Etoh,Takeshi Fukuda 2005 We introduce a network-based problem detection framework for distributed systems, which includes a data-mining method for discovering dynamic dependencies among distributed services from transaction data collected from the network, and a novel problem detection method based on the discovered dependencies.
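As a rough illustration of the change-model-driven scheduling idea in the Ipeirotis et al. abstract above, the sketch below derives a refresh interval for a content summary from an assumed exponential change model. The rate, threshold, and function names are illustrative assumptions; this is not the paper's Cox-regression machinery.

```python
import math

def next_refresh_interval(change_rate, max_staleness_prob):
    """How long a content summary can be trusted before it should be refreshed.

    Toy model (assumption): the time until a database's content changes
    substantially is exponentially distributed with rate `change_rate`
    (changes per week). We refresh early enough that the probability the
    summary is already stale stays below `max_staleness_prob`.
    """
    # P(change within t) = 1 - exp(-rate * t) <= p  =>  t <= -ln(1 - p) / rate
    return -math.log(1.0 - max_staleness_prob) / change_rate

# A database observed to change noticeably about once every 10 weeks (rate 0.1),
# refreshed so the summary is stale with probability at most 20%.
print(round(next_refresh_interval(0.1, 0.2), 2))  # ~2.23 weeks
```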
From observed containments of transaction execution time periods, we estimate the probabilities of accidental and non-accidental containments, and build a competitive model for discovering direct dependencies by using a model estimation method based on the online EM algorithm. Utilizing the discovered dependency information, we also propose a hierarchical problem detection framework, where microscopic dependency information is incorporated with a macroscopic anomaly metric that monitors the behavior of the system as a whole. This feature is made possible by employing a network-based design which provides overall information of the system without any impact on the performance. ICDE NFMi: An Inter-domain Network Fault Management System. Qingchun Jiang,Raman Adaikkalavan,Sharma Chakravarthy 2005 Network fault management has been an active research area for a long period of time because of its complexity, and the returns it generates for service providers. However, most fault management systems are currently custom-developed for a particular domain. As communication service providers continuously add greater capabilities and sophistication to their systems in order to meet demands of a growing user population, these systems have to manage a multi-layered network along with its built-in legacy logical processing procedure. Stream processing has been receiving a lot of attention to deal with applications that generate large amounts of data in real-time at varying input rates and to compute functions over multiple streams, such as network fault management. In this paper, we propose an integrated inter-domain network fault management system for such a multi-layered network based on data stream and event processing techniques. We discuss various components in our system and how data stream processing techniques are used to build a flexible system for a sophisticated real-world application. We further identify a number of important issues related to data stream processing during the course of the discussion of our proposed system, which will further extend the boundaries of data stream processing. ICDE Knowledge Discovery from Transportation Network Data. Wei Jiang,Jaideep Vaidya,Zahir Balaporia,Chris Clifton,Brett Banich 2005 Transportation and Logistics are a major sector of the economy, however data analysis in this domain has remained largely in the province of optimization. The potential of data mining and knowledge discovery techniques is largely untapped. Transportation networks are naturally represented as graphs. This paper explores the problems in mining of transportation network graphs: We hope to find how current techniques both succeed and fail on this problem, and from the failures, we hope to present new challenges for data mining. Experimental results from applying both existing graph mining and conventional data mining techniques to real transportation network data are provided, including new approaches to making these techniques applicable to the problems. Reasons why these techniques are not appropriate are discussed. We also suggest several challenging problems to precipitate research and galvanize future work in this area. ICDE A Probabilistic XML Approach to Data Integration. Maurice van Keulen,Ander de Keijzer,Wouter Alink 2005 In mobile and ambient environments, devices need to become autonomous, managing and resolving problems without interference from a user. The database of a (mobile) device can be seen as its knowledge about objects in the "real world".
Data exchange between small and/or large computing devices can be used to supplement and update this knowledge whenever a connection gets established. In many situations, however, data from different data sources referring to the same real world objects may conflict. It is the task of the data management system of the device to resolve such conflicts without interference from a user. In this paper, we take a first step in the development of a probabilistic XML DBMS. The main idea is to drop the assumption that data in the database should be certain: subtrees in XML documents may denote possible views on the real world. We formally define the notion of probabilistic XML tree and several operations thereon. We also present an approach for determining a logical semantics for queries on probabilistic XML data. Finally, we introduce an approach for XML data integration where conflicts are resolved by the introduction of possibilities in the database. ICDE Improving Performance of Cluster-based Secure Application Servers with User-level Communication. Jin-Ha Kim,Gyu Sang Choi,Chita R. Das 2005 Improving Performance of Cluster-based Secure Application Servers with User-level Communication. ICDE RelaxImage: A Cross-Media Meta-Search Engine for Searching Images from Web Based on Query Relaxation. Akihiro Kuwabara,Katsumi Tanaka 2005 "We introduce a cross-media meta-search engine RelaxImage for searching images from Web. Notable features of the RelaxImage are as follows: (1) Each user's keyword query is ""relaxed"", that is, by gradually relaxing the search terms used for image search, we can solve the problem of conventional image search engines such as Google. (2) For searching images, our RelaxImage sends a different keyword-query to each search engine of different media-type. We show several examples of how the relaxation approach works as well as ways that it can be applied. That is, our RelaxImage shows a great improvement for increasing recall ratio without decreasing the precision ratio." ICDE VLEI code: An Efficient Labeling Method for Handling XML Documents in an RDB. Kazuhito Kobayashi,Wenxin Liang,Dai Kobayashi,Akitsugu Watanabe,Haruo Yokota 2005 A number of XML labeling methods have been proposed to store XML documents in relational databases. However, they have a vulnerable point in insertion operations. We propose the Variable Length Endless Insertable (VLEI) code and apply it to XML labeling to reduce the cost of insertion operations. Results of our experiments indicate that a combination of the VLEI code and Dewey order is effective for handling skewed insertions. ICDE Snapshot Queries: Towards Data-Centric Sensor Networks. Yannis Kotidis 2005 In this paper we introduce the idea of snapshot queries for energy efficient data acquisition in sensor networks. Network nodes generate models of their surrounding environment that are used for electing, using a localized algorithm, a small set of representative nodes in the network. These representative nodes constitute a network snapshot and can be used to provide quick approximate answers to user queries while reducing substantially the energy consumption in the network. We present a detailed experimental study of our framework and algorithms, varying multiple parameters like the available memory of the sensor nodes, their transmission range, the network message loss, etc. Depending on the configuration, snapshot queries provide a reduction of up to 90% in the number of nodes that need to participate in a user query.
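To make the representative-node idea from the Kotidis snapshot-queries abstract above concrete, here is a minimal sketch that greedily picks nodes whose readings can stand in for nearby neighbors within an error bound. It is a centralized toy with assumed names (readings, neighbors, eps); the paper itself uses a localized election algorithm and richer per-node models.

```python
def elect_representatives(readings, neighbors, eps):
    """Greedily pick nodes whose readings represent their neighbors within eps.

    readings: {node_id: last_reading}; neighbors: {node_id: set of node_ids}.
    A node is covered once a chosen representative in its neighborhood has a
    reading within eps of its own. Illustration only, not the paper's algorithm.
    """
    uncovered = set(readings)
    reps = []
    while uncovered:
        def coverage(n):
            # Still-uncovered nodes in n's neighborhood (including n) it can answer for.
            return {m for m in (neighbors[n] | {n}) & uncovered
                    if abs(readings[m] - readings[n]) <= eps}
        best = max(uncovered, key=lambda n: len(coverage(n)))
        reps.append(best)
        uncovered -= coverage(best)
    return reps

readings = {1: 20.1, 2: 20.3, 3: 25.0, 4: 24.8}
neighbors = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(elect_representatives(readings, neighbors, eps=0.5))  # e.g. [2, 3]
```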
ICDE Data Stream Query Processing. Nick Koudas,Divesh Srivastava 2005 This tutorial provides a comprehensive and cohesive overview of the key research results in the area of data stream query processing, both for SQL-like and XML query languages. ICDE Personalized Queries under a Generalized Preference Model. Georgia Koutrika,Yannis E. Ioannidis 2005 Query Personalization is the process of dynamically enhancing a query with related user preferences stored in a user profile with the aim of providing personalized answers. The underlying idea is that different users may find different things relevant to a search due to different preferences. Essential ingredients of query personalization are: (a) a model for representing and storing preferences in user profiles, and (b) algorithms for the generation of personalized answers using stored preferences. Modeling the plethora of preference types is a challenge. In this paper, we present a preference model that combines expressivity and concision. In addition, we provide efficient algorithms for the selection of preferences related to a query, and an algorithm for the progressive generation of personalized results, which are ranked based on user interest. Several classes of ranking functions are provided for this purpose. We present results of experiments both synthetic and with real users (a) demonstrating the efficiency of our algorithms, (b) showing the benefits of query personalization, and (c) providing insight as to the appropriateness of the proposed ranking functions. ICDE Improving Data Accessibility For Mobile Clients Through Cooperative Hoarding. Kwong Yuen Lai,Zahir Tari,Peter Bertók 2005 In this paper, we introduce the concept of cooperative hoarding to reduce the risks of cache misses for mobile clients. Cooperative hoarding takes advantage of group mobility behaviour, combined with peer cooperation in ad-hoc mode, to improve hoard performance. Two cooperative hoarding approaches that take into account clients' access frequencies, connection probabilities and cache size when performing hoarding are proposed. Test results show that the proposed methods significantly improve cache hit ratio and reduce query costs compared to existing approaches. ICDE XML Views as Integrity Constraints and their Use in Query Translation. Rajasekar Krishnamurthy,Raghav Kaushik,Jeffrey F. Naughton 2005 "The SQL queries produced in XML-to-SQL query translation are often unnecessarily complex, even for simple input XML queries. In this paper we argue that relational systems can do a better job of XML-to-SQL query translation with the addition of a simple new constraint, which we term the ""lossless from XML"" constraint. Intuitively, this constraint states that a given relational data set resulted from the shredding of an XML document that conformed to a given schema. We illustrate the power of this approach by giving an algorithm that exploits the ""lossless from XML"" constraint to translate path expression queries into efficient SQL, even in the presence of recursive XML schemas. We argue that this approach is likely to be simpler and more effective than the current state of the art in optimizing XML-to-SQL query translation, which involves identifying and declaring multiple complex relational constraints and then reasoning about relational query containment in the presence of these constraints." ICDE Towards an Industrial Strength SQL/XML Infrastructure. Muralidhar Krishnaprasad,Zhen Hua Liu,Anand Manikutty,James W.
Warner,Vikas Arora 2005 XML has become an attractive data processing model for applications. SQL/XML is a SQL standard that integrates XML with SQL. It introduces the XML datatype as a native SQL datatype and defines XML generation functions in the SQL/XML 2003 standard. The goal for the next version of SQL/XML is integrating XQuery with SQL by supporting XQuery embedded inside SQL functions such as the XMLQuery and XMLTable functions. Starting with the 9i database release, Oracle has supported the XML datatype and various operations on XML instances. In this paper, we present the design and implementation strategies of the SQL/XML standard in Oracle XMLDB. We explore the various critical infrastructures needed in the SQL database kernel to support an efficient native XML datatype implementation and the design approaches for efficient generation, query and update of the XML instances. Furthermore, we also illustrate extensions to SQL/XML that makes Oracle XMLDB a truly industrial strength platform for XML processing. ICDE Filter Based Directory Replication and Caching: Algorithms and Performance. Apurva Kumar 2005 Filter Based Directory Replication and Caching: Algorithms and Performance. ICDE DSI: A Fully Distributed Spatial Index for Wireless Data Broadcast. Wang-Chien Lee,Baihua Zheng 2005 DSI: A Fully Distributed Spatial Index for Wireless Data Broadcast. ICDE Load and Network Aware Query Routing for Information Integration. Wen-Syan Li,Vishal S. Batra,Vijayshankar Raman,Wei Han,K. Selçuk Candan,Inderpal Narang 2005 "Current federated systems deploy cost-based query optimization mechanisms; i.e., the optimizer selects a global query plan with the lowest cost to execute. Thus, cost functions influence what remote sources (i.e. equivalent data sources) to access and how federated queries are processed. In most federated systems, the underlying cost model is based on database statistics and query statements; however, the system load of remote sources and the dynamic nature of the network latency in wide area networks are not considered. As a result, federated query processing solutions can not adapt to runtime environment changes, such as network congestion or heavy workloads at remote sources. We present a novel system architecture that deploys a Query Cost Calibrator to calibrate the cost function based on system load and network latency at the remote sources and consequently indirectly ""influences"" query routing and load distribution in federated information systems." ICDE Mining Evolving Customer-Product Relationships in Multi-Dimensional Space. Xiaolei Li,Jiawei Han,Xiaoxin Yin,Dong Xin 2005 Previous work on mining transactional database has focused primarily on mining frequent itemsets, association rules, and sequential patterns. However, interesting relationships between customers and items, especially their evolution with time, have not been studied thoroughly. In this paper, we propose a Gaussian transformation-based regression model that captures time-variant relationships between customers and products. Moreover, since it is interesting to discover such relationships in a multi-dimensional space, an efficient method has been developed to compute multi-dimensional aggregates of such curves in a data cube environment. Our experimental results have demonstrated the promise of the approach. ICDE Stabbing the Sky: Efficient Skyline Computation over Sliding Windows. 
Xuemin Lin,Yidong Yuan,Wei Wang,Hongjun Lu 2005 We consider the problem of efficiently computing the skyline against the most recent N elements in a data stream seen so far. Specifically, we study the n-of-N skyline queries; that is, computing the skyline for the most recent n (n ≤ N) elements. Firstly, we developed an effective pruning technique to minimize the number of elements to be kept. It can be shown that on average storing only O(log^d N) elements from the most recent N elements is sufficient to support the precise computation of all n-of-N skyline queries in a d-dimension space if the data distribution on each dimension is independent. Then, a novel encoding scheme is proposed, together with efficient update techniques, for the stored elements, so that computing an n-of-N skyline query in a d-dimension space takes O(log N + s) time that is reduced to O(d log log N + s) if the data distribution is independent, where s is the number of skyline points. Thirdly, a novel trigger based technique is provided to process continuous n-of-N skyline queries with O(δ) time to update the current result per new data element and O(log s) time to update the trigger list per result change, where δ is the number of element changes from the current result to the new result. Finally, we extend our techniques to computing the skyline against an arbitrary window in the most recent N elements. Besides theoretical performance guarantees, our extensive experiments demonstrated that the new techniques can support on-line skyline query computation over very rapid data streams. ICDE Cost-Driven General Join View Maintenance over Distributed Data Sources. Bin Liu,Elke A. Rundensteiner 2005 Maintaining materialized views that have join conditions between arbitrary pairs of data sources possibly with cycles is critical for many applications. In this work, we model view maintenance as the process of answering a set of inter-related distributed multi-join queries. We illustrate two strategies for maintaining as well as optimizing such general join views. We propose a cost-driven view maintenance framework which generates optimized maintenance plans tuned to a given environmental setting. This framework can significantly improve view maintenance performance especially in a distributed environment. ICDE Data Mining Techniques for Microarray Datasets. Lei Liu,Jiong Yang,Anthony K. H. Tung 2005 Data Mining Techniques for Microarray Datasets. ICDE Increasing the Accuracy and Coverage of SQL Progress Indicators. Gang Luo,Jeffrey F. Naughton,Curt J. Ellmann,Michael Watzke 2005 Recently, progress indicators have been proposed for long-running SQL queries in RDBMSs. Although the proposed techniques work well for a subset of SQL queries, they are preliminary in the sense that (1) they cannot provide non-trivial estimates for some SQL queries, and (2) the provided estimates can be rather imprecise in certain cases. In this paper, we consider the problem of supporting non-trivial progress indicators for a wider class of SQL queries with more precise estimates. We present a set of techniques for achieving this goal. We report an initial implementation of these techniques in PostgreSQL. ICDE Corpus-based Schema Matching. Jayant Madhavan,Philip A. Bernstein,AnHai Doan,Alon Y. Halevy 2005 Schema Matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate.
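The n-of-N skyline abstract by Lin et al. above rests on a simple pruning rule: an element dominated by a newer arrival expires earlier than its dominator, so it can never re-enter any future window skyline and may be discarded. The sketch below shows only that rule over a sliding window of the most recent N points (smaller values assumed better); the paper's interval encoding and trigger lists are not reproduced.

```python
from collections import deque

def dominates(p, q):
    """True if p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

class WindowSkyline:
    """Keep only points that could still be skyline points for some suffix of
    the most recent N elements (illustrative sketch, not the paper's structure)."""
    def __init__(self, N):
        self.N = N
        self.buf = deque()  # (timestamp, point), oldest first
        self.t = 0

    def insert(self, point):
        self.t += 1
        while self.buf and self.buf[0][0] <= self.t - self.N:
            self.buf.popleft()  # expired
        # Points dominated by the newer arrival can never be skyline again.
        self.buf = deque((ts, p) for ts, p in self.buf if not dominates(point, p))
        self.buf.append((self.t, point))

    def skyline(self):
        pts = [p for _, p in self.buf]
        return [p for p in pts if not any(dominates(q, p) for q in pts if q != p)]
```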
Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly due to the lack ofsufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results that demonstrate corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains. ICDE Bypass Caching: Making Scientific Databases Good Network Citizens. Tanu Malik,Randal C. Burns,Amitabh Chaudhary 2005 Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We verify the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation. ICDE Configurable Security Protocols for Multi-party Data Analysis with Malicious Participants. Bradley Malin,Edoardo Airoldi,Samuel Edoho-Eket,Yiheng Li 2005 Standard multi-party computation models assume semi-honest behavior, where the majority of participants implement protocols according to specification, an assumption not always plausible. In this paper we introduce a multi-party protocol for collaborative data analysis when participants are malicious and fail to follow specification. The protocol incorporates a semi-trusted third party, which analyzes encrypted data and provides honest responses that only intended recipients can successfully decrypt. The protocol incorporates data confidentiality by enabling participants to receive encrypted responses tailored to their own encrypted data submissions without revealing plaintext to other participants, including the third party. As opposed to previous models, trust need only be placed on a single participant with no data at stake. Additionally, the proposed protocol is configurable in a way that security features are controlled by independent subprotocols. 
Various combinations of subprotocols allow for a flexible security system, appropriate for a number of distributed data applications, such as secure list comparison. ICDE Predicate Derivation and Monotonicity Detection in DB2 UDB. Timothy Malkemus,Sriram Padmanabhan,Bishwaranjan Bhattacharjee,Leslie Cranston 2005 DB2 Universal Database allows database schema designers to specify generated columns. These generated columns are useful for maintaining rollup hierarchy variables in warehouses (e.g., date, month, quarter). In order for the generated columns to be useful for query processing, queries must automatically make use of such columns when applicable. In particular, query predicates on the original columns should be rewritten to make use of the generated columns. In this paper, we describe two main aspects of this predicate rewriting technique that allows usage of the generated columns for a variety of query predicate types. The first aspect, monotonicity detection, allows for rewrites in the case of range predicates. The second aspect, predicate derivation, is the technique for using generating expressions for query processing. We show the value of this technique for providing significant performance improvement when combined with indexing or multidimensional clustering in DB2. ICDE Finding (Recently) Frequent Items in Distributed Data Streams. Amit Manjhi,Vladislav Shkapenyuk,Kedar Dhamdhere,Christopher Olston 2005 We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naïve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naïve approaches while providing the same error guarantees on answers. ICDE XQuery Midflight: Emerging Database-Oriented Paradigms and a Classification of Research Advances. Ioana Manolescu,Yannis Papakonstantinou 2005 "XQuery processing is one of the prime research topics of the database community. At the same time, XQuery research is still in a ""pre-paradigmatic"" stage, where the conventional symptoms of the stage are observed: It is hard to piece together point efforts into a big picture. Similarities and interplay opportunities between parallel efforts are ""lost in the translation"" across the different paradigms. The goal of this tutorial is to federate the plethora of works, and categorize existing work and future topics along a few reference paradigms that fuse existing results around a reference architecture." ICDE Adaptive Processing of Top-K Queries in XML.
Amélie Marian,Sihem Amer-Yahia,Nick Koudas,Divesh Srivastava 2005 The ability to compute top-k matches to XML queries is gaining importance due to the increasing number of large XML repositories. The efficiency of top-k query evaluation relies on using scores to prune irrelevant answers as early as possible in the evaluation process. In this context, evaluating the same query plan for all answers might be too rigid because, at any time in the evaluation, answers have gone through the same number and sequence of operations, which limits the speed at which scores grow. Therefore, adaptive query processing that permits different plans for different partial matches and maximizes the best scores is more appropriate. In this paper, we propose an architecture and adaptive algorithms for efficiently computing top-k matches to XML queries. Our techniques can be used to evaluate both exact and approximate matches where approximation is defined by relaxing XPath axes. In order to compute the scores of query answers, we extend the traditional tf*idf measure to account for document structure. We conduct extensive experiments on a variety of benchmark data and queries, and demonstrate the usefulness of the adaptive approach for computing top-k queries in XML. ICDE Uncovering Database Access Optimizations in the Middle Tier with TORPEDO. Bruce E. Martin 2005 "A popular architecture for enterprise applications is one of a stateless object-based server accessing persistent data through Object-Relational mapping software. The reported benefits of usingObject-Relational mapping software are increased developer productivity, greater database portability and improved runtime performance over hand-written SQL due to caching. In spite of these supposed benefits, many software architects are suspicious of the ""black box"" nature of O-R mapping software. Discerning how O-R mapping software actually accesses a database is difficult. The Testbed of Object Relational Products for Enterprise Distributed Objects (TORPEDO) is designed to reveal the sophistication of O-R mapping software in accessing databases in single server and clustered environments. TORPEDO defines a set of realistic application level operations that detect significant set of database access optimizations. TORPEDO supports two standard Java APIs for O-R mapping, namely, Container Managed Persistence (CMP 2.0) and Java Data Objects (JDO). TORPEDO also supports the TopLink and Hibernate APIs. There are dozens of commercial and open-source O-R mapping products supporting these APIs. Results from running TORPEDO on different O-R mapping systems are comparable. We provide sample results from running TORPEDO on popular O-R mapping solutions. We describe why the optimizations TORPEDO reveals are important and how the application level operations detect the optimizations" ICDE Integrating Data from Disparate Sources: A Mass Collaboration Approach. Robert McCann,Alexander Kramnik,Warren Shen,Vanitha Varadarajan,Olu Sobulo,AnHai Doan 2005 Integrating Data from Disparate Sources: A Mass Collaboration Approach. ICDE Improving Preemptive Prioritization via Statistical Characterization of OLTP Locking. David T. McWherter,Bianca Schroeder,Anastassia Ailamaki,Mor Harchol-Balter 2005 "OLTP and transactional workloads are increasingly common in computer systems, ranging from e-commerce to warehousing to inventory management. It is valuable to provide priority scheduling in these systems, to reduce the response time for the most important clients, e.g. 
the ""big spenders"". Two-phase locking, commonly used in DBMS, makes prioritization difficult, as transactions wait for locks held by others regardless of priority. Common lock scheduling solutions, including non-preemptive priority inheritance and preemptive abort, have performance drawbacks for TPC-C type workloads. The contributions of this paper are two-fold: (i) We provide a detailed statistical analysis of locking in TPC-C workloads with priorities under several common preemptive and non-preemptive lock prioritization policies. We determine why non-preemptive policies fail tosufficiently help high-priority transactions, and why pre-emptive policies excessively hurt low-priority transactions. (ii) We propose and implement a policy, POW, that provides all the benefits of preemptive prioritization without its penalties." ICDE A Multiresolution Symbolic Representation of Time Series. Vasileios Megalooikonomou,Qiang Wang,Guo Li,Christos Faloutsos 2005 Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the Multiresolution Vector Quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic employing key subsequences and potentially allows the application of text-based retrieval techniques into the similarity analysis of time series. The proposed method is fast and scales linearly with the size of database and the dimensionality. Contrary to the vast majority in the literature that uses the Euclidean distance, MVQ uses a multi-resolution/hierarchical distance function. We performed experiments with real and synthetic data. The proposed distance function consistently outperforms all the major competitors (Euclidean, Dynamic Time Warping, Piecewise Aggregate Approximation) achieving up to 20% better precision/recall and clustering accuracy on the tested datasets. ICDE Bootstrapping Semantic Annotations for Content-Rich HTML Documents. Saikat Mukherjee,I. V. Ramakrishnan,Amarjeet Singh 2005 Enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable for Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety ofWeb sources. We also present experimental results on the effectiveness of the technique. ICDE On the Optimal Ordering of Maps and Selections under Factorization. 
Thomas Neumann,Sven Helmer,Guido Moerkotte 2005 The query optimizer of a database system is confronted with two aspects when handling user-defined functions (UDFs) in query predicates: the vast differences in evaluation costs between UDFs (and other functions) and multiple calls of the same (expensive) UDF. The former is dealt with by ordering the evaluation of the predicates optimally, the latter by identifying common subexpressions and thereby avoiding costly recomputation. Current approaches order n predicates optimally (neglecting factorization) in O(n log n). Their result may deviate significantly from the optimal solution under factorization. We formalize the problem of finding optimal orderings under factorization and prove that it is NP-hard. Furthermore, we show how to improve on the run time of the brute-force algorithm (which computes all possible orderings) by presenting different enhanced algorithms. Although in the worst case these algorithms obviously still behave exponentially, our experiments demonstrate that for real-life examples their performance is much better. ICDE Reverse Nearest Neighbors in Large Graphs. Man Lung Yiu,Dimitris Papadias,Nikos Mamoulis,Yufei Tao 2005 A reverse nearest neighbor query returns the data objects that have a query point as their nearest neighbor. Although such queries have been studied quite extensively in Euclidean spaces, there is no previous work in the context of large graphs. In this paper, we propose algorithms and optimization techniques for RNN queries by utilizing some characteristics of networks. ICDE SemCast: Semantic Multicast for Content-based Data Dissemination. Olga Papaemmanouil,Ugur Çetintemel 2005 We address the problem of content-based dissemination of highly-distributed, high-volume data streams for stream-based monitoring applications and large-scale data delivery. Existing content-based dissemination approaches commonly rely on distributed filtering trees that require filtering at all brokers on the tree. We present a new semantic multicast approach that eliminates the need for content-based filtering at interior brokers and facilitates fine-grained control over the construction of efficient dissemination trees. The central idea is to split the incoming data streams (based on their contents, rates, and destinations) and then spread the pieces across multiple channels, each of which is implemented as an independent dissemination tree. We present the basic design and evaluation of SemCast, an overlay-network based system that implements this semantic multicast approach. Through a detailed simulation study and realistic network topologies, we demonstrate that SemCast significantly improves the efficiency of dissemination compared to traditional approaches. ICDE Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data. Jian Pei,Daxin Jiang,Aidong Zhang 2005 Mining Cross-Graph Quasi-Cliques in Gene Expression and Protein Interaction Data. ICDE Compressing Bitmap Indices by Data Reorganization. Ali Pinar,Tao Tao,Hakan Ferhatosmanoglu 2005 Many scientific applications generate massive volumes of data through observations or computer simulations, bringing up the need for effective indexing methods for efficient storage and retrieval of scientific data. Unlike conventional databases, scientific data is mostly read-only and its volume can reach to the order of petabytes, making a compact index structure vital. 
Bitmap indexing has been successfully applied to scientific databases by exploiting the fact that scientific data are enumerated or numerical. Bitmap indices can be compressed with variants of run length encoding for a compact index structure. However even this may not be enough for the enormous data generated in some applications such as high energy physics. In this paper, we study how to reorganize bitmap tables for improved compression rates. Our algorithms are used just as a preprocessing step, thus there is no need to revise the current indexing techniques and the query processing algorithms. We introduce the tuple reordering problem, which aims to reorganize database tuples for optimal compression rates. We propose Gray code ordering algorithm for this NP-Complete problem, which is an inplace algorithm, and runs in linear time in the order of the size of the database. We also discuss how the tuple reordering problem can be reduced to the traveling salesperson problem. Our experimental results on real data sets show that the compression ratio can be improved by a factor of 2 to 10. ICDE A Relationally Complete Visual Query Language for Heterogeneous Data Sources and Pervasive Querying. Stavros Polyviou,George Samaras,Paraskevas Evripidou 2005 In this paper we introduce and formally define Query by Browsing (QBB), a scalable, relationally complete visual query language based on the desktop user interface paradigm and tuple relational calculus that allows the formulation of complex queries over relational, entity-relationship, object-oriented and XML data sources on a variety of handheld and desktop platforms. It is to our knowledge the first visual query language to combine the important characteristics of usability, scalability, expressive power and flexibility. We support these claims by demonstrating the similarity of the QBB paradigm to the popular desktop user interface paradigm, by relating it to relational calculus and relational algebra and by describing Chiromancer II, a web-based implementation of the QBB paradigm for handheld devices. We also discuss ways in which non-relational sources can be represented and queried and compare QBB to related work in the area of visual query languages for a variety of data models. We finally offer conclusions and thoughts for future work. ICDE SVL: Storage Virtualization Engine Leveraging DBMS Technology. Lin Qiao,Balakrishna R. Iyer,Divyakant Agrawal,Amr El Abbadi 2005 The demands on storage systems are increasingly requiring expressiveness, fault-tolerance, security, distribution, etc. Such functionalities have been traditionally provided by DBMS. We propose a storage management system, SVL, that leverages DBMS technology. The primary problem in block storage management is block virtualization which is essentially an abstraction layer that separates the user view of storage from the implementation of storage. Storage virtualization standardizes storage management in a heterogeneous storage and/or host environment, and plays a crucial role in enhancing storage functionality and utilization. Currently specialized hardware or microcode based solutions are popular for implementing block storage management systems, commonly referred to as disk controllers. We demonstrate how to take a general purpose commercial RDBMS, rather than a specialized solution, to support block storage management. We exploit the simple semantics of storage management systems to streamline database performance and thus attain acceptability from a storage point of view. 
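The tuple reordering objective described in the Pinar, Tao, and Ferhatosmanoglu abstract above can be made concrete with a tiny cost function: run-length encoding of a column bitmap pays per run, so reordering rows to lengthen runs reduces index size. The sketch below shows only the cost function and a plain sort as the reordering step; it is an illustrative assumption, not the paper's linear-time Gray-code ordering.

```python
def runs(bits):
    """Number of runs in a 0/1 sequence (what run-length encoding must store)."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])

def total_runs(rows):
    """Sum of runs over all column bitmaps of a 0/1 table (rows = list of tuples)."""
    columns = list(zip(*rows))
    return sum(runs(col) for col in columns)

# Toy bitmap table: each row is the bitmap encoding of one tuple.
rows = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 1, 0)]
print(total_runs(rows))          # 11 in the original order
print(total_runs(sorted(rows)))  # 8 after reordering rows; fewer runs compress better
```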
This work promises to pave the way for diverse and innovative industrial applications of database management systems. ICDE IMAX: The Big Picture of Dynamic XML Statistics. Maya Ramanath,Lingzhi Zhang,Juliana Freire,Jayant R. Haritsa 2005 IMAX: The Big Picture of Dynamic XML Statistics. ICDE Adaptive Process Management with ADEPT2. Manfred Reichert,Stefanie Rinderle,Ulrich Kreher,Peter Dadam 2005 Adaptive Process Management with ADEPT2. ICDE Data Triage: An Adaptive Architecture for Load Shedding in TelegraphCQ. Frederick Reiss,Joseph M. Hellerstein 2005 Many of the data sources used in stream query processing are known to exhibit bursty behavior. Data in a burst often has different characteristics than steady-state data, and therefore may be of particular interest. In this paper, we describe the Data Triage architecture that we are adding to TelegraphCQ to provide low latency results with good accuracy under such bursts. ICDE Efficient Data Management on Lightweight Computing Device. Rajkumar Sen,Krithi Ramamritham 2005 Lightweight computing devices are becoming ubiquitous and an increasing number of applications are being developed for these devices. Many of these applications deal with significant amounts of data and involve complex joins and aggregate operations which necessitate a local database management system on the device. This is a challenge as these devices are constrained by limited stable storage and main memory. Hence new storage models that reduce storage costs are needed and a storage scheme should be selected based on data characteristics, nature of queries, and updates. Also, query execution plan should be chosen depending on the amount of available memory and the underlying storage scheme; memory should be optimally allocated among the database operators involved in the query. To achieve these goals, we utilize a novel storage model, ID based Storage, which reduces storage costs considerably. We present an exact algorithm for allocating memory among the database operators. Because of its high complexity, we also propose a heuristic solution based on the benefit of an operator per unit memory allocation. ICDE AutoLag: Automatic Discovery of Lag Correlations in Stream Data. Yasushi Sakurai,Spiros Papadimitriou,Christos Faloutsos 2005 AutoLag: Automatic Discovery of Lag Correlations in Stream Data. ICDE Triggers over XML views of relational data. Feng Shao,Antal Novak,Jayavel Shanmugasundaram 2005 Triggers over XML views of relational data. ICDE Towards Exploring Interactive Relationship between Clusters and Outliers in Multi-Dimensional Data Analysis. Yong Shi,Aidong Zhang 2005 Nowadays many data mining algorithms focus on clustering methods. There are also a lot of approaches designed for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable to each other, especially for those data sets with noise. Thus, it is necessary to treat clusters and outliers as concepts of the same importance in data analysis. In this paper, we present a cluster-outlier iterative detection algorithm, tending to detect the clusters and outliers in another perspective for noisy data sets. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The adjustment and modification of the clusters and outliers are performed iteratively until a certain termination condition is reached. 
This data processing algorithm can be applied in many fields such as pattern recognition, data clustering and signal processing. Experimental results demonstrate the advantages of our approach. ICDE Efficient Creation and Incremental Maintenance of the HOPI Index for Complex XML Document Collections. Ralf Schenkel,Anja Theobald,Gerhard Weikum 2005 The HOPI index, a connection index for XML documents based on the concept of a 2-hop cover, provides space- and time-efficient reachability tests along the ancestor, descendant, and link axes to support path expressions with wildcards in XML search engines. This paper presents enhanced algorithms for building HOPI, shows how to augment the index with distance information, and discusses incremental index maintenance. Our experiments show substantial improvements over the existing divide-and-conquer algorithm for index creation, low space overhead for including distance information in the index, and efficient updates. ICDE Evaluation of Spatio-Temporal Predicates on Moving Objects. Markus Schneider 2005 Moving objects databases managing spatial objects with continuously changing position and extent over time have recently found large interest in the database community. Queries about moving objects become particularly interesting when they ask for temporal changes in the topological relationships between evolving spatial objects. A concept of spatio-temporal predicates has been proposed to describe these relationships. The goal of this paper is to design efficient algorithms for them so that they can be used in spatio-temporal joins and selections. This paper proposes not to design an algorithm for each new predicate individually but to employ a generic algorithmic scheme which is able to cover present and future predicate definitions. ICDE Postgres-R(SI): Combining Replica Control with Concurrency Control based on Snapshot Isolation. Shuqing Wu,Bettina Kemme 2005 Replicating data over a cluster of workstations is a powerful tool to increase performance, and provide fault-tolerance for demanding database applications. The big challenge in such systems is to combine replica control (keeping the copies consistent) with concurrency control. Most of the research so far has focused on providing the traditional correctness criterion, serializability. However, more and more database systems, e.g., Oracle and PostgreSQL, use multi-version concurrency control providing the isolation level snapshot isolation. In this paper, we present Postgres-R(SI), an extension of PostgreSQL offering transparent replication. Our replication tool is designed to work smoothly with PostgreSQL's concurrency control providing snapshot isolation for the entire replicated system. We present a detailed description of the replica control algorithm, and how it is combined with PostgreSQL's concurrency control component. Furthermore, we discuss some challenges we encountered when implementing the protocol. Our performance analysis based on the TPC-W benchmark shows that this approach exhibits excellent performance for real-life applications even if they are update intensive. ICDE SNAP: Efficient Snapshots for Back-in-Time Execution. Liuba Shrira,Hao Xu 2005 "SNAP is a novel high-performance snapshot system for object storage systems. The goal is to provide a snapshot service that is efficient enough to permit ""back-in-time"" read-only activities to run against application-specified snapshots.
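The 2-hop cover underlying the HOPI index described in the Schenkel, Theobald, and Weikum abstract above reduces a reachability test to a set intersection: every node stores the hop nodes it can reach and the hop nodes that can reach it, and u reaches v exactly when the two label sets share a hop on some u-to-v path. The sketch below shows only that lookup with hand-built labels; constructing a small cover is the hard part addressed by the paper and is not shown.

```python
def reachable(u, v, l_out, l_in):
    """Reachability test over a 2-hop cover.

    l_out[x]: hop nodes reachable from x (including x itself);
    l_in[x]:  hop nodes that can reach x (including x itself).
    u reaches v iff the label sets intersect. Labels here are given by hand.
    """
    return bool(l_out[u] & l_in[v])

# Tiny chain 1 -> 2 -> 3, with node 2 as the hop covering the transitive path 1 -> 3.
l_out = {1: {1, 2}, 2: {2}, 3: {3}}
l_in  = {1: {1}, 2: {2}, 3: {2, 3}}
print(reachable(1, 3, l_out, l_in))  # True
print(reachable(3, 1, l_out, l_in))  # False
```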
Such activities are often impossible to run against rapidly evolving current state because of interference or because the required activity is determined in retrospect. A key innovation in SNAP is that it provides snapshots that are transactionally consistent, yet non-disruptive. Unlike earlier systems, we use novel in-memory data structures to ensure that frequent snapshots do not block applications from accessing the storage system, and do not cause unnecessary disk operations. SNAP takes a novel approach to dealing with snapshot meta-data using a new technique that supports both incremental meta-data creation and efficient meta-data reconstruction. We have implemented a SNAP prototype and analyzed its performance. Preliminary results show that providing snapshots for back-in-time activities has low impact on system performance even when snapshots are frequent." ICDE Top Five Data Challenges for the Next Decade. Patricia G. Selinger 2005 Top Five Data Challenges for the Next Decade. ICDE BOXes: Efficient Maintenance of Order-Based Labeling for Dynamic XML Data. Adam Silberstein,Hao He,Ke Yi,Jun Yang 2005 Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well understood, less is known about how to maintain accurate labeling for dynamic XML documents, when elements and subtrees are inserted and deleted. Most existing approaches do not work well for arbitrary update patterns; they either produce unacceptably long labels or incur enormous relabeling costs. We present two novel I/O-efficient data structures, W-BOX and B-BOX, that efficiently maintain labeling for large, dynamic XML documents. We show analytically and experimentally that both, despite consuming minimal amounts of storage, gracefully handle arbitrary update patterns without sacrificing lookup efficiency. The two structures together provide a nice tradeoff between update and lookup costs: W-BOX has logarithmic amortized update cost and constant worst-case lookup cost, while B-BOX has constant amortized update cost and logarithmic worst-case lookup cost. We further propose techniques to eliminate the lookup cost for read-heavy workloads. ICDE Optimizing ETL Processes in Data Warehouses. Alkis Simitsis,Panos Vassiliadis,Timos K. Sellis 2005 Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. Usually, these processes must be completed in a certain time window; thus, it is necessary to optimize their execution time. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and fabricate the state space through a set of correct state transitions. Moreover, we provide algorithms towards the minimization of the execution cost of an ETL workflow. ICDE Maintaining Implicated Statistics in Constrained Environments. Yannis Sismanis,Nick Roussopoulos 2005 Aggregated information regarding implicated entities is critical for online applications like network management, traffic characterization or identifying patterns of resource consumption.
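The state-space view of ETL optimization in the Simitsis, Vassiliadis, and Sellis abstract above can be illustrated with a deliberately tiny search: if a set of transformations commute freely, each ordering is a state and the search picks the ordering with the lowest estimated cost. The operator names, selectivities, and per-tuple costs below are illustrative assumptions, and the exhaustive search stands in for the paper's richer transitions.

```python
from itertools import permutations

def plan_cost(order, n_input, ops):
    """Estimated cost of running freely reorderable ETL operators in a given order.

    ops[name] = (selectivity, cost_per_tuple): each operator pays its per-tuple
    cost on whatever flows into it, then shrinks the flow by its selectivity.
    """
    tuples, cost = float(n_input), 0.0
    for name in order:
        sel, per_tuple = ops[name]
        cost += tuples * per_tuple
        tuples *= sel
    return cost

def cheapest_order(n_input, ops):
    """Exhaustive search over operator orderings (toy state space)."""
    return min(permutations(ops), key=lambda order: plan_cost(order, n_input, ops))

ops = {
    "not_null_check":   (0.9, 1.0),  # (selectivity, cost per tuple) - made-up numbers
    "currency_convert": (1.0, 5.0),
    "dedup_filter":     (0.5, 2.0),
}
print(cheapest_order(1_000_000, ops))  # cheap, highly selective steps tend to come first
```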
Recently there has been a flurry of research on online aggregation over streams (like quantiles, hot items, hierarchical heavy hitters), but surprisingly the problem of summarizing implicated information in stream data has received no attention. As an example, consider an IP network and the implication source → destination. Flash crowds (such as those that follow recent sporting events, like the Olympics, or that seek information regarding catastrophic events) or denial-of-service attacks direct a large volume of traffic from a huge number of sources to a very small number of destinations. In this paper we present novel randomized algorithms for monitoring such implications under constraints on both memory and processing power for environments like network routers. Our experiments demonstrate improvements of several factors over straightforward approaches. ICDE "One Size Fits All": An Idea Whose Time Has Come and Gone (Abstract). Michael Stonebraker,Ugur Çetintemel 2005 "One Size Fits All": An Idea Whose Time Has Come and Gone (Abstract). ICDE Online Latent Variable Detection in Sensor Networks. Jimeng Sun,Spiros Papadimitriou,Christos Faloutsos 2005 Sensor networks attract increasing interest for a broad range of applications. Given a sensor network, one key issue becomes how to utilize it efficiently and effectively. In particular, how can we detect the underlying correlations (latent variables) among many co-evolving sensor measurements? Can we do it incrementally? We present a system that can (1) collect the measurements from the real wireless sensors; (2) process them in real-time; and (3) determine the correlations (latent variables) among the sensor streams on the fly. ICDE IC Tag Based Traceability: System and Solutions. Yoji Taniguchi,Nobutoshi Sagawa 2005 "An increasing number of companies want to improve product traceability for several reasons: to meet stricter government regulations about food and medical safety, to cope with ever-stronger consumer demands to know exactly what they are buying, and to improve and protect the company's brand value through more transparent business operations. Two aspects of traceability are technically important: (1) techniques for tracing the events associated with the goods a company handles at all necessary points of the business operation, possibly through the use of IC tags and tag readers; and (2) ways to store, manage, and use the collected logs of events either to cope with problems or to improve business processes. In this paper, we first review currently available traceability systems by considering examples from real-world situations. After that, we discuss the likely directions and possibilities of next-generation traceability systems." ICDE A Distributed Quadtree Index for Peer-to-Peer Settings. Egemen Tanin,Aaron Harwood,Hanan Samet 2005 A Distributed Quadtree Index for Peer-to-Peer Settings. ICDE Acceleration Technique of Snake-Shaped Regions Retrieval Method for Telematics Navigation Service System. Masaaki Tanizaki,Kishiko Maruyama,Shigeru Shimada 2005 Telematics services, which provide traffic information such as route guidance, congestion warnings, etc. via a wireless communication network, have spread recently. The demand is growing for graphical guide information to be provided in addition to the conventional service that provides text-only guidance. To improve graphical service, we propose a new retrieval method.
This method enables fast extraction of map objects within a Snake-Shaped Region (SSR) along a driving route from a geo-spatial database that stores map data without rectangular mesh boundaries. For this retrieval method, we have considered three techniques. The first is based on simplification of the snake-shaped route region through point elimination, and the second is based on reduction of the processing load of the geometrical intersection detection processes. This second technique is accomplished by dividing the Snake-Shaped Region into multiple cells, and the third distributes the SSR retrieval results to terminals in multiple deliveries so that navigation processing can start quickly. We have developed a prototype to evaluate the performance of the proposed methods. The prototype provides route guidance information for an actual terminal, and uses information taken from United States road maps. Even in an urban area, we managed to provide guide information for an approximately 200-mile route within 10 seconds. We are convinced that the proposed method can be applied to actual Telematics services. ICDE Venn Sampling: A Novel Prediction Technique for Moving Objects. Yufei Tao,Dimitris Papadias,Jian Zhai,Qing Li 2005 "Given a region q_R and a future timestamp q_T, a "range aggregate" query estimates the number of objects expected to appear in q_R at time q_T. Currently the only methods for processing such queries are based on spatio-temporal histograms, which have several serious problems. First, they consume considerable space in order to provide accurate estimation. Second, they incur high evaluation cost. Third, their efficiency continuously deteriorates with time. Fourth, their maintenance requires significant update overhead. Motivated by this, we develop Venn sampling (VS), a novel estimation method optimized for a set of "pivot queries" that reflect the distribution of actual ones. In particular, given m pivot queries, VS achieves perfect estimation with only O(m) samples, as opposed to O(2^m) required by the current state of the art in workload-aware sampling. Compared with histograms, our technique is much more accurate (given the same space), produces estimates with negligible cost, and does not deteriorate with time. Furthermore, it permits the development of a novel "query-driven" update policy, which reduces the update cost of conventional policies significantly." ICDE Range Efficient Computation of F0 over Massive Data Streams. Pavan Aduri,Srikanta Tirthapura 2005 Range Efficient Computation of F0 over Massive Data Streams. ICDE Distributed XML Stream Filtering System with High Scalability. Hiroyuki Uchiyama,Makoto Onizuka,Takashi Honishi 2005 We propose a distributed XML stream filtering system that uses a large number of subscribers' profiles, written in XPath expressions, to filter XML streams and then publish the filtered data in real-time. To realize the proposed system, we define XPath expression features on XML data and utilize them to forecast the servers' loads. Our method is realized by combining a method to share the total transfer load of each filtering server with a method to equalize the sum of overlap sizes between filtering servers. Experiments show that the rate at which the publishing time increases with the number of XPath expressions is three times smaller in the proposed system than in the round-robin method. Furthermore, the overhead of the proposed method is quite low. ICDE Privacy-Preserving Top-K Queries.
Jaideep Vaidya,Chris Clifton 2005 Privacy-Preserving Top-K Queries. ICDE Representing and Querying Data Transformations. Yannis Velegrakis,Renée J. Miller,John Mylopoulos 2005 "Modern information systems often store data that has been transformed and integrated from a variety of sources. This integration may obscure the original source semantics of data items. For many tasks, it is important to be able to determine not only where data items originated, but also why they appear in the integration as they do and through what transformation they were derived. This problem is known as data provenance. In this work, we consider data provenance at the schema and mapping level. In particular, we consider how to answer questions such as "what schema elements in the source(s) contributed to this value", or "through what transformations or mappings was this value derived?" Towards this end, we elevate schemas and mappings to first-class citizens that are stored in a repository and are associated with the actual data values. An extended query language, called MXQL, is also developed that allows meta-data to be queried as regular data, and we describe its implementation." ICDE Conference Officers. 2005 Conference Officers. ICDE On the Sequencing of Tree Structures for XML Indexing. Haixun Wang,Xiaofeng Meng 2005 Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. In this paper, we address the problem of query equivalence with respect to this transformation, and we introduce a performance-oriented principle for sequencing tree structures. With query equivalence, XML queries can be performed through subsequence matching without join operations, post-processing, or other special handling for problems such as false alarms. We identify a class of sequencing methods for this purpose, and we present a novel subsequence matching algorithm that observes query equivalence. Still, query equivalence is just a prerequisite for sequence-based XML indexing. Our goal is to find the best sequencing strategy with regard to the time and space complexity in indexing and querying XML data. To this end, we introduce a performance-oriented principle to guide the sequencing of tree structures. For any given XML dataset, the principle finds an optimal sequencing strategy according to its schema and its data distribution. We present a novel method that realizes this principle. In our experiments, we show the advantages of sequence-based indexing over traditional XML indexing methods, and we compare several sequencing strategies and demonstrate the benefit of the performance-oriented sequencing principle. ICDE Online Mining of Data Streams: Applications, Techniques and Progress. Haixun Wang,Jian Pei,Philip S. Yu 2005 Online Mining of Data Streams: Applications, Techniques and Progress. ICDE Program Committee. 2005 Program Committee. ICDE Demo Program Committee. 2005 Demo Program Committee. ICDE External Reviewers. 2005 External Reviewers. ICDE Adaptive Lapped Declustering: A Highly Available Data-Placement Method Balancing Access Load and Space Utilization. Akitsugu Watanabe,Haruo Yokota 2005 Adaptive Lapped Declustering: A Highly Available Data-Placement Method Balancing Access Load and Space Utilization. ICDE Message from the General Chairs. 2005 Message from the General Chairs.
ICDE Odysseus: A High-Performance ORDBMS Tightly-Coupled with IR Features. Kyu-Young Whang,Min-Jae Lee,Jae-Gil Lee,Min-Soo Kim,Wook-Shin Han 2005 We propose the notion of tight-coupling [8] to add new data types into the DBMS engine. In this paper, we introduce the Odysseus ORDBMS and present its tightly-coupled IR features (U.S. patented). We demonstrate a web search engine capable of managing 20 million web pages in a non-parallel configuration using Odysseus. ICDE Welcome from the Program Chairs. 2005 Welcome from the Program Chairs. ICDE Dynamic Load Distribution in the Borealis Stream Processor. Ying Xing,Stanley B. Zdonik,Jeong-Hyon Hwang 2005 Distributed and parallel computing environments are becoming cheap and commonplace. The availability of large numbers of CPUs makes it possible to process more data at higher speeds. Stream-processing systems are also becoming more important, as broad classes of applications require results in real-time. Since load can vary in unpredictable ways, exploiting the abundant processor cycles requires effective dynamic load distribution techniques. Although load distribution has been extensively studied for traditional pull-based systems, it has not yet been fully studied in the context of push-based continuous query processing. In this paper, we present a correlation-based load distribution algorithm that aims at avoiding overload and minimizing end-to-end latency by minimizing load variance and maximizing load correlation. While finding the optimal solution for such a problem is NP-hard, our greedy algorithm can find reasonable solutions in polynomial time. We present both a global algorithm for initial load distribution and a pair-wise algorithm for dynamic load migration. ICDE SEA-CNN: Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. Xiaopeng Xiong,Mohamed F. Mokbel,Walid G. Aref 2005 Location-aware environments are characterized by a large number of objects and a large number of continuous queries. Both the objects and continuous queries may change their locations over time. In this paper, we focus on continuous k-nearest neighbor queries (CKNN, for short). We present a new algorithm, termed SEA-CNN, for continuously answering a collection of concurrent CKNN queries. SEA-CNN has two important features: incremental evaluation and shared execution. SEA-CNN achieves both efficiency and scalability in the presence of a set of concurrent queries. Furthermore, SEA-CNN does not make any assumptions about the movement of objects, e.g., the objects' velocities and the shapes of their trajectories, or about the mutability of the objects and/or the queries, i.e., moving or stationary queries issued on moving or stationary objects. We provide theoretical analysis of SEA-CNN with respect to the execution costs, memory requirements and effects of tunable parameters. Comprehensive experimentation shows that SEA-CNN is highly scalable and is more efficient in terms of both I/O and CPU costs in comparison to other R-tree-based CKNN techniques. ICDE Scrutinizing Frequent Pattern Discovery Performance. Osmar R. Zaïane,Mohammad El-Hajj,Yi Li,Stella Luk 2005 Benchmarking technical solutions is as important as the solutions themselves. Yet many fields still lack any type of rigorous evaluation. Performance benchmarking has always been an important issue in databases and has played a significant role in the development, deployment and adoption of technologies.
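For readers unfamiliar with the task being benchmarked here, the following is a deliberately naive frequent itemset miner in Python. It is only a baseline to make the problem concrete; real contenders (Apriori, FP-growth and their many variants) avoid this brute-force enumeration, and the example data and minsup value are made up.

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, minsup):
        """Return every itemset whose support (fraction of transactions containing it)
        is at least minsup. Brute-force enumeration; suitable only for tiny examples."""
        counts = Counter()
        for t in transactions:
            items = sorted(set(t))
            for size in range(1, len(items) + 1):
                for itemset in combinations(items, size):
                    counts[itemset] += 1
        n = len(transactions)
        return {s: c / n for s, c in counts.items() if c / n >= minsup}

    data = [["milk", "bread"], ["milk", "bread", "butter"], ["bread"], ["milk", "butter"]]
    print(frequent_itemsets(data, minsup=0.5))
    # e.g. ('bread',): 0.75, ('milk',): 0.75, ('bread', 'milk'): 0.5, ('butter',): 0.5, ...

Benchmarking frameworks of the kind described here vary the data characteristics (density, number of distinct items, transaction length) and the minsup threshold, then compare implementations under identical conditions.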
To help assess the myriad algorithms for frequent itemset mining, we built an open framework and testbed to analytically study the performance of different algorithms and their implementations, and contrast their achievements given different data characteristics, different conditions, and different types of patterns to discover and their constraints. This facilitates reporting consistent and reproducible performance results using known conditions. ICDE CLICKS: Mining Subspace Clusters in Categorical Data via K-partite Maximal Cliques. Mohammed Javeed Zaki,Markus Peters 2005 We present a novel algorithm called CLICKS that finds clusters in categorical datasets based on a search for k-partite maximal cliques. Unlike previous methods, CLICKS mines subspace clusters. It uses a selective vertical method to guarantee complete search. CLICKS outperforms previous approaches by over an order of magnitude and scales better than any of the existing methods for high-dimensional datasets. We demonstrate this improvement in an excerpt from our comprehensive performance studies. ICDE Spatiotemporal Annotation Graph (STAG): A Data Model for Composite Digital Objects. Smriti Yamini,Amarnath Gupta 2005 In this demonstration, we present a database over complex documents, which, in addition to structured text content, also contain update information, annotations, and embedded objects. We propose a new data model called Spatio-temporal Annotation Graphs (STAG) for a database of composite digital objects and present a system that shows a query language to efficiently and effectively query such a database. The particular application to be demonstrated is a database over annotated MS Word and PowerPoint presentations with embedded multimedia objects. ICDE BlossomTree: Evaluating XPaths in FLWOR Expressions. Ning Zhang,Shishir Agrawal,M. Tamer Özsu 2005 Efficient evaluation of path expressions has been studied extensively. However, evaluating more complex FLWOR expressions that contain multiple path expressions has not been well studied. In this paper, we propose a novel pattern matching approach, called BlossomTree, to evaluate a FLWOR expression that contains correlated path expressions. BlossomTree is a formalism to capture the semantics of the path expressions and their correlations. We propose a general algebraic framework (abstract data types and logical operators) to evaluate BlossomTree pattern matching that facilitates efficient evaluation and experimentation. We design efficient data structures and algorithms to implement the abstract data types and logical operators. Our experimental studies demonstrate that the BlossomTree approach can generate highly efficient query plans in different environments. ICDE Mining Closed Relational Graphs with Connectivity Constraints. Xifeng Yan,Xianghong Jasmine Zhou,Jiawei Han 2005 Relational graphs are widely used in modeling large scale networks such as biological networks and social networks. In this kind of graph, connectivity becomes critical in identifying highly associated groups and clusters. In this paper, we investigate the issues of mining closed frequent graphs with connectivity constraints in massive relational graphs where each graph has around 10K nodes and 1M edges. We adopt the concept of edge connectivity and apply the results from graph theory to speed up the mining process. Two approaches are developed to handle different mining requests: CloseCut, a pattern-growth approach, and splat, a pattern-reduction approach.
We have applied these methods in biological datasets and found the discovered patterns interesting. ICDE XGuard: A System for Publishing XML Documents without Information Leakage in the Presence of Data Inference. Xiaochun Yang,Chen Li,Ge Yu,Lei Shi 2005 XGuard: A System for Publishing XML Documents without Information Leakage in the Presence of Data Inference. ICDE Dynamic Load Management for Distributed Continuous Query Systems. Yongluan Zhou,Beng Chin Ooi,Kian-Lee Tan 2005 Dynamic Load Management for Distributed Continuous Query Systems. ICDE Sentiment Mining in WebFountain. Jeonghee Yi,Wayne Niblack 2005 WebFountain is a platform for very large-scale text analytics applications that allows uniform access to a wide variety of sources. It enables the deployment of a variety of document-level and corpus-level miners in a scalable manner, and feeds information that drives end-user applications through a set of hosted Web services. Sentiment (or opinion) mining is one of the most useful analyses for various end-user applications, such as reputation management. Instead of classifying the sentiment of an entire document about a subject, our sentiment miner determines sentiment of each subject reference using natural language processing techniques. In this paper, we describe the fully functional system environment and the algorithms, and report the performance of the sentiment miner. The performance of the algorithms was verified on online product review articles, and more general documents including Web pages and news articles. ICDE DUP: Dynamic-tree Based Update Propagation in Peer-to-Peer. Liangzhong Yin,Guohong Cao 2005 DUP: Dynamic-tree Based Update Propagation in Peer-to-Peer. ICDE On Discovery of Extremely Low-Dimensional Clusters using Semi-Supervised Projected Clustering. Kevin Y. Yip,David W. Cheung,Michael K. Ng 2005 Recent studies suggest that projected clusters with extremely low dimensionality exist in many real datasets. A number of projected clustering algorithms have been proposed in the past several years, but few can identify clusters with dimensionality lower than 10% of the total number of dimensions, which are commonly found in some real datasets such as gene expression profiles. In this paper we propose a new algorithm that can accurately identify projected clusters with relevant dimensions as few as 5% of the total number of dimensions. It makes use of a robust objective function that combines object clustering and dimension selection into a single optimization problem. The algorithm can also utilize domain knowledge in the form of labeled objects and labeled dimensions to improve its clustering accuracy. We believe this is the first semi-supervised projected clustering algorithm. Both theoretical analysis and experimental results show that by using a small amount of input knowledge, possibly covering only a portion of the underlying classes, the new algorithm can be further improved to accurately detect clusters with only 1% of the dimensions being relevant. The algorithm is also useful in getting a target set of clusters when there are multiple possible groupings of the objects. ICDE Monitoring K-Nearest Neighbor Queries Over Moving Objects. Xiaohui Yu,Ken Q. Pu,Nick Koudas 2005 Many location-based applications require constant monitoring of k-nearest neighbor (k-NN) queries over moving objects within a geographic area. 
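As background for the grid-based approach developed below, here is a small Python sketch of a uniform grid index answering a k-nearest-neighbor query by expanding rings of cells around the query cell. It is illustrative only, not the paper's algorithms or SEA-CNN; the cell size, object ids, and class name are made-up parameters.

    import math
    from collections import defaultdict

    class GridIndex:
        """Toy uniform grid over object positions (illustrative sketch only)."""
        def __init__(self, cell=10.0):
            self.cell = cell
            self.cells = defaultdict(dict)   # (cx, cy) -> {obj_id: (x, y)}

        def _key(self, x, y):
            return (int(math.floor(x / self.cell)), int(math.floor(y / self.cell)))

        def update(self, obj_id, x, y, old=None):
            if old is not None:
                self.cells[self._key(*old)].pop(obj_id, None)
            self.cells[self._key(x, y)][obj_id] = (x, y)

        def knn(self, qx, qy, k):
            total = sum(len(c) for c in self.cells.values())
            k = min(k, total)
            if k == 0:
                return []
            qc = self._key(qx, qy)
            found, ring = [], 0
            while True:
                # visit all cells at Chebyshev distance `ring` from the query cell
                for cx in range(qc[0] - ring, qc[0] + ring + 1):
                    for cy in range(qc[1] - ring, qc[1] + ring + 1):
                        if max(abs(cx - qc[0]), abs(cy - qc[1])) != ring:
                            continue
                        for obj_id, (x, y) in self.cells.get((cx, cy), {}).items():
                            found.append((math.hypot(x - qx, y - qy), obj_id))
                found.sort()
                # any object in an unvisited ring is at least ring * cell away
                if len(found) >= k and found[k - 1][0] <= ring * self.cell:
                    return found[:k]
                ring += 1

    g = GridIndex(cell=10.0)
    for i, (x, y) in enumerate([(3, 4), (15, 2), (40, 40), (7, 9)]):
        g.update(i, x, y)
    print(g.knn(5, 5, k=2))   # the two nearest object ids with their distances

Continuous monitoring approaches maintain such a grid under position updates and re-examine only the cells a moving object or query actually touches, rather than recomputing each answer from scratch.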
Existing approaches to this problem have focused on predictive queries, and relied on the assumption that the trajectories of the objects are fully predictable at query processing time. We relax this assumption, and propose two efficient and scalable algorithms using grid indices. One is based on indexing objects, and the other on queries. For each approach, a cost model is developed, and a detailed analysis, along with the respective applicability, is presented. The Object-Indexing approach is further extended to multiple levels to handle skewed data. We show by experiments that our grid-based algorithms significantly outperform R-tree-based solutions. Extensive experiments are also carried out to study the properties and evaluate the performance of the proposed approaches under a variety of settings. ICDE Deep Store: an Archival Storage System Architecture. Lawrence You,Kristal T. Pollack,Darrell D. E. Long 2005 We present the Deep Store archival storage architecture, a large-scale storage system that stores immutable data efficiently and reliably for long periods of time. Archived data is stored across a cluster of nodes and recorded to hard disk. The design differentiates itself from traditional file systems by eliminating redundancy within and across files, distributing content for scalability, associating rich metadata with content, and using variable levels of replication based on the importance or degree of dependency of each piece of stored data. We evaluate the foundations of our design, including PRESIDIO, a virtual content-addressable storage framework with multiple methods for inter-file and intra-file compression that effectively addresses the data-dependent variability of data compression. We measure content and metadata storage efficiency, demonstrate the need for a variable-degree replication model, and provide preliminary results for storage performance. ICDE Enabling Ad-hoc Ranking for Data Retrieval. Hwanjo Yu,Seung-won Hwang,Kevin Chen-Chuan Chang 2005 Enabling Ad-hoc Ranking for Data Retrieval. ICDE Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan 2005 Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan SIGMOD Conference Query-Sensitive Embeddings. Vassilis Athitsos,Marios Hadjieleftheriou,George Kollios,Stan Sclaroff 2005 A common problem in many types of databases is retrieving the most similar matches to a query object. Finding these matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. Embedding methods can significantly speed up retrieval by mapping objects into a vector space, where distances can be measured rapidly using a Minkowski metric. In this article we present a novel way to improve embedding quality. In particular, we propose to construct embeddings that use a query-sensitive distance measure for the target space of the embedding. This distance measure is used to compare those vectors that the query and database objects are mapped to. The term “query-sensitive” means that the distance measure changes depending on the current query object. We demonstrate theoretically that using a query-sensitive distance measure increases the modeling power of embeddings and allows them to capture more of the structure of the original space.
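To illustrate what "query-sensitive" means in the embedded space, here is a tiny Python sketch of a weighted L1 distance whose active coordinates depend on the query. The weighting rule (trust the coordinates where the query embedding is smallest) and the parameter num_active are assumptions made for this sketch, not the paper's actual construction.

    def query_sensitive_distance(q_emb, x_emb, num_active=3):
        """Weighted L1 distance in the embedded space; which coordinates count
        is decided per query (illustrative rule only, not the paper's method)."""
        # keep the num_active coordinates with the smallest query values
        active = sorted(range(len(q_emb)), key=lambda i: q_emb[i])[:num_active]
        return sum(abs(q_emb[i] - x_emb[i]) for i in active)

    query = [0.2, 3.1, 0.5, 2.4, 0.1]
    candidates = {"a": [0.3, 2.0, 0.4, 2.5, 0.2], "b": [1.5, 3.0, 1.9, 2.3, 1.4]}
    ranked = sorted(candidates, key=lambda c: query_sensitive_distance(query, candidates[c]))
    print(ranked)   # 'a' ranks before 'b' under this query's active coordinates

A different query would select different coordinates, so the same pair of database vectors can be compared under different distance measures depending on who is asking.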
We also demonstrate experimentally that query-sensitive embeddings can significantly improve retrieval performance. In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better tradeoffs between efficiency and retrieval accuracy. SIGMOD Conference Towards a Robust Query Optimizer: A Principled and Practical Approach. Brian Babcock,Surajit Chaudhuri 2005 Research on query optimization has focused almost exclusively on reducing query execution time, while important qualities such as consistency and predictability have largely been ignored, even though most database users consider these qualities to be at least as important as raw performance. In this paper, we explore how the query optimization process can be made more robust, focusing on the important subproblem of cardinality estimation. The robust cardinality estimation technique that we propose allows for a user- or application-specified trade-off between performance and predictability, and it captures multi-dimensional correlations while remaining space- and time-efficient. SIGMOD Conference Proactive Re-optimization. Shivnath Babu,Pedro Bizarro,David J. DeWitt 2005 Traditional query optimizers rely on the accuracy of estimated statistics to choose good execution plans. This design often leads to suboptimal plan choices for complex queries, since errors in estimates for intermediate subexpressions grow exponentially in the presence of skewed and correlated data distributions. Reoptimization is a promising technique to cope with such mistakes. Current re-optimizers first use a traditional optimizer to pick a plan, and then react to estimation errors and resulting suboptimalities detected in the plan during execution. The effectiveness of this approach is limited because traditional optimizers choose plans unaware of issues affecting reoptimization. We address this problem using proactive reoptimization, a new approach that incorporates three techniques: i) the uncertainty in estimates of statistics is computed in the form of bounding boxes around these estimates, ii) these bounding boxes are used to pick plans that are robust to deviations of actual values from their estimates, and iii) accurate measurements of statistics are collected quickly and efficiently during query execution. We present an extensive evaluation of these techniques using a prototype proactive re-optimizer named Rio. In our experiments Rio outperforms current re-optimizers by up to a factor of three. SIGMOD Conference Proactive re-optimization with Rio. Shivnath Babu,Pedro Bizarro,David J. DeWitt 2005 Traditional query optimizers rely on the accuracy of estimated statistics of intermediate subexpressions to choose good query execution plans. This design often leads to suboptimal plan choices for complex queries since errors in estimates grow exponentially in the presence of skewed and correlated data distributions. We propose to demonstrate the Rio prototype database system that uses proactive re-optimization to address the problems with traditional optimizers. Rio supports three new techniques: 1. Intervals of uncertainty are considered around estimates of statistics during plan enumeration and costing; 2.
These intervals are used to pick execution plans that are robust to deviations of actual values of statistics from estimated values, or to defer the choice of execution plan until the uncertainty in estimates can be resolved; 3. Statistics of intermediate subexpressions are collected quickly, accurately, and efficiently during query execution. These three features are fully functional in the current Rio prototype, which is built using the Predator open-source DBMS [5]. In this proposal, we first describe the novel features of Rio, then we use an example query to illustrate the main aspects of our demonstration. SIGMOD Conference Database tuning advisor for microsoft SQL server 2005: demo. Sanjay Agrawal,Surajit Chaudhuri,Lubor Kollár,Arunprasad P. Marathe,Vivek R. Narasayya,Manoj Syamala 2005 "Database Tuning Advisor (DTA) is a physical database design tool that is part of Microsoft's SQL Server 2005 relational database management system. Previously known as "Index Tuning Wizard" in SQL Server 7.0 and SQL Server 2000, DTA adds new functionality that is not available in other contemporary physical design tuning tools. Novel aspects of DTA that will be demonstrated include: (a) Ability to take into account both performance and manageability requirements of DBAs (b) Fully integrated recommendations for indexes, materialized views and horizontal partitioning (c) Transparently leverage a test server to offload tuning load from the production server and (d) Easy programmability and scriptability." SIGMOD Conference Privacy Preserving OLAP. Rakesh Agrawal,Ramakrishnan Srikant,Dilys Thomas 2005 We present techniques for privacy-preserving computation of multidimensional aggregates on data partitioned across multiple clients. Data from different clients is perturbed (randomized) in order to preserve privacy before it is integrated at the server. We develop formal notions of privacy obtained from data perturbation and show that our perturbation provides guarantees against privacy breaches. We develop and analyze algorithms for reconstructing counts of subcubes over perturbed data. We also evaluate the tradeoff between privacy guarantees and reconstruction accuracy and show the practicality of our approach. SIGMOD Conference Distributed operation in the Borealis stream processing engine. Yanif Ahmad,Bradley Berg,Ugur Çetintemel,Mark Humphrey,Jeong-Hyon Hwang,Anjali Jhingran,Anurag Maskey,Olga Papaemmanouil,Alex Rasin,Nesime Tatbul,Wenjuan Xing,Ying Xing,Stanley B. Zdonik 2005 Borealis is a distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora and inter-node communication functionality from Medusa. We propose to demonstrate some of the key aspects of distributed operation in Borealis, using a multi-player network game as the underlying application. The demonstration will illustrate the dynamic resource management, query optimization and high availability mechanisms employed by Borealis, using visual performance-monitoring tools as well as the gaming experience. SIGMOD Conference Schema and ontology matching with COMA++. David Aumueller,Hong Hai Do,Sabine Massmann,Erhard Rahm 2005 We demonstrate the schema and ontology matching tool COMA++. It extends our previous prototype COMA utilizing a composite approach to combine different match algorithms [3]. COMA++ implements significant improvements and offers a comprehensive infrastructure to solve large real-world match problems.
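To make the idea of a composite matcher concrete, here is a small Python sketch that combines two simple matchers (an edit-distance-style name matcher and a token-overlap matcher) into one weighted similarity score. The matchers, weights, and threshold are invented defaults for illustration; they are not COMA++'s actual algorithms or configuration.

    from difflib import SequenceMatcher

    def name_sim(a, b):
        """Edit-distance-style similarity between element names."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def token_sim(a, b):
        """Overlap of '_'-separated tokens, a crude stand-in for a synonym/taxonomy matcher."""
        ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
        return len(ta & tb) / len(ta | tb)

    def composite_match(schema1, schema2, weights=(0.5, 0.5), threshold=0.45):
        """Combine several matchers into one score and keep pairs above a threshold."""
        matches = []
        for a in schema1:
            for b in schema2:
                score = weights[0] * name_sim(a, b) + weights[1] * token_sim(a, b)
                if score >= threshold:
                    matches.append((a, b, round(score, 2)))
        return sorted(matches, key=lambda m: -m[2])

    print(composite_match(["cust_name", "cust_addr", "order_id"],
                          ["customer_name", "address", "orderId"]))
    # pairs the composite matcher proposes, highest score first

Composite approaches of this kind let the weights, the set of matchers, and the acceptance threshold be tuned per match problem, or be replaced by reuse of previously confirmed correspondences.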
It comes with a graphical interface enabling a variety of user interactions. Using a generic data representation, COMA++ uniformly supports schemas and ontologies, e.g., the powerful standard languages W3C XML Schema and OWL. COMA++ includes new approaches for ontology matching, in particular the utilization of shared taxonomies. Furthermore, different match strategies can be applied, including various forms of reusing previously determined match results and a so-called fragment-based match approach, which decomposes a large match problem into smaller problems. Finally, COMA++ can not only be used to solve match problems but also to comparatively evaluate the effectiveness of different match algorithms and strategies. SIGMOD Conference The many roles of meta data in data integration. Philip A. Bernstein 2005 This paper is a short introduction to an industrial session on the use of meta data to address data integration problems in large enterprises. The main topics are data discovery, version and configuration management, and mapping development. SIGMOD Conference Fault-tolerance in the Borealis distributed stream processing system. Magdalena Balazinska,Hari Balakrishnan,Samuel Madden,Michael Stonebraker 2005 Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery. This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up to the user-specified bound in a manner that meets the required availability without introducing any inconsistency. SIGMOD Conference Extending XQuery for Analytics. Kevin S. Beyer,Donald D. Chamberlin,Latha S. Colby,Fatma Özcan,Hamid Pirahesh,Yu Xu 2005 XQuery is a query language under development by the W3C XML Query Working Group. The language contains constructs for navigating, searching, and restructuring XML data.
With XML gaining importance as the standard for representing business data, XQuery must support the types of queries that are common in business analytics. One such class of queries is OLAP-style aggregation queries. Although these queries are expressible in XQuery Version 1, the lack of explicit grouping constructs makes the construction of these queries non-intuitive and places a burden on the XQuery engine to recognize and optimize the implicit grouping constructs. Furthermore, although the flexibility of the XML data model provides an opportunity for advanced forms of grouping that are not easily represented in relational systems, these queries are difficult to express using the current XQuery syntax. In this paper, we provide a proposal for extending the XQuery FLWOR expression with explicit syntax for grouping and for numbering of results. We show that these new XQuery constructs not only simplify the construction and evaluation of queries requiring grouping and ranking but also enable complex analytic queries such as moving-window aggregation and rollups along dynamic hierarchies to be expressed without additional language extensions. SIGMOD Conference DB2/XML: designing for evolution. Kevin S. Beyer,Fatma Özcan,Sundar Saiprasad,Bert Van der Linden 2005 "DB2 provides native XML storage, indexing, navigation and query processing through both SQL/XML and XQuery using the XML data type introduced by SQL/XML. In this tutorial we focus on DB2's XML support for schema evolution, especially DB2's schema repository and document-level validation." SIGMOD Conference A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. Philip Bohannon,Michael Flaster,Wenfei Fan,Rajeev Rastogi 2005 A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD Conference Database issues for the 21st century. Adam Bosworth 2005 The Web has democratized and revolutionized computing in the last 10 years. Acting as a universal communications mechanism, it has let anyone talk to anyone (for example via email, IM and VoIP), anyone talk to any application (the Web), anyone talk to some very limited forms of information (Blogs/RSS), any application talk to anyone (Spam), and any application talk to any other application (Web Services). SIGMOD Conference Safe data sharing and data dissemination on smart devices. Luc Bouganim,Cosmin Cremarenco,François Dang Ngoc,Nicolas Dieu,Philippe Pucheral 2005 The erosion of trust placed in traditional database servers and in Database Service Providers (DSP), the growing interest in different forms of data dissemination, and the concern for protecting children from suspicious Internet content are all factors that lead to moving access control from servers to clients. Due to the intrinsic untrustworthiness of client devices, client-based access control solutions rely on data encryption. The data are kept encrypted at the server and a client is granted access to subparts of them according to the decryption keys in its possession. Several variations of this basic model have been proposed (e.g., [1, 6]), but they all minimize the trust required on the client at the cost of a static way of sharing data. Indeed, whatever the granularity of sharing, the dataset is split into subsets reflecting a current sharing situation, each encrypted with a different key.
Once the dataset is encrypted, changes in the access control rule definitions may impact the subset boundaries, hence incurring a partial re-encryption of the dataset and a potential redistribution of keys. SIGMOD Conference MYSTIQ: a system for finding more answers by using probabilities. Jihad Boulos,Nilesh N. Dalvi,Bhushan Mandhani,Shobhit Mathur,Christopher Ré,Dan Suciu 2005 MystiQ is a system that uses probabilistic query semantics [3] to find answers in large numbers of data sources of less than perfect quality. There are many reasons why the data originating from many different sources may be of poor quality, and therefore difficult to query: the same data item may have different representations in different sources; the schema alignments needed by a query system are imperfect and noisy; different sources may contain contradictory information, and, in particular, their combined data may violate some global integrity constraints; fuzzy matches between objects from different sources may return false positives or negatives. Even in such an environment, users sometimes want to ask complex, structurally rich queries, using query constructs typically found in SQL queries: joins, subqueries, existential/universal quantifiers, aggregate and group-by queries: for example, scientists may use such queries to query multiple scientific data sources, or a law enforcement agency may use them to find rare associations from multiple data sources. If standard query semantics were applied to such queries, all but the most trivial queries would return an empty answer. SIGMOD Conference XQBE: a visual environment for learning XML query languages. Daniele Braga,Alessandro Campi,Stefano Ceri,Alessandro Raffio 2005 XQBE (XQuery By Example) is a visual XML query language which, coherently with the hierarchical XML data model, uses tree-shaped structures to express queries and transformations over XML documents. These structures are annotated to express selection predicates; explicit bindings between the nodes of such structures visualize the input/output mappings. XQuery and XSLT, the standard query and transformation languages for XML, happen to be too complex for most occasional or unskilled users who might need to specify queries, schema mappings, or document transformations, if they are only aware of the basics of the XML data model. The implementation of XQBE generates the XQuery and XSLT translations of the visual queries, assisting the user in several aspects of the interaction (e.g., providing interactive access to schema information); therefore, XQBE provides an integrated environment where users can edit the visual queries and their textual counterparts, executing them on several engines. Alternating among different representations of the same query is valuable for training beginners, as we have experienced in our database courses. SIGMOD Conference Model-driven design of service-enabled web applications. Marco Brambilla,Stefano Ceri,Piero Fraternali,Roberto Acerbis,Aldo Bongio 2005 Significant efforts are currently invested in application integration to enable the interaction and composition of business processes of different companies, yielding complex, multi-party processes. Web service standards, based on WSDL, have been adopted as a process-to-process communication paradigm. This paper presents an industrial experience in integrating data-intensive and process-intensive Web applications through Web services.
Design of sites and of Web services interaction exploits modern Web engineering methods, including conceptual modeling, model verification, visual data marshalling and automatic code generation. In particular, the applied method is based on a declarative model for specifying data-intensive Web applications that enact complex interactions, driven by the user, with remote processes implemented as services. We describe the internal architecture of the CASE tool that has been used, and give an overview of three industrial applications developed with the described approach. SIGMOD Conference Cost-Sensitive Reordering of Navigational Primitives. Carl-Christian Kanne,Matthias Brantner,Guido Moerkotte 2005 We present a method to evaluate path queries based on the novel concept of partial path instances. Our method (1) maximizes performance by means of sequential scans or asynchronous I/O, (2) does not require a special storage format, (3) relies on simple navigational primitives on trees, and (4) can be complemented by existing logical and physical optimizations such as duplicate elimination, duplicate prevention and path rewriting. We use a physical algebra which separates those navigation operations that require I/O from those that do not. All I/O operations necessary for the evaluation of a path are isolated in a single operator, which may employ efficient I/O scheduling strategies such as sequential scans or asynchronous I/O. Performance results for queries from the XMark benchmark show that reordering the navigation operations can increase performance by up to a factor of four. SIGMOD Conference Automatic Physical Database Tuning: A Relaxation-based Approach. Nicolas Bruno,Surajit Chaudhuri 2005 In recent years there has been considerable research on automated selection of physical design in database systems. In current solutions, candidate access paths are heuristically chosen based on the structure of each input query, and a subsequent bottom-up search is performed to identify the best overall configuration. To handle large workloads and multiple kinds of physical structures, recent techniques have become increasingly complex: they exhibit many special cases, shortcuts, and heuristics that make it very difficult to analyze and extract properties. In this paper we critically examine the architecture of current solutions. We then design a new framework for the physical design problem that significantly reduces the assumptions and heuristics used in previous approaches. While simplicity and uniformity are important contributions in themselves, we report extensive experimental results showing that our approach could result in comparable (and, in many cases, considerably better) recommendations than state-of-the-art commercial alternatives. SIGMOD Conference Personal information management with SEMEX. Yuhan Cai,Xin Luna Dong,Alon Y. Halevy,Jing Michelle Liu,Jayant Madhavan 2005 The explosion of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search and query tools. The problem is exacerbated by the proliferation of varied electronic devices (laptops, PDAs, cellphones) that are at our disposal, which often hold subsets or variations of our data.
In fact, several recent venues have noted Personal Information Management (PIM) as an area of growing interest to the data management community [1, 8, 6] SIGMOD Conference Service Oriented Database Architecture: APP server-lite? David Campbell 2005 As the capabilities and service levels of enterprise database systems have evolved, they have collided with incumbent technologies such as TP-Monitors or Message Oriented Middleware (MOM). We believe this trend will continue and have architected the upcoming release of SQL Server to advance this technology trend. This paper describes the Service Oriented Database Architecture (SODA) developed for the Microsoft SQL Server DBMS. First, it motivates the need for building Service Oriented Architecture (SOA) features directly into a database engine. Second, it describes a set of features in SQL Server that have been designed for SOA use. Finally, it concludes with some thoughts on how SODA can enable multiple service deployment topologies. SIGMOD Conference A Nested Relational Approach to Processing SQL Subqueries. Bin Cao,Antonio Badia 2005 One of the most powerful features of SQL is the use of nested queries. Most research work on the optimization of nested queries focuses on aggregate subqueries. However, the solutions proposed for non-aggregate subqueries are still limited, especially for queries having multiple subqueries and null values. In this paper, we show that existing approaches to queries containing non-aggregate subqueries proposed in the literature (including rewrites) are not adequate. We then propose a new efficient approach, the nested relational approach, based on the nested relational algebra. Our approach directly unnests non-aggregate subqueries using hash joins, and treats all subqueries in a uniform manner, being able to deal with nested queries of any type and any level. We report on experimental work that confirms that existing approaches have difficulties dealing with non-aggregate subqueries, and that our approach offers better performance. We also discuss some possibilities for algebraic optimization and the issue of integrating our approach in a relational database system. SIGMOD Conference Lazy XML Updates: Laziness as a Virtue of Update and Structural Join Efficiency. Barbara Catania,Wen Qiang Wang,Beng Chin Ooi,Xiaoling Wang 2005 Lazy XML Updates: Laziness as a Virtue of Update and Structural Join Efficiency. SIGMOD Conference Stratified Computation of Skylines with Partially-Ordered Domains. Chee Yong Chan,Pin-Kwang Eng,Kian-Lee Tan 2005 In this paper, we study the evaluation of skyline queries with partially-ordered attributes. Because such attributes lack a total ordering, traditional index-based evaluation algorithms (e.g., NN and BBS) that are designed for totally-ordered attributes can no longer prune the space as effectively. Our solution is to transform each partially-ordered attribute into a two-integer domain that allows us to exploit index-based algorithms to compute skyline queries on the transformed space. Based on this framework, we propose three novel algorithms: BBS+ is a straightforward adaptation of BBS using the framework, and SDC (Stratification by Dominance Classification) and SDC+ are optimized to handle false positives and support progressive evaluation. Both SDC and SDC+ exploit a dominance relationship to organize the data into strata. While SDC generates its strata at run time, SDC+ partitions the data into strata offline. 
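As a point of reference for the stratified algorithms described here, the following Python sketch computes a skyline with the naive block-nested-loops baseline over one totally ordered attribute (price, lower is better) and one partially ordered attribute whose preference relation is listed explicitly. It is illustrative only and is the baseline such work improves on, not BBS+, SDC, or SDC+ themselves; the hotel data and preference pairs are made up.

    # Explicit partial order on room types: suite is preferred to everything,
    # deluxe and standard are incomparable (reflexive pairs included).
    PREFERS = {("suite", "suite"), ("deluxe", "deluxe"), ("standard", "standard"),
               ("suite", "deluxe"), ("suite", "standard")}

    def at_least_as_good(a, b):
        price_a, room_a = a
        price_b, room_b = b
        return price_a <= price_b and (room_a, room_b) in PREFERS

    def dominates(a, b):
        # a dominates b if a is at least as good everywhere and b is not as good as a
        return at_least_as_good(a, b) and not at_least_as_good(b, a)

    def skyline(points):
        return [p for i, p in enumerate(points)
                if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

    hotels = [(120, "standard"), (200, "suite"), (150, "deluxe"), (250, "standard")]
    print(skyline(hotels))
    # (250, 'standard') is dominated by (120, 'standard'); the rest are incomparable

Because "deluxe" and "standard" are incomparable, a cheaper standard room cannot dominate a pricier deluxe one, which is exactly why partially ordered attributes weaken the pruning that index-based skyline algorithms rely on.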
We also design two dominance classification strategies (MinPC and MaxPC) to further optimize the performance of SDC and SDC+. We implemented the proposed schemes and evaluated their efficiency. Our results show that our proposed techniques outperform existing approaches by a wide margin, with SDC+-MinPC giving the best performance in terms of both response time and progressiveness. To the best of our knowledge, this is the first paper to address the problem of skyline query evaluation involving partially-ordered attribute domains. SIGMOD Conference Data cleaning in microsoft SQL server 2005. Surajit Chaudhuri,Kris Ganjam,Venkatesh Ganti,Rahul Kapoor,Vivek R. Narasayya,Theo Vassilakis 2005 When collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive, challenge. Common data quality problems include inconsistent data conventions amongst sources such as different abbreviations or synonyms; data entry errors such as spelling mistakes; missing, incomplete, outdated or otherwise incorrect attribute values. These data defects generally manifest themselves as foreign-key mismatches and approximately duplicate records, both of which make further data mining and decision support analyses either impossible or suspect. We demonstrate two new data cleansing operators, Fuzzy Lookup and Fuzzy Grouping, which address these problems in a scalable and domain-independent manner. These operators are implemented within Microsoft SQL Server 2005 Integration Services. Our demo will explain their functionality and highlight multiple real-world scenarios in which they can be used to achieve high data quality. SIGMOD Conference When Can We Trust Progress Estimators for SQL Queries? Surajit Chaudhuri,Raghav Kaushik,Ravishankar Ramamurthy 2005 "The problem of estimating progress for long-running queries has recently been introduced. We analyze the characteristics of the progress estimation problem from the perspective of providing robust, worst-case guarantees. Our first result is that in the worst case, no progress estimation algorithm can yield anything even moderately better than the trivial guarantee that identifies the progress as lying between 0% and 100%. In such cases, we introduce an estimator that can optimally bound the error. However, we show that in many "good" scenarios, it is possible to design effective progress estimators with small error bounds. We then demonstrate empirically that these "good" scenarios are common in practice and discuss possible ways of combining the estimators." SIGMOD Conference Foundations of automated database tuning. Surajit Chaudhuri,Gerhard Weikum 2005 1. The Challenge of Total Cost of Ownership. Our society is more dependent on information systems than ever before. However, managing the information systems infrastructure in a cost-effective manner is a growing challenge. The total cost of ownership (TCO) of information technology is increasingly dominated by people costs. In fact, mistakes in operations and administration of information systems are the single biggest cause of system outages and unacceptable performance. For information systems to provide value to their customers, we must reduce the complexity associated with their deployment and usage. SIGMOD Conference On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques.
Ting Chen,Jiaheng Lu,Tok Wang Ling 2005 Searching for all occurrences of a twig pattern in an XML document is an important operation in XML query processing. Recently, a holistic method, TwigStack [2], has been proposed. The method avoids generating large intermediate results that do not contribute to the final answer and is CPU and I/O optimal when twig patterns only have ancestor-descendant relationships. Another important direction of XML query processing is to build structural indexes [3][8][13][15] over XML documents to avoid unnecessary scanning of source documents. We regard XML structural indexing as a technique to partition XML documents and call it a streaming scheme in this paper. In this paper we develop a method to perform holistic twig pattern matching on XML documents partitioned using various streaming schemes. Our method avoids unnecessary scanning of irrelevant portions of XML documents. More importantly, depending on the different streaming schemes used, it can process a large class of twig patterns consisting of both ancestor-descendant and parent-child relationships and avoid generating redundant intermediate results. Our experiments demonstrate the applicability and the performance advantages of our approach. SIGMOD Conference Efficient Computation of Multiple Group By Queries. Zhimin Chen,Vivek R. Narasayya 2005 "Data analysts need to understand the quality of data in the warehouse. This is often done by issuing many Group By queries on the sets of columns of interest. Since the volume of data in these warehouses can be large, and tables in a data warehouse often contain many columns, this analysis typically requires executing a large number of Group By queries, which can be expensive. We show that the performance of today's database systems for such data analysis is inadequate. We also show that the problem is computationally hard, and develop efficient techniques for solving it. We demonstrate significant speedup over existing approaches on today's commercial database systems." SIGMOD Conference Robust and Fast Similarity Search for Moving Object Trajectories. Lei Chen,M. Tamer Özsu,Vincent Oria 2005 An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR), which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods. SIGMOD Conference DBNotes: a post-it system for relational databases based on provenance.
Laura Chiticariu,Wang Chiew Tan,Gaurav Vijayvargiya 2005 We demonstrate DBNotes, a Post-It note system for relational databases where every piece of data may be associated with zero or more notes (or annotations). These annotations are transparently propagated along as data is being transformed. The method by which annotations are propagated is based on provenance (aka lineage): the annotations associated with a piece of data d in the result of a transformation consist of the annotations associated with each piece of data in the source where d is copied from. One immediate application of this system is to use annotations to systematically trace the provenance and flow of data. If every piece of source data has an annotation attached that describes its address (i.e., origins), then the annotations of a piece of data in the result of a transformation describe its provenance. Hence, one can easily determine the provenance of data through a sequence of transformation steps simply by examining the annotations. Annotations can also be used to store additional information about data. Since a database schema is often proprietary, the ability to insert new information about data without having to change the underlying schema is a useful feature. For example, an error report could be attached to an erroneous piece of data, and this error report will be propagated to other databases along transformations, thus notifying other users of the error. Overall, the annotations on the result of a transformation can also provide an estimate on the quality of the resulting database. SIGMOD Conference Page Quality: In Search of an Unbiased Web Ranking. Junghoo Cho,Sourashis Roy,Robert Adams 2005 "In a number of recent studies [4, 8] researchers have found that because search engines repeatedly return currently popular pages at the top of search results, popular pages tend to get even more popular, while unpopular pages get ignored by an average user. This "rich-get-richer" phenomenon is particularly problematic for new and high-quality pages because they may never get a chance to get users' attention, decreasing the overall quality of search results in the long run. In this paper, we propose a new ranking function, called page quality, that can alleviate the problem of popularity-based ranking. We first present a formal framework to study the search engine bias by discussing what is an "ideal" way to measure the intrinsic quality of a page. We then compare how PageRank, the current ranking metric used by major search engines, differs from this ideal quality metric. This framework will help us investigate the search engine bias in more concrete terms and provide a clear understanding of why PageRank is effective in many cases and exactly when it is problematic. We then propose a practical way to estimate the intrinsic page quality to avoid the inherent bias of PageRank. We derive our proposed quality estimator through a careful analysis of a reasonable web user model, and we present experimental results that show the potential of our proposed estimator. We believe that our quality estimator has the potential to alleviate the rich-get-richer phenomenon and help new and high-quality pages get the attention that they deserve." SIGMOD Conference Integration of structured and unstructured data in IBM content manager. David M. Choy 2005 Integration of structured and unstructured data goes much deeper than supporting large objects in a database.
Through an architecture overview of the IBM Content Manager, this paper examines some of the requirements, challenges, and solutions in managing a large volume of content and in support of a wide range of content applications. The discussion touches upon system architecture, data model, and access control. SIGMOD Conference Mining Top-k Covering Rule Groups for Gene Expression Data. Gao Cong,Kian-Lee Tan,Anthony K. H. Tung,Xin Xu 2005 In this paper, we propose a novel algorithm to discover the top-k covering rule groups for each row of gene expression profiles. Several experiments on real bioinformatics datasets show that the new top-k covering rule mining algorithm is orders of magnitude faster than previous association rule mining algorithms. Furthermore, we propose a new classification method RCBT. The RCBT classifier is constructed from the top-k covering rule groups. The rule groups generated for building RCBT are bounded in number. This is in contrast to existing rule-based classification methods like CBA [19] which, despite generating an excessive number of redundant rules, is still unable to cover some training data with the discovered rules. Experiments show that the RCBT classifier can match or outperform other state-of-the-art classifiers on several benchmark gene expression datasets. In addition, the top-k covering rule groups themselves provide insights into the mechanisms responsible for diseases directly. SIGMOD Conference Goals and Benchmarks for Autonomic Configuration Recommenders. Mariano P. Consens,Denilson Barbosa,Adrian M. Teisanu,Laurent Mignet 2005 We are witnessing an explosive increase in the complexity of the information systems we rely upon. Autonomic systems address this challenge by continuously configuring and tuning themselves. Recently, a number of autonomic features have been incorporated into commercial RDBMS; tools for recommending database configurations (i.e., indexes, materialized views, partitions) for a given workload are prominent examples of this promising trend. In this paper, we introduce a flexible characterization of the performance goals of configuration recommenders and develop an experimental evaluation approach to benchmark the effectiveness of these autonomic tools. We focus on exploratory queries and present extensive experimental results using both real and synthetic data that demonstrate the validity of the approach introduced. Our results identify a specific index configuration based on single-column indexes as a very useful baseline for comparisons in the exploratory setting. Furthermore, the experimental results demonstrate the unfulfilled potential for achieving improvements of several orders of magnitude. SIGMOD Conference Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles. Graham Cormode,Minos N. Garofalakis,S. Muthukrishnan,Rajeev Rastogi 2005 "While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates.
In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting --- our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., 'heavy-hitters' queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees." SIGMOD Conference "IBM SOA 'on the edge'." Gennaro A. Cuomo 2005 This paper introduces the concept of an SOA edge server and a set of complementary design patterns designed to optimize performance, improve manageability and enable customers to cost effectively deploy SOA applications into complex, mission-critical, high-volume distributed environments. SIGMOD Conference Stacked indexed views in microsoft SQL server. David DeHaan,Per-Åke Larson,Jingren Zhou 2005 Appropriately selected materialized views (also called indexed views) can speed up query execution by orders of magnitude. Most database systems limit support for materialized views to select-project-join expressions, possibly with a group-by, over base tables because this class of views can be efficiently maintained incrementally and thus kept up to date with the underlying source tables. However, limiting views to reference only base tables restricts the class of queries that can be supported by materialized views. View stacking (also called views on views) relaxes one restriction by allowing a materialized view to reference both base tables and other materialized views. This extends materialized view support to additional types of queries. This paper describes a prototype implementation of stacked views within Microsoft SQL Server and explains which classes of queries can be supported. To support view matching for stacked views, a signature mechanism was added to the optimizer. This mechanism turned out to be beneficial also for regular views by significantly speeding up view matching. SIGMOD Conference Predicate Result Range Caching for Continuous Queries. Matthew Denny,Michael J. Franklin 2005 Many analysis and monitoring applications require the repeated execution of expensive modeling functions over streams of rapidly changing data. These applications can often be expressed declaratively, but the continuous query processors developed to date are not designed to optimize queries with expensive functions. To speed up such queries, we present CASPER: the CAching System for PrEdicate Result ranges. CASPER computes and caches predicate result ranges, which are ranges of stream input values where the system knows the results of expensive predicate evaluations. Over time, CASPER expands ranges so that they are more likely to contain future stream values.
This paper presents the CASPER architecture, as well as algorithms for computing and expanding ranges for a large class of predicates. We demonstrate the effectiveness of CASPER using a prototype implementation and a financial application using real bond market data. SIGMOD Conference A Verifier for Interactive, Data-Driven Web Applications. Alin Deutsch,Monica Marcus,Liying Sui,Victor Vianu,Dayou Zhou 2005 We present WAVE, a verifier for interactive, database-driven Web applications specified using high-level modeling tools such as WebML. WAVE is complete for a broad class of applications and temporal properties. For other applications, WAVE can be used as an incomplete verifier, as commonly done in software verification. Our experiments on four representative data-driven applications and a battery of common properties yielded surprisingly good verification times, on the order of seconds. This suggests that interactive applications controlled by database queries may be unusually well suited to automatic verification. They also show that the coupling of model checking with database optimization techniques used in the implementation of WAVE can be extremely effective. This is significant both to the database area and to automatic verification in general. SIGMOD Conference Reference Reconciliation in Complex Information Spaces. Xin Dong,Alon Y. Halevy,Jayant Madhavan 2005 "Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one's desktop.Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidences. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark." SIGMOD Conference AGILE: Adaptive Indexing for Context-Aware Information Filters. Jens-Peter Dittrich,Peter M. Fischer,Donald Kossmann 2005 Information filtering has become a key technology for modern information systems. The goal of an information filter is to route messages to the right recipients (possibly none) according to declarative rules called profiles. In order to deal with high volumes of messages, several index structures have been proposed in the past. The challenge addressed in this paper is to carry out stateful information filtering in which profiles refer to values in a database or to previous messages. The difficulty is that database update streams need to be processed in addition to messages. This paper presents AGILE, a way to extend existing index structures so that the indexes adapt to the message/update workload and show good performance in all situations. 
Performance experiments show that AGILE is overall the clear winner as compared to the best existing approaches. In extreme situations in which it is not the winner, the overheads are small. SIGMOD Conference Relational data mapping in MIQIS. George H. L. Fletcher,Catharine M. Wyss 2005 "We demonstrate a prototype of the relational data mapping module of MIQIS, a formal framework for investigating information flow in peer-to-peer database management systems. Data maps constitute effective mappings between structured data sources. These mappings are the 'glue' for facilitating large scale ad-hoc information sharing between autonomous peers, and automating their discovery is one of the fundamental unsolved challenges for information interoperability and sharing. Our approach to automating data map discovery utilizes heuristic search within a space delineated by basic relational transformation operators. A novelty of our approach is that these operators include data to metadata transformations (and vice versa). This approach leverages new perspectives on the data mapping problem, and generalizes previous approaches such as token-based schema matching." SIGMOD Conference Meta-data version and configuration management in multi-vendor environments. John R. Friedrich 2005 "Nearly all components that comprise modern information technology, such as Computer Aided Software Engineering (CASE) tools, Enterprise Application Integration (EAI) environments, Extract/Transform/Load (ETL) engines, warehouses, EII, and Business Intelligence (BI), contain a great deal of meta-data, which often drives much of the tool's functionality. These metadata are distributed and duplicated, are oftentimes actively interacting with the tools as they process data, and are generally represented in a variety of methodologies. Meta-data exchange and reuse is now becoming commonplace. This article is based upon the real challenges found in these complicated meta-data environments, and identifies the often overlooked distinctions and importance of meta-data version and configuration management (CM), including the extensive use of automated meta-data comparison, mapping comparison, mapping generation and mapping update functions, which comprise a complete meta-data CM environment. Also addressed is the reality that most repositories are not up to the task of true version and configuration management, and thus true impact and lineage analysis, as their emphasis has been on the development of a single enterprise architecture and the concept of 'a single version of the truth.'" SIGMOD Conference Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors. Naga K. Govindaraju,Nikunj Raghuvanshi,Dinesh Manocha 2005 We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of ε-approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variable-sized sliding windows and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth.
We implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful co-processors for mining data streams. SIGMOD Conference ConQuer: Efficient Management of Inconsistent Databases. Ariel Fuxman,Elham Fazli,Renée J. Miller 2005 Although integrity constraints have long been used to maintain data consistency, there are situations in which they may not be enforced or satisfied. In this paper, we present ConQuer, a system for efficient and scalable answering of SQL queries on databases that may violate a set of constraints. ConQuer permits users to postulate a set of key constraints together with their queries. The system rewrites the queries to retrieve all (and only) data that is consistent with respect to the constraints. The rewriting is into SQL, so the rewritten queries can be efficiently optimized and executed by commercial database systems. We study the overhead of resolving inconsistencies dynamically (at query time). In particular, we present a set of performance experiments that compare the efficiency of the rewriting strategies used by ConQuer. The experiments use queries taken from the TPC-H workload. We show that the overhead is not onerous, and the consistent query answers can often be computed within twice the time required to obtain the answers to the original (non-rewritten) query. SIGMOD Conference A high-performance, transactional filestore for application servers. Bill Gallagher,Dean Jacobs,Anno Langen 2005 There is a class of data, including messages and business workflow state, for which conventional monolithic databases are less than ideal. Performance and scalability of Application Server systems can be dramatically increased by distributing such data across transactional filestores, each of which is bound to a server instance in a cluster. This paper describes a high-performance, transactional filestore that has been developed for the BEA WebLogic Application Server (TM) and benchmarks it against a database. The filestore uses a novel, platform-independent disk scheduling algorithm to minimize the latency of small, synchronous writes to disk. SIGMOD Conference XML and relational database management systems: inside Microsoft SQL Server 2005. Michael Rys 2005 XML and relational database management systems: inside Microsoft SQL Server 2005. SIGMOD Conference Efficient Keyword Search for Smallest LCAs in XML Databases. Yu Xu,Yannis Papakonstantinou 2005 "Keyword search is a proven, user-friendly way to query HTML documents in the World Wide Web. We propose keyword search in XML documents, modeled as labeled trees, and describe corresponding efficient algorithms. The proposed keyword search returns the set of smallest trees containing all keywords, where a tree is designated as 'smallest' if it contains no tree that also contains all keywords. Our core contribution, the Indexed Lookup Eager algorithm, exploits key properties of smallest trees in order to outperform prior algorithms by orders of magnitude when the query contains keywords with significantly different frequencies. The Scan Eager variant is tuned for the case where the keywords have similar frequencies.
We analytically and experimentally evaluate two variants of the Eager algorithm, along with the Stack algorithm [13]. We also present the XKSearch system, which utilizes the Indexed Lookup Eager, Scan Eager and Stack algorithms and a demo of which on DBLP data is available at http://www.db.ucsd.edu/projects/xksearch. Finally, we extend the Indexed Lookup Eager algorithm to answer Lowest Common Ancestor (LCA) queries." SIGMOD Conference The INFOMIX system for advanced integration of incomplete and inconsistent data. Nicola Leone,Gianluigi Greco,Giovambattista Ianni,Vincenzino Lio,Giorgio Terracina,Thomas Eiter,Wolfgang Faber,Michael Fink,Georg Gottlob,Riccardo Rosati,Domenico Lembo,Maurizio Lenzerini,Marco Ruzzi,Edyta Kalka,Bartosz Nowicki,Witold Staniszkis 2005 The task of an information integration system is to combine data residing at different sources, providing the user with a unified view of them, called global schema. Users formulate queries over the global schema, and the system suitably queries the sources, providing an answer to the user, who is not obliged to have any information about the sources. Recent developments in IT such as the expansion of the Internet and the World Wide Web, have made available to users a huge number of information sources, generally autonomous, heterogeneous and widely distributed: as a consequence, information integration has emerged as a crucial issue in many application domains, e.g., distributed databases, cooperative information systems, data warehousing, or on-demand computing. Recent estimates view information integration to be a $10 Billion market by 2006 [14]. SIGMOD Conference Update-Pattern-Aware Modeling and Processing of Continuous Queries. Lukasz Golab,M. Tamer Özsu 2005 Update-Pattern-Aware Modeling and Processing of Continuous Queries. SIGMOD Conference Clio grows up: from research prototype to industrial tool. Laura M. Haas,Mauricio A. Hernández,Howard Ho,Lucian Popa,Mary Roth 2005 "Clio, the IBM Research system for expressing declarative schema mappings, has progressed in the past few years from a research prototype into a technology that is behind some of IBM's mapping technology. Clio provides a declarative way of specifying schema mappings between either XML or relational schemas. Mappings are compiled into an abstract query graph representation that captures the transformation semantics of the mappings. The query graph can then be serialized into different query languages, depending on the kind of schemas and systems involved in the mapping. Clio currently produces XQuery, XSLT, SQL, and SQL/XML queries. In this paper, we revisit the architecture and algorithms behind Clio. We then discuss some implementation issues, optimizations needed for scalability, and general lessons learned in the road towards creating an industrial-strength tool." SIGMOD Conference Automated statistics collection in action. Peter J. Haas,Mokhtar Kandil,Alberto Lerner,Volker Markl,Ivan Popivanov,Vijayshankar Raman,Daniel C. Zilio 2005 If presented with inaccurate statistics, even the most sophisticated query optimizers make mistakes. They may wrongly estimate the output cardinality of a certain operation and thus make sub-optimal plan choices based on that cardinality. Maintaining accurate statistics is hard, both because each table may need a specifically parameterized set of statistics and because statistics get outdated as the database changes. 
Automated Statistic Collection (ASC) is a new component in IBM DB2 UDB that, without any DBA intervention, observes and analyzes the effects of faulty statistics and, in response, it triggers actions that continuously repair the latter. In this demonstration, we will show how ASC works to alleviate the DBA from the task of maintaining fresh, accurate statistics in several challenging scenarios. ASC is able to reconfigure the statistics collection parameters (e.g, number of frequent values for a column, or correlations between certain column pairs) on a per-table basis. ASC can also detect and guard against outdated statistics caused by high updates/inserts/deletes rates in volatile, dynamic databases. We will also show how ASC works from the inside: from how cardinality mis-estimations are introduced in different kind of operators, to how this error is propagated to later operations in the plan, to how this influences plan choices inside the optimizer. SIGMOD Conference Enterprise information integration: successes, challenges and controversies. Alon Y. Halevy,Naveen Ashish,Dina Bitton,Michael J. Carey,Denise Draper,Jeff Pollock,Arnon Rosenthal,Vishal Sikka 2005 "The goal of EII systems is to provide uniform access to multiple data sources without having to first load them into a data warehouse. Since the late 1990's, several EII products have appeared in the marketplace and significant experience has been accumulated from fielding such systems. This collection of articles, by individuals who were involved in this industry in various ways, describes some of these experiences and points to the challenges ahead." SIGMOD Conference QPipe: A Simultaneously Pipelined Relational Query Engine. Stavros Harizopoulos,Vladislav Shkapenyuk,Anastassia Ailamaki 2005 "Relational DBMS typically execute concurrent queries independently by invoking a set of operator instances for each query. To exploit common data retrievals and computation in concurrent queries, researchers have proposed a wealth of techniques, ranging from buffering disk pages to constructing materialized views and optimizing multiple queries. The ideas proposed, however, are inherently limited by the query-centric philosophy of modern engine designs. Ideally, the query engine should proactively coordinate same-operator execution among concurrent queries, thereby exploiting common accesses to memory and disks as well as common intermediate result computation.This paper introduces on-demand simultaneous pipelining (OSP), a novel query evaluation paradigm for maximizing data and work sharing across concurrent queries at execution time. OSP enables proactive, dynamic operator sharing by pipelining the operator's output simultaneously to multiple parent nodes. This paper also introduces QPipe, a new operator-centric relational engine that effortlessly supports OSP. Each relational operator is encapsulated in a micro-engine serving query tasks from a queue, naturally exploiting all data and work sharing opportunities. Evaluation of QPipe built on top of BerkeleyDB shows that QPipe achieves a 2x speedup over a commercial DBMS when running a workload consisting of TPC-H queries." SIGMOD Conference Information intelligence: metadata for information discovery, access, and integration. 
Randall Hauch,Alex Miller,Rob Cardwell 2005 "Integrating enterprise information requires an accurate, precise and complete understanding of the disparate data sources, the needs of the information consumers, and how these map to the semantic business concepts of the enterprise. We describe how MetaMatrix captures and manages this metadata through the use of the OMG's MOF architecture and multiple domain-specific modeling languages, and how this semantic and syntactic metadata is then used for a variety of purposes, including accessing data in real-time from the underlying enterprise systems, integrating it, and returning it as information expected by consumers." SIGMOD Conference MetaQuerier: querying structured web sources on-the-fly. Bin He,Zhen Zhang,Kevin Chen-Chuan Chang 2005 Recently, we witness the rapid growth and thus the prevalence of databases on the Web. Our recent survey [2] in April 2004 estimated 450,000 online databases. On this deep Web, myriad online databases provide dynamic query-based data access through their query interfaces, instead of static URL links. As the door to the deep Web, it is essential to integrate these query interfaces for integrating the deep Web. SIGMOD Conference A Generic Framework for Monitoring Continuous Spatial Queries over Moving Objects. Haibo Hu,Jianliang Xu,Dik Lun Lee 2005 This paper proposes a generic framework for monitoring continuous spatial queries over moving objects. The framework distinguishes itself from existing work by being the first to address the location update issue and to provide a common interface for monitoring mixed types of queries. Based on the notion of safe region, the client location update strategy is developed based on the queries being monitored. Thus, it significantly reduces the wireless communication and query reevaluation costs required to maintain the up-to-date query results. We propose algorithms for query evaluation/reevaluation and for safe region computation in this framework. Enhancements are also proposed to take advantage of two practical mobility assumptions: maximum speed and steady movement. The experimental results show that our framework substantially outperforms the traditional periodic monitoring scheme in terms of monitoring accuracy and CPU time while achieving a close-to-optimal wireless communication cost. The framework also can scale up to a large monitoring system and is robust under various object mobility patterns. SIGMOD Conference Deriving Private Information from Randomized Data. Zhengli Huang,Wenliang Du,Biao Chen 2005 "Randomization has emerged as a useful technique for data disguising in privacy-preserving data mining. Its privacy properties have been studied in a number of papers. Kargupta et al. challenged the randomization schemes, and they pointed out that randomization might not be able to preserve privacy. However, it is still unclear what factors cause such a security breach, how they affect the privacy preserving property of the randomization, and what kinds of data have higher risk of disclosing their private contents even though they are randomized.We believe that the key factor is the correlations among attributes. We propose two data reconstruction methods that are based on data correlations. One method uses the Principal Component Analysis (PCA) technique, and the other method uses the Bayes Estimate (BE) technique. 
We have conducted theoretical and experimental analysis on the relationship between data correlations and the amount of private information that can be disclosed based on our proposed data reconstruction schemes. Our studies have shown that when the correlations are high, the original data can be reconstructed more accurately, i.e., more private information can be disclosed. To improve privacy, we propose a modified randomization scheme, in which we make the correlation of the random noise 'similar' to that of the original data. Our results have shown that the reconstruction accuracy of both PCA-based and BE-based schemes becomes worse as the similarity increases." SIGMOD Conference A framework for processing complex document-centric XML with overlapping structures. Ionut Emil Iacob,Alex Dekhtyar 2005 Management of multihierarchical XML encodings has attracted the attention of a number of researchers both in databases [8] and in the humanities [10]. Encoding documents using multiple hierarchies can yield overlapping markup. Previously proposed solutions to management of document-centric XML with overlapping markup rely on the XML expertise of humans and their ability to maintain correct schemas for complex markup languages. We demonstrate a unified solution for management of complex, multihierarchical document-centric XML. Our framework includes software for storing, parsing, in-memory access, editing and querying multihierarchical XML documents with conflicting structures. SIGMOD Conference ProDA: a suite of web-services for progressive data analysis. Mehrdad Jahangiri,Cyrus Shahabi 2005 Online Scientific Applications (OSA) require statistical analysis of large multidimensional datasets. Towards this end, we have designed and developed a data storage and retrieval system, called ProDA, which deploys wavelet transform and provides fast approximate answers with progressively increasing accuracy in support of the OSA queries. ProDA employs a standard web-service infrastructure to enable remote users to interact with their data. These web-services enable wavelet transformation of large multidimensional datasets as well as inserting, updating, and exact, approximate and progressive querying of these datasets in the wavelet domain. We demonstrate the features of ProDA on a massive atmospheric dataset provided to us by NASA/JPL. SIGMOD Conference SHIFT-SPLIT: I/O Efficient Maintenance of Wavelet-Transformed Multidimensional Data. Mehrdad Jahangiri,Dimitris Sacharidis,Cyrus Shahabi 2005 The Discrete Wavelet Transform is a proven tool for a wide range of database applications. However, despite broad acceptance, some of its properties have not been fully explored and thus not exploited, particularly for two common forms of multidimensional decomposition. We introduce two novel operations for wavelet transformed data, termed SHIFT and SPLIT, based on the properties of wavelet trees, which work directly in the wavelet domain. We demonstrate their significance and usefulness by analytically proving six important results in four common data maintenance scenarios, i.e., transformation of massive datasets, appending data, approximation of data streams and partial data reconstruction, leading to significant I/O cost reduction in all cases. Furthermore, we show how these operations can be further improved in combination with the optimal coefficient-to-disk-block allocation strategy. Our exhaustive set of empirical experiments with real-world datasets verifies our claims.
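The two wavelet-related entries above (ProDA and SHIFT-SPLIT) both operate on wavelet-transformed multidimensional data. Purely as generic background, the sketch below shows one level of a standard one-dimensional Haar wavelet decomposition and its exact inverse; it is not the papers' SHIFT/SPLIT operators or ProDA's progressive query machinery, and the function names are illustrative assumptions.

```python
# Generic one-level Haar wavelet decomposition (orthonormal form) and its inverse.
# Illustrative only; not the SHIFT/SPLIT algorithms from the entries above.
import numpy as np

def haar_decompose(signal):
    """Return (averages, details) for one Haar level; input length assumed even."""
    data = np.asarray(signal, dtype=float)
    pairs = data.reshape(-1, 2)
    averages = pairs.sum(axis=1) / np.sqrt(2.0)            # low-frequency summary
    details = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)   # high-frequency detail
    return averages, details

def haar_reconstruct(averages, details):
    """Invert one decomposition level exactly."""
    evens = (averages + details) / np.sqrt(2.0)
    odds = (averages - details) / np.sqrt(2.0)
    out = np.empty(2 * len(averages))
    out[0::2], out[1::2] = evens, odds
    return out

if __name__ == "__main__":
    x = [2.0, 4.0, 8.0, 6.0]
    a, d = haar_decompose(x)
    assert np.allclose(haar_reconstruct(a, d), x)  # lossless round trip
```

Repeating the decomposition on the averages yields the usual multi-level wavelet tree that such systems store and query.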
SIGMOD Conference A Disk-Based Join With Probabilistic Guarantees. Chris Jermaine,Alin Dobra,Subramanian Arumugam,Shantanu Joshi,Abhijit Pol 2005 "One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm for computing the answer to such a query over large, disk-based input tables. The key innovation of our algorithm is that at all times, it provides an online, statistical estimator for the eventual answer to the query, as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy, or run the algorithm to completion with a total time requirement that is not much longer than other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into core memory." SIGMOD Conference Sampling Algorithms in a Stream Operator. Theodore Johnson,S. Muthukrishnan,Irina Rozenbaum 2005 Complex queries over high speed data streams often need to rely on approximations to keep up with their input. The research community has developed a rich literature on approximate streaming algorithms for this application. Many of these algorithms produce samples of the input stream, providing better properties than conventional random sampling. In this paper, we abstract the stream sampling process and design a new stream sample operator. We show how it can be used to implement a wide variety of algorithms that perform sampling and sampling-based aggregations. Also, we show how to implement the operator in Gigascope - a high speed stream database specialized for IP network monitoring applications. As an example study, we apply the operator within such an enhanced Gigascope to perform subset-sum sampling which is of great interest for IP network management. We evaluate this implemention on a live, high speed internet traffic data stream and find that (a) the operator is a flexible, versatile addition to Gigascope suitable for tuning and algorithm engineering, and (b) the operator imposes only a small evaluation overhead. This is the first operational implementation we know of, for a wide variety of stream sampling algorithms at line speed within a data stream management system. SIGMOD Conference SPIDER: flexible matching in databases. Nick Koudas,Amit Marathe,Divesh Srivastava 2005 We present a prototype system, SPIDER, developed at AT&T Labs-Research, which supports flexible string attribute value matching in large databases. We discuss the design principles on which SPIDER is based, describe the basic techniques encompassed by the tool and provide a description of the demo. SIGMOD Conference Constrained Optimalities in Query Personalization. Georgia Koutrika,Yannis E. Ioannidis 2005 "Personalization is a powerful mechanism that helps users to cope with the abundance of information on the Web. Database query personalization achieves this by dynamically constructing queries that return results of high interest to the user. This, however, may conflict with other constraints on the query execution time and/or result size that may be imposed by the search context, such as the device used, the network connection, etc. 
For example, if the user is accessing information using a mobile phone, then it is desirable to construct a personalized query that executes quickly and returns a handful of answers. Constrained Query Personalization (CQP) is an integrated approach to database query answering that dynamically takes into account the queries issued, the user's interest in the results, response time, and result size in order to build personalized queries. In this paper, we introduce CQP as a family of constrained optimization problems, where each time one of the parameters of concern is optimized while the others remain within the bounds of range constraints. Taking into account some key (exact or approximate) properties of these parameters, we map CQP to a state search problem and provide several algorithms for the discovery of optimal solutions. Experimental results demonstrate the effectiveness of the proposed techniques and the appropriateness of the overall approach." SIGMOD Conference To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. Laks V. S. Lakshmanan,Raymond T. Ng,Ganesh Ramesh 2005 "Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis-à-vis the risk of disclosing proprietary or sensitive information. While there are various 'sanitization' methods, in this paper we focus on anonymization, given its widespread use in practice. We give due diligence to the question of 'just how safe the anonymized data is', in terms of protecting the true identities of the data objects. We consider both the scenarios when the hacker has no information, and more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining. We propose to capture the prior knowledge of the hacker by means of a belief function, where an educated guess of the frequency of each item is assumed. For various classes of belief functions, which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of 'cracks'. While obtaining the exact values for the more general situations is computationally hard, we propose a heuristic called the O-estimate. It is easy to compute, and is shown to be accurate empirically with real benchmark datasets. Finally, based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma." SIGMOD Conference Incognito: Efficient Full-Domain K-Anonymity. Kristen LeFevre,David J. DeWitt,Raghu Ramakrishnan 2005 "A number of organizations publish microdata for purposes such as public health and demographic research. Although attributes that clearly identify individuals, such as Name and Social Security Number, are generally removed, these databases can sometimes be joined with other public databases on attributes such as Zipcode, Sex, and Birthdate to re-identify individuals who were supposed to remain anonymous. 'Joining' attacks are made easier by the availability of other, complementary, databases over the Internet. K-anonymization is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. In this paper, we provide a practical framework for implementing one model of k-anonymization, called full-domain generalization.
We introduce a set of algorithms for producing minimal full-domain generalizations, and show that these algorithms perform up to an order of magnitude faster than previous algorithms on two real-life databases. Besides full-domain generalization, numerous other models have also been proposed for k-anonymization. The second contribution in this paper is a single taxonomy that categorizes previous models and introduces some promising new alternatives." SIGMOD Conference STRG-Index: Spatio-Temporal Region Graph Indexing for Large Video Databases. JeongKyu Lee,Jung-Hwan Oh,Sae Hwang 2005 In this paper, we propose a new graph-based data structure and indexing method to organize and retrieve video data. Several studies have shown that a graph can be a better candidate for modeling semantically rich and complicated multimedia data. However, there are few methods that consider the temporal feature of video data, which is a distinguishable and representative characteristic when compared with other multimedia (i.e., images). In order to consider the temporal feature effectively and efficiently, we propose a new graph-based data structure called Spatio-Temporal Region Graph (STRG). Unlike existing graph-based data structures which provide only spatial features, the proposed STRG further provides temporal features, which represent temporal relationships among spatial objects. The STRG is decomposed into its subgraphs in which redundant subgraphs are eliminated to reduce the index size and search time, because the computational complexity of graph matching (subgraph isomorphism) is NP-complete. In addition, a new distance measure, called Extended Graph Edit Distance (EGED), is introduced in both non-metric and metric spaces for matching and indexing respectively. Based on STRG and EGED, we propose a new indexing method STRG-Index, which is faster and more accurate since it uses a tree structure and a clustering algorithm. We compare the STRG-Index with the M-tree, which is a popular tree-based indexing method for multimedia data. The STRG-Index outperforms the M-tree for various query loads in terms of cost and speed. SIGMOD Conference Immortal DB: transaction time support for SQL server. David B. Lomet,Roger S. Barga,Mohamed F. Mokbel,German Shegalov,Rui Wang,Yunyue Zhu 2005 "Immortal DB builds transaction time database support into the SQL Server engine, not in middleware. Transaction time databases retain and provide access to prior states of a database. An update 'inserts' a new record while preserving the old version. The system supports as-of queries returning records current at the specified time. It also supports snapshot isolation concurrency control. Versions are stamped with the times of their updating transactions. The timestamp order agrees with transaction serialization order. Lazy timestamping propagates timestamps to all updates of a transaction after commit. All versions are kept in an integrated storage structure, with historical versions initially stored with current data. Time-splits of pages permit large histories to be maintained, and enable time based indexing. We demonstrate Immortal DB with a moving objects application that tracks cars in the Seattle area." SIGMOD Conference RankSQL: Query Algebra and Optimization for Relational Top-k Queries. Chengkai Li,Kevin Chen-Chuan Chang,Ihab F.
Ilyas,Sumin Song 2005 "This paper introduces RankSQL, a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS), by extending relational algebra and query optimization. Previously, top-k query processing was studied in the middleware scenario or in RDBMS in a 'piecemeal' fashion, i.e., focusing on specific operators or sitting outside the core of query engines. In contrast, we aim to support ranking as a first-class database construct. As a key insight, the new ranking relationship can be viewed as another logical property of data, parallel to the 'membership' property of the relational data model. While membership is essentially supported in RDBMS, the same support for ranking is clearly lacking. We address the fundamental integration of ranking in RDBMS in a way similar to how membership, i.e., Boolean filtering, is supported. We extend relational algebra by proposing a rank-relational model to capture the ranking property, and introducing new and extended operators to support ranking as a first-class construct. Enabled by the extended algebra, we present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. To optimize top-k queries, we propose a dimensional enumeration algorithm to explore the extended plan space by enumerating plans along two dual dimensions: ranking and membership. We also propose a sampling-based method to estimate the cardinality of rank-aware operators, for costing plans. Our experiments show the validity of our framework and the accuracy of the proposed estimation model." SIGMOD Conference Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Jin Li,David Maier,Kristin Tufte,Vassilis Papadimos,Peter A. Tucker 2005 A windowed query operator breaks a data stream into possibly overlapping subsets of data and computes a result over each. Many stream systems can evaluate window aggregate queries. However, current stream systems suffer from a lack of an explicit definition of window semantics. As a result, their implementations unnecessarily confuse window definition with physical stream properties. This confusion complicates the stream system, and even worse, can hurt performance both in terms of memory usage and execution time. To address this problem, we propose a framework for defining window semantics, which can be used to express almost all types of windows of which we are aware, and which is easily extensible to other types of windows that may occur in the future. Based on this definition, we explore a one-pass query evaluation strategy, the Window-ID (WID) approach, for various types of window aggregate queries. WID significantly reduces both required memory space and execution time for a large class of window definitions. In addition, WID can leverage punctuations to gracefully handle disorder. Our experimental study shows that WID has better execution-time performance than existing window aggregate query evaluation options that retain and reprocess tuples, and has better latency-accuracy tradeoffs for disordered input streams compared to using a fixed delay for handling disorder. SIGMOD Conference NaLIX: an interactive natural language interface for querying XML. Yunyao Li,Huahai Yang,H. V.
Jagadish 2005 Database query languages can be intimidating to the non-expert, leading to the immense recent popularity for keyword based search in spite of its significant limitations. The holy grail has been the development of a natural language query interface. We present NaLIX, a generic interactive natural language query interface to an XML database. Our system can accept an arbitrary English language sentence as query input, which can include aggregation, nesting, and value joins, among other things. This query is translated, potentially after reformulation, into an XQuery expression that can be evaluated against an XML database. The translation is done through mapping grammatical proximity of natural language parsed tokens to proximity of corresponding elements in the result XML. In this demonstration, we show that NaLIX, while far from being able to pass the Turing test, is perfectly usable in practice, and able to handle even quite complex queries in a variety of application domains. In addition, we also demonstrate how carefully designed features in NaLIX facilitate the interactive query process and improve the usability of the interface. SIGMOD Conference Middleware based Data Replication providing Snapshot Isolation. Yi Lin,Bettina Kemme,Marta Patiño-Martínez,Ricardo Jiménez-Peris 2005 "Many cluster based replication solutions have been proposed providing scalability and fault-tolerance. Many of these solutions perform replica control in a middleware on top of the database replicas. In such a setting concurrency control is a challenge and is often performed on a table basis. Additionally, some systems put severe requirements on transaction programs (e.g., to declare all objects to be accessed in advance). This paper addresses these issues and presents a middleware-based replication scheme which provides the popular snapshot isolation level at the same tuple-level granularity as database systems like PostgreSQL and Oracle, without any need to declare transaction properties in advance. Both read-only and update transactions can be executed at any replica while providing data consistency at all times. Our approach provides what we call ""1-copy-snapshot-isolation"" as long as the underlying database replicas provide snapshot isolation. We have implemented our approach as a replicated middleware on top of PostgreSQL replicas. By providing a standard JDBC interface, the middleware is completely transparent to the client program. Fault-tolerance is provided by automatically reconnecting clients in case of crashes. Our middleware shows good performance in terms of response times and scalability." SIGMOD Conference Guaranteeing Correctness and Availability in P2P Range Indices. Prakash Linga,Adina Crainiceanu,Johannes Gehrke,Jayavel Shanmugasundaram 2005 New and emerging P2P applications require sophisticated range query capability and also have strict requirements on query correctness, system availability and item availability. While there has been recent work on developing new P2P range indices, none of these indices guarantee correctness and availability. In this paper, we develop new techniques that can provably guarantee the correctness and availability of P2P range indices. We develop our techniques in the context of a general P2P indexing framework that can be instantiated with most P2P index structures from the literature. As a specific instantiation, we implement P-Ring, an existing P2P range index, and show how it can be extended to guarantee correctness and availability. 
We quantitatively evaluate our techniques using a real distributed implementation. SIGMOD Conference A native extension of SQL for mining data streams. Chang Luo,Hetal Thakkar,Haixun Wang,Carlo Zaniolo 2005 ESL enables users to develop stream applications in an SQL-like high-level language that provides the ease-of-use of a declarative language, which is Turing complete in terms of expressive power [11]. SIGMOD Conference Native Xquery processing in oracle XMLDB. Zhen Hua Liu,Muralidhar Krishnaprasad,Vikas Arora 2005 With XQuery becoming the standard language for querying XML, and the relational SQL platform being recognized as an important platform to store and process XML, the SQL/XML standard is integrating XML query capability into the SQL system by introducing new SQL functions and constructs such as XMLQuery() and XMLTable. This paper discusses the Oracle XMLDB XQuery architecture for supporting XQuery in the Oracle ORDBMS kernel which has the XQuery processing tightly integrated with the SQL/XML engine using native XQuery compilation, optimization and execution techniques. SIGMOD Conference A system for analyzing and indexing human-motion databases. Guodong Liu,Jingdan Zhang,Wei Wang,Leonard McMillan 2005 We demonstrate a data-driven approach for representing, compressing, and indexing human-motion databases. Our modeling approach is based on piecewise-linear components that are determined via a divisive clustering method. Selection of the appropriate linear model is determined automatically via a classifier using a subspace of the most significant, or principal, features (markers). We show that, after offline training, our model can accurately estimate and classify human motions. We can also construct indexing structures for motion sequences according to their transition trajectories through these linear components. Our method not only provides indices for whole and/or partial motion sequences, but also serves as a compressed representation for the entire motion database. Our method also tends to be immune to temporal variations, and thus avoids the expense of time-warping. SIGMOD Conference Lean middleware. David A. Maluf,David G. Bell,Naveen Ashish 2005 "This paper describes an approach to achieving data integration across multiple sources in an enterprise, in a manner that does not require heavy investment in database and middleware maintenance. This 'lean' approach to integration leads to cost-effectiveness and scalability of data integration in the enterprise." SIGMOD Conference Tributaries and Deltas: Efficient and Robust Aggregation in Sensor Network Streams. Amit Manjhi,Suman Nath,Phillip B. Gibbons 2005 Existing energy-efficient approaches to in-network aggregation in sensor networks can be classified into two categories, tree-based and multi-path-based, with each having unique strengths and weaknesses. In this paper, we introduce Tributary-Delta, a novel approach that combines the advantages of the tree and multi-path approaches by running them simultaneously in different regions of the network. We present schemes for adjusting the regions in response to changes in network conditions, and show how many useful aggregates can be readily computed within this new framework. We then show how a difficult aggregate for this context---finding frequent items---can be efficiently computed within the framework. To this end, we devise the first algorithm for frequent items (and for quantiles) that provably minimizes the worst case total communication for non-regular trees.
In addition, we give a multi-path algorithm for frequent items that is considerably more accurate than previous approaches. These algorithms form the basis for our efficient Tributary-Delta frequent items algorithm. Through extensive simulation with real-world and synthetic data, we show the significant advantages of our techniques. For example, in computing Count under realistic loss rates, our techniques reduce answer error by up to a factor of 3 compared to any previous technique. SIGMOD Conference Supporting Executable Mappings in Model Management. Sergey Melnik,Philip A. Bernstein,Alon Y. Halevy,Erhard Rahm 2005 Model management is an approach to simplify the programming of metadata-intensive applications. It offers developers powerful operators, such as Compose, Diff, and Merge, that are applied to models, such as database schemas or interface specifications, and to mappings between models. Prior model management solutions focused on a simple class of mappings that do not have executable semantics. Yet many metadata applications require that mappings be executable, expressed in SQL, XSLT, or other data transformation languages. In this paper, we develop a semantics for model-management operators that allows applying the operators to executable mappings. Our semantics captures previously-proposed desiderata and is language-independent: the effect of the operators is expressed in terms of what they do to the instances of models and mappings. We describe an implemented prototype in which mappings are represented as dependencies between relational schemas, and discuss algebraic optimization of model-management scripts. SIGMOD Conference SMART: a tool for semantic-driven creation of complex XML mappings. "Atsuyuki Morishima,Toshiaki Okawara,Jun'ichi Tanaka,Ken'ichi Ishikawa" 2005 We focus on the problem of data transformations, i.e., how to transform data to another structure to adapt it to new application requirements or given environments. Here, we define data transformation as the process of taking as input two schemas A and B and an instance of A, and producing an instance of B. Today, data transformations are required in many situations: to integrate multiple information sources, to construct and receive data for Web services, and to migrate data from legacy systems to new systems, from local databases to data warehouses. This demonstration focuses on XML transformations, since XML is the de facto standard for data exchange. SIGMOD Conference Research issues in protein location image databases. Robert F. Murphy,Christos Faloutsos 2005 Which proteins have similar locations within cells? How many distinct location patterns do cells display? How do we answer these questions quickly, from a large collection of microscope images such as in on-line journals? SIGMOD Conference Towards an enterprise XML architecture. Ravi Murthy,Zhen Hua Liu,Muralidhar Krishnaprasad,Sivasankaran Chandrasekar,Anh-Tuan Tran,Eric Sedlar,Daniela Florescu,Susan Kotsovolos,Nipun Agarwal,Vikas Arora,Viswanathan Krishnamurthy 2005 XML is being increasingly used in diverse domains ranging from data and application integration to content management. Oracle provides an enterprise-wide platform for managing all types of XML content. Within the Oracle database and the application server, the XML content can be efficiently stored using a variety of storage and indexing methods and it can be processed using multiple standard languages within different programmatic environments.
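The Tributaries and Deltas entry above centers on in-network computation of frequent-items aggregates. Purely as background, and not as that paper's tree- or multi-path-based algorithms, the sketch below shows the classic Misra-Gries summary, a standard single-site building block for approximate frequent-items tracking; the function name and parameters are illustrative assumptions.

```python
# Generic Misra-Gries summary for approximate frequent items over a stream.
# Background sketch only; not the Tributary-Delta algorithms themselves.
def misra_gries(stream, k):
    """Track at most k-1 candidate heavy hitters.

    Any item occurring more than len(stream)/k times is guaranteed to
    remain in the returned dictionary (with an undercounted frequency).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

if __name__ == "__main__":
    data = ["a", "b", "a", "c", "a", "a", "b", "a"]
    print(misra_gries(data, k=3))  # the dominant item "a" is retained
```

Distributed schemes such as the one described in that entry combine per-node summaries of this general kind up an aggregation topology, which is where the tree versus multi-path trade-offs arise.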
SIGMOD Conference Optimizing recursive queries in SQL. Carlos Ordonez 2005 Recursion represents an important addition to the SQL language. This work focuses on the optimization of linear recursive queries in SQL. To provide an abstract framework for discussion, we focus on computing the transitive closure of a graph. Three optimizations are studied: (1) Early evaluation of row selection conditions. (2) Eliminating duplicate rows in intermediate tables. (3) Defining an enhanced index to accelerate join computation. Optimizations are evaluated on two types of graphs: binary trees and sparse graphs. Binary trees represent an ideal graph with no cycles and a linear number of edges. Sparse graphs represent an average case with some cycles and a linear number of edges. In general, the proposed optimizations produce a significant reduction in the evaluation time of recursive queries. SIGMOD Conference System RX: One Part Relational, One Part XML. Kevin S. Beyer,Roberta Cochrane,Vanja Josifovski,Jim Kleewein,George Lapis,Guy M. Lohman,Robert Lyle,Fatma Özcan,Hamid Pirahesh,Normen Seemann,Tuong C. Truong,Bert Van der Linden,Brian Vickery,Chun Zhang 2005 This paper describes the overall architecture and design aspects of a hybrid relational and XML database system called System RX. We believe that such a system is fundamental in the evolution of enterprise data management solutions: XML and relational data will co-exist and complement each other in enterprise solutions. Furthermore, a successful XML repository requires much of the same infrastructure that already exists in a relational database management system. Finally, XML query languages have considerable conceptual and functional overlap with relational dataflow engines. System RX is the first truly hybrid system that comingles XML and relational data, giving them equal footing. The new support for XML includes native support for storage and indexing as well as query compilation and evaluation support for the latest industry-standard query languages, SQL/XML and XQuery. By building a hybrid system, we leverage more than 20 years of data management research to advance XML technology to the same standards expected from mature relational systems. SIGMOD Conference Verifying Completeness of Relational Query Results in Data Publishing. HweeHwa Pang,Arpit Jain,Krithi Ramamritham,Kian-Lee Tan 2005 In data publishing, the owner delegates the role of satisfying user queries to a third-party publisher. As the publisher may be untrusted or susceptible to attacks, it could produce incorrect query results. In this paper, we introduce a scheme for users to verify that their query results are complete (i.e., no qualifying tuples are omitted) and authentic (i.e., all the result values originated from the owner). The scheme supports range selection on key and non-key attributes, project as well as join queries on relational databases. Moreover, the proposed scheme complies with access control policies, is computationally secure, and can be implemented efficiently. SIGMOD Conference Conceptual Partitioning: An Efficient Method for Continuous Nearest Neighbor Monitoring. Kyriakos Mouratidis,Marios Hadjieleftheriou,Dimitris Papadias 2005 Conceptual Partitioning: An Efficient Method for Continuous Nearest Neighbor Monitoring. SIGMOD Conference Impact of SOA on enterprise information architectures. 
Paul Patrick 2005 "Enterprises are looking to find new and cost effective means to leverage existing investments in IT infrastructure and incorporate new capabilities in order to improve business productivity. As a means to improve the integration of applications hosted both internally and externally to the enterprise, enterprises are turning to Service Oriented Architectures. In this paper, we describe some of the major aspects associated with the introduction of a Service Oriented Architecture and the impact that it can have on an enterprise's information architecture. We outline the concept of exposing data sources as services and discuss the critical integration aspects that need to be addressed including data access, data transformation, and integration into an overarching enterprise security scheme. The paper suggests alternatives, utilizing a Service Oriented Architecture approach, to promote flexible, extensible, and evolvable information architectures." SIGMOD Conference Relational Confidence Bounds Are Easy With The Bootstrap. Abhijit Pol,Chris Jermaine 2005 "Statistical estimation and approximate query processing have become increasingly prevalent applications for database systems. However, approximation is usually of little use without some sort of guarantee on estimation accuracy, or ""confidence bound."" Analytically deriving probabilistic guarantees for database queries over sampled data is a daunting task, not suitable for the faint of heart, and certainly beyond the expertise of the typical database system end-user. This paper considers the problem of incorporating into a database system a powerful ""plug-in"" method for computing confidence bounds on the answer to relational database queries over sampled or incomplete data. This statistical tool, called the bootstrap, is simple enough that it can be used by a database programmer with a rudimentary mathematical background, but general enough that it can be applied to almost any statistical inference problem. Given the power and ease-of-use of the bootstrap, we argue that the algorithms presented for supporting the bootstrap should be incorporated into any database system which is intended to support analytic processing." SIGMOD Conference Events on the edge. Shariq Rizvi,Shawn R. Jeffery,Sailesh Krishnamurthy,Michael J. Franklin,Nathan Burkhart,Anil Edakkunni,Linus Liang 2005 The emergence of large-scale receptor-based systems has enabled applications to execute complex business logic over data generated from monitoring the physical world. An important functionality required by these applications is the detection and response to complex events, often in real-time. Bridging the gap between low-level receptor technology and such high-level needs of applications remains a significant challenge. We demonstrate our solution to this problem in the context of HiFi, a system we are building to solve the data management problems of large-scale receptor-based systems. Specifically, we show how HiFi generates simple events out of receptor data at its edges and provides high-functionality complex event processing mechanisms for sophisticated event detection using a real-world library scenario. SIGMOD Conference Managing structure in bits & pieces: the killer use case for XML. Eric Sedlar 2005 "This paper asserts that for databases to manage a significantly greater percentage of the world's data, managing structural information must get significantly easier.
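The plug-in statistical tool referred to in the bootstrap abstract above is, in its simplest percentile form, easy to sketch. The code below is a generic illustration with a hypothetical AVG-over-a-sample estimator; it is not the paper's in-database algorithms.

    import random

    # Percentile bootstrap: resample with replacement, recompute the estimate,
    # and take empirical quantiles as the confidence bound.
    def bootstrap_bounds(sample, estimator, n_resamples=1000, alpha=0.05):
        stats = []
        for _ in range(n_resamples):
            resample = [random.choice(sample) for _ in sample]   # sample with replacement
            stats.append(estimator(resample))
        stats.sort()
        lo = stats[int((alpha / 2) * n_resamples)]
        hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
        return lo, hi

    # e.g., a bound on an AVG(salary) estimate computed from a small sample
    sample = [48_000, 52_000, 61_000, 39_000, 75_000, 58_000]
    print(bootstrap_bounds(sample, lambda s: sum(s) / len(s)))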
XML technologies provide a widely accepted basis for significant advances in managing data structure. Topics include schema design, evolution, and versioning; managing related applications; and application architecture." SIGMOD Conference BRAID: Stream Mining through Group Lag Correlations. Yasushi Sakurai,Spiros Papadimitriou,Christos Faloutsos 2005 "The goal is to monitor multiple numerical streams, and determine which pairs are correlated with lags, as well as the value of each such lag. Lag correlations (and anti-correlations) are frequent, and very interesting in practice: For example, a decrease in interest rates typically precedes an increase in house sales by a few months; higher amounts of fluoride in the drinking water may lead to fewer dental cavities, some years later. Additional settings include network analysis, sensor monitoring, financial data analysis, and moving object tracking. Such data streams are often correlated (or anti-correlated), but with an unknown lag.We propose BRAID, a method to detect lag correlations between data streams. BRAID can handle data streams of semi-infinite length, incrementally, quickly, and with small resource consumption. We also provide a theoretical analysis, which, based on Nyquist's sampling theorem, shows that BRAID can estimate lag correlations with little, and often with no error at all. Our experiments on real and realistic data show that BRAID detects the correct lag perfectly most of the time (the largest relative error was about 1%); while it is up to 40,000 times faster than the naive implementation." SIGMOD Conference XML and relational database management systems: the inside story. Michael Rys,Donald D. Chamberlin,Daniela Florescu 2005 As XML has evolved from a document markup language to a widely-used format for exchange of structured and semistructured data, managing large amounts of XML data has become increasingly important. A number of companies, including both established database vendors and startups, have recently announced new XML database systems or new XML functionality integrated into existing database systems. This tutorial will provide an insight into how XML functionality fits into relational database management systems as seen by three major relational vendors: IBM, Microsoft and Oracle. SIGMOD Conference Computing for biologists: lessons from some successful case studies. Dennis Shasha 2005 My presentation will be online at the address http://cs.nyu.edu/cs/faculty/shasha/papers/sigmodtut05.ppt in addition to at the SIGMOD site. The presentation discusses computational techniques that have helped biologists, including combinatorial design to support a disciplined experimental design, visualization techniques to display the interaction among multiple inputs, and the discovery of gene function through the search through related species, and others.In this writeup, I confine myself to informal remarks describing both social and technical lessons I have learned while working with biologists. I intersperse these comments with references to relevant papers when appropriate.The tutorial is meant to appeal to researchers and practitioners in databases, data mining, and combinatorial algorithms as well as to natural scientists, especially biologists. SIGMOD Conference Towards Effective Indexing for Very Large Video Sequence Database. 
Heng Tao Shen,Beng Chin Ooi,Xiaofang Zhou 2005 "With rapid advances in video processing technologies and ever fast increments in network bandwidth, the popularity of video content publishing and sharing has made similarity search an indispensable operation to retrieve videos of user interests. The video similarity is usually measured by the percentage of similar frames shared by two video sequences, and each frame is typically represented as a high-dimensional feature vector. Unfortunately, high complexity of video content has posed the following major challenges for fast retrieval: (a) effective and compact video representations, (b) efficient similarity measurements, and (c) efficient indexing on the compact representations. In this paper, we propose a number of methods to achieve fast similarity search for very large video database. First, each video sequence is summarized into a small number of clusters, each of which contains similar frames and is represented by a novel compact model called Video Triplet (ViTri). ViTri models a cluster as a tightly bounded hypersphere described by its position, radius, and density. The ViTri similarity is measured by the volume of intersection between two hyperspheres multiplying the minimal density, i.e., the estimated number of similar frames shared by two clusters. The total number of similar frames is then estimated to derive the overall similarity between two video sequences. Hence the time complexity of video similarity measure can be reduced greatly. To further reduce the number of similarity computations on ViTris, we introduce a new one dimensional transformation technique which rotates and shifts the original axis system using PCA in such a way that the original inter-distance between two high-dimensional vectors can be maximally retained after mapping. An efficient B+-tree is then built on the transformed one dimensional values of ViTris' positions. Such a transformation enables B+-tree to achieve its optimal performance by quickly filtering a large portion of non-similar ViTris. Our extensive experiments on real large video datasets prove the effectiveness of our proposals that outperform existing methods significantly." SIGMOD Conference Data and metadata management in service-oriented architectures: some open challenges. Vishal Sikka 2005 "Over the last decade, the role of information technology in enterprises has been transforming from one of providing automation services to one of enabling business innovation. IT's charter is now closely aligned with the business goals and processes in a company and to support this charter, enterprise application architecture is shifting towards what's commonly referred to as a services-oriented architecture (SOA), or an enterprise-services architecture [1, 2]. In this talk, I want to discuss the shift to this new architecture and some ramifications of this, in particular some challenges posed by this shift for our research community to pursue." SIGMOD Conference Magnet: Supporting Navigation in Semistructured Data Environments. Vineet Sinha,David R. Karger 2005 "With the growing importance of systems containing arbitrary semi-structured relationships, the need for supporting users searching in such repositories has grown. Currently support for users' search needs either has required domain-specific user interfaces or has required users to be schema experts. 
We have developed a general-purpose tool that offers users helpful navigation and refinement options for seeking information in these semistructured repositories. We show how a tool can be built without requiring domain-specific assumptions about the information being explored. In addition to describing a general approach to the problem, we provide a set of natural, general-purpose refinement tactics, many generalized from past work on textual information retrieval." SIGMOD Conference RPJ: Producing Fast Join Results on Streams through Rate-based Optimization. Yufei Tao,Man Lung Yiu,Dimitris Papadias,Marios Hadjieleftheriou,Nikos Mamoulis 2005 "We consider the problem of ""progressively"" joining relations whose records are continuously retrieved from remote sources through an unstable network that may incur temporary failures. The objectives are to (i) start reporting the first output tuples as soon as possible (before the participating relations are completely received), and (ii) produce the remaining results at a fast rate. We develop a new algorithm RPJ (Rate-based Progressive Join) based on solid theoretical analysis. RPJ maximizes the output rate by optimizing its execution according to the characteristics of the join relations (e.g., data distribution, tuple arrival pattern, etc.). Extensive experiments prove that our technique delivers results significantly faster than the previous methods." SIGMOD Conference Incremental Maintenance of Path Expression Views. "Arsany Sawires,Jun'ichi Tatemura,Oliver Po,Divyakant Agrawal,K. Selçuk Candan" 2005 Caching data by maintaining materialized views typically requires updating the cache appropriately to reflect dynamic source updates. Extensive research has addressed the problem of incremental view maintenance for relational data but only few works have addressed it for semi-structured data. In this paper we address the problem of incremental maintenance of views defined over XML documents using path-expressions. The approach described in this paper has the following main features that distinguish it from the previous works: (1) The view specification language is powerful and standardized enough to be used in realistic applications. (2) The size of the auxiliary data maintained with the views depends on the expression size and the answer size regardless of the source data size.(3) No source schema is assumed to exist; the source data can be any general well-formed XML document. Experimental evaluation is conducted to assess the performance benefits of the proposed approach. SIGMOD Conference Multiple Aggregations Over Data Streams. Rui Zhang,Nick Koudas,Beng Chin Ooi,Divesh Srivastava 2005 Monitoring aggregates on IP traffic data streams is a compelling application for data stream management systems. The need for exploratory IP traffic data analysis naturally leads to posing related aggregation queries on data streams, that differ only in the choice of grouping attributes. In this paper, we address this problem of efficiently computing multiple aggregations over high speed data streams, based on a two-level LFTA/HFTA DSMS architecture, inspired by Gigascope.Our first contribution is the insight that in such a scenario, additionally computing and maintaining fine-granularity aggregation queries (phantoms) at the LFTA has the benefit of supporting shared computation. Our second contribution is an investigation into the problem of identifying beneficial LFTA configurations of phantoms and user-queries. 
We formulate this problem as a cost optimization problem, which consists of two sub-optimization problems: how to choose phantoms and how to allocate space for them in the LFTA. We formally show the hardness of determining the optimal configuration, and propose cost greedy heuristics for these independent sub-problems based on detailed analyses. Our final contribution is a thorough experimental study, based on real IP traffic data, as well as synthetic data, to demonstrate the effectiveness of our techniques for identifying beneficial configurations. SIGMOD Conference Event processing with an oracle database. Bob Thome,Dieter Gawlick,Maria Pratt 2005 In this paper, we examine how active database technology developed over the past few years has been put to use to solve real world problems. We note how the technology had to be extended beyond the feature set originally identified in early research to meet these real-world needs, and discuss why this technology was best suited to solving these problems. SIGMOD Conference Foundations of probabilistic answers to queries. Dan Suciu,Nilesh N. Dalvi 2005 Overview: Probabilistic query answering is a fundamental set of techniques that underlies several, very recent database applications: exploratory queries in databases, novel IR-style approaches to data integration, querying information extracted from the Web, queries over sensor networks, data acquisition, querying data sources that violate integrity constraints, controlling information disclosure in data exchange, and reasoning about privacy breaches in data mining. This is a surprisingly diverse range of applications, most of which have either emerged recently, or have seen a recent increased interest, and which all share a common fundamental abstraction: that an item being in the answer to a query is no longer a boolean value, but a probabilistic event. It is this author's belief that this is a new paradigm in query answering, whose foundations lie in random graphs, and 0/1-laws in finite model theory. The results from these fields, and their relevance to the probabilistic query answering method, are very little known in the database research community, and the theoretical research papers or books that describe them are not very popular in the systems database research community. SIGMOD Conference Online B-tree Merging. Xiaowei Sun,Rui Wang,Betty Salzberg,Chendong Zou 2005 Many scenarios involve merging of two B-tree indexes, both covering the same key range. Increasing demand for continuous availability and high performance requires that such merging be done online, with minimal interference to normal user transactions. In this paper we present an online B-tree merging method, in which the merging of leaf pages in two B-trees is piggybacked lazily with normal user transactions, thus making the merging I/O efficient and allowing user transactions to access only one index instead of both. The concurrency control mechanism is designed to interfere as little as possible with ongoing user transactions. Merging is made forward recoverable by following a conventional logging protocol, with a few extensions. Should a system failure occur, both indexes being merged can be recovered to a consistent state and no merging work is lost. Experiments and analysis show the I/O savings and the performance, and compare variations on the basic algorithm. SIGMOD Conference CURLER: Finding and Visualizing Nonlinear Correlated Clusters. Anthony K. H.
Tung,Xin Xu,Beng Chin Ooi 2005 While much work has been done in finding linear correlation among subsets of features in high-dimensional data, work on detecting nonlinear correlation has been left largely untouched. In this paper, we present an algorithm for finding and visualizing nonlinear correlation clusters in the subspace of high-dimensional databases.Unlike the detection of linear correlation in which clusters are of unique orientations, finding nonlinear correlation clusters of varying orientations requires merging clusters of possibly very different orientations. Combined with the fact that spatial proximity must be judged based on a subset of features that are not originally known, deciding which clusters to be merged during the clustering process becomes a challenge. To avoid this problem, we propose a novel concept called co-sharing level which captures both spatial proximity and cluster orientation when judging similarity between clusters. Based on this concept, we develop an algorithm which not only detects nonlinear correlation clusters but also provides a way to visualize them. Experiments on both synthetic and real-life datasets are done to show the effectiveness of our method. SIGMOD Conference GraphMiner: a structural pattern-mining system for large disk-based graph databases and its applications. Wei Wang,Chen Wang,Yongtai Zhu,Baile Shi,Jian Pei,Xifeng Yan,Jiawei Han 2005 Mining frequent structural patterns from graph databases is an important research problem with broad applications. Recently, we developed an effective index structure, ADI, and efficient algorithms for mining frequent patterns from large, disk-based graph databases [5], as well as constraint-based mining techniques. The techniques have been integrated into a research prototype system--- GraphMiner. In this paper, we describe a demo of GraphMiner which showcases the technical details of the index structure and the mining algorithms including their efficient implementation, the mining performance and the comparison with some state-of-the-art methods, the constraint-based graph-pattern mining techniques and the procedure of constrained graph mining, as well as mining real data sets in novel applications. SIGMOD Conference DogmatiX Tracks down Duplicates in XML. Melanie Weis,Felix Naumann 2005 Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates.Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach. SIGMOD Conference Subsequence Matching on Structured Time Series Data. Huanmei Wu,Betty Salzberg,Gregory C. Sharp,Steve B. Jiang,Hiroki Shirato,David R. 
Kaeli 2005 Subsequence matching in time series databases is a useful technique, with applications in pattern matching, prediction, and rule discovery. Internal structure within the time series data can be used to improve these tasks, and provide important insight into the problem domain. This paper introduces our research effort in using the internal structure of a time series directly in the matching process. This idea is applied to the problem domain of respiratory motion data in cancer radiation treatment. We propose a comprehensive solution for analysis, clustering, and online prediction of respiratory motion using subsequence similarity matching. In this system, a motion signal is captured in real time as a data stream, and is analyzed immediately for treatment and also saved in a database for future study. A piecewise linear representation of the signal is generated from a finite state model, and is used as a query for subsequence matching. To ensure that the query subsequence is representative, we introduce the concept of subsequence stability, which can be used to dynamically adjust the query subsequence length. To satisfy the special needs of similarity matching over breathing patterns, a new subsequence similarity measure is introduced. This new measure uses a weighted L1 distance function to capture the relative importance of each source stream, amplitude, frequency, and proximity in time. From the subsequence similarity measure, stream and patient similarity can be defined, which are then used for offline and online applications. The matching results are analyzed and applied for motion prediction and correlation discovery. While our system has been customized for use in radiation therapy, our approach to time series modeling is general enough for application domains with structured time series data. SIGMOD Conference On Joining and Caching Stochastic Streams. Junyi Xie,Jun Yang,Yuguo Chen 2005 "We consider the problem of joining data streams using limited cache memory, with the goal of producing as many result tuples as possible from the cache. Many cache replacement heuristics have been proposed in the past. Their performance often relies on implicit assumptions about the input streams, e.g., that the join attribute values follow a relatively stationary distribution. However, in general and in practice, streams often exhibit more complex behaviors, such as increasing trends and random walks, rendering these ""hardwired"" heuristics inadequate.In this paper, we propose a framework that is able to exploit known or observed statistical properties of input streams to make cache replacement decisions aimed at maximizing the expected number of result tuples. To illustrate the complexity of the solution space, we show that even an algorithm that considers, at every time step, all possible sequences of future replacement decisions may not be optimal. We then identify a condition between two candidate tuples under which an optimal algorithm would always choose one tuple over the other to replace. We develop a heuristic that behaves consistently with an optimal algorithm whenever this condition is satisfied. We show through experiments that our heuristic outperforms previous ones.As another evidence of the generality of our framework, we show that the classic caching/paging problem for static objects can be reduced to a stream join problem and analyzed under our framework, yielding results that agree with or extend classic ones." 
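The subsequence-matching abstract above scores breathing-pattern similarity with a weighted L1 distance over several features. A minimal sketch of such a measure follows; the feature names and weights are hypothetical placeholders rather than the paper's calibrated values.

    # Weighted L1 distance between two equal-length sequences of feature vectors.
    def weighted_l1(seq_a, seq_b, weights):
        assert len(seq_a) == len(seq_b)
        return sum(
            w * abs(pa[f] - pb[f])
            for pa, pb in zip(seq_a, seq_b)
            for f, w in weights.items()
        )

    # hypothetical features and weights
    weights = {"amplitude": 0.5, "frequency": 0.3, "time_offset": 0.2}
    a = [{"amplitude": 1.0, "frequency": 0.2, "time_offset": 0.0}]
    b = [{"amplitude": 0.8, "frequency": 0.3, "time_offset": 0.1}]
    print(weighted_l1(a, b, weights))   # 0.15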
SIGMOD Conference Substructure Similarity Search in Graph Databases. Xifeng Yan,Philip S. Yu,Jiawei Han 2005 Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, chem-informatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact matching is often too restrictive, similarity search of complex structures becomes a vital operation that must be supported efficiently.In this paper, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed missing features, our structural filtering algorithm, called Grafil, can filter many graphs without performing pairwise similarity computations. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy for filtering. By examining the effect of different feature selection mechanisms, we develop a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features. We identify the criteria to form effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly. Moreover, the concept presented in Grafil can be applied to searching approximate non-consecutive sequences, trees, and other complicated structures as well. SIGMOD Conference Similarity Evaluation on Tree-structured Data. Rui Yang,Panos Kalnis,Anthony K. H. Tung 2005 Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm reduces dramatically the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets. SIGMOD Conference Modeling and querying multidimensional data sources in Siebel Analytics: a federated relational system. Kazi A. Zaman,Donovan A. Schneider 2005 Large organizations have a multitude of data sources across the enterprise and want to obtain business value from all of them. While the majority of these data sources may be consolidated in an enterprise data warehouse, many business units have their own data marts where analysis is carried out against data stored in multidimensional data structures. It is often critical to pose queries which span both these sources. This is a challenge since these sources have differing models and query languages (SQL vs MDX). 
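The tree-similarity abstract above (Yang, Kalnis and Tung) rests on a vector encoding whose L1 distance lower-bounds the tree edit distance. Assuming such an encoding is given, the filter-and-refine search it enables can be sketched as follows; vectorize and edit_distance are placeholders for the paper's actual procedures, not implementations of them.

    # Filter-and-refine similarity search, assuming l1(vectorize(q), vectorize(t))
    # never exceeds the true tree edit distance between q and t.
    def l1(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))

    def similar_trees(query, trees, tau, vectorize, edit_distance):
        qv = vectorize(query)
        results = []
        for t in trees:
            if l1(qv, vectorize(t)) > tau:      # filter: the lower bound already exceeds tau
                continue
            if edit_distance(query, t) <= tau:  # refine: exact, expensive check
                results.append(t)
        return results

Pruning is safe because a lower bound that already exceeds tau implies the exact edit distance does too; any encoding with the lower-bounding property can be plugged in for vectorize.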
The Siebel Analytics Server enables this requirement to be fulfilled. In this paper, we describe how the multidimensional metadata is modeled relationally within Siebel Analytics, efficient SQL to MDX translation algorithms and the conversion protocols required to convert a multidimensional result into a relational rowset. SIGMOD Conference Mining Periodic Patterns with Gap Requirement from Sequences. Minghua Zhang,Ben Kao,David Wai-Lok Cheung,Kevin Y. Yip 2005 We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem, and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study. SIGMOD Conference TriCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data. Lizhuang Zhao,Mohammed Javeed Zaki 2005 In this paper we introduce a novel algorithm called TRICLUSTER, for mining coherent clusters in three-dimensional (3D) gene expression datasets. TRICLUSTER can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. TRICLUSTER relies on graph-based approach to mine all valid clusters. For each time slice, i.e., a gene×sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of bi-clusters for this time slice. Then TRICLUSTER constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, TRICLUSTER merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that TRICLUSTER can find significant triclusters in the real microarray datasets. SIGMOD Conference Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. Qingbo Zhu,Windsor W. Hsu 2005 As critical records are increasingly stored in electronic form, which tends to make for easy destruction and clandestine modification, it is imperative that they be properly managed to preserve their trustworthiness, i.e., their ability to provide irrefutable proof and accurate details of events that have occurred. The need for proper record keeping is further underscored by the recent corporate misconduct and ensuing attempts to destroy incriminating records. Currently, the industry practice and regulatory requirements (e.g., SEC Rule 17a-4) rely on storing records in WORM storage to immutably preserve the records. 
In this paper, we contend that simply storing records in WORM storage is increasingly inadequate to ensure that they are trustworthy. Specifically, with the large volume of records that are typical today, meeting the ever more stringent query response time requires the use of direct access mechanisms such as indexes. Relying on indexes for accessing records could, however, provide a means for effectively altering or deleting records, even those stored in WORM storage. In this paper, we establish the key requirements for a fossilized index that protects the records from such logical modification. We also analyze current indexing methods to determine how they fall short of these requirements. Based on our insights, we propose the Generalized Hash Tree (GHT). Using both theoretical analysis and simulations with real system data, we demonstrate that the GHT can satisfy the requirements of a fossilized index with performance and cost that are comparable to regular indexing techniques such as the B-tree. We further note that as records are indexed on multiple fields to facilitate search and retrieval, the records can be reconstructed from the corresponding index entries even after the records expire and are disposed of. Therefore, we also present a novel method to eliminate this disclosure risk by allowing an index entry to be effectively disposed of when its record expires. SIGMOD Conference Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005 Fatma Özcan 2005 Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005 VLDB XML Full-Text Search: Challenges and Opportunities. Sihem Amer-Yahia,Jayavel Shanmugasundaram 2005 An ever-growing number of XML repositories are being made available for search. Much effort has been devoted in the past few years to querying such repositories. In particular, full-text querying of text-rich XML documents has generated a wealth of issues that are being addressed by both the database (DB) and information retrieval (IR) communities. The DB community has traditionally focused on developing query languages and efficient evaluation algorithms for highly structured data. In contrast, the IR community has focused on searching unstructured data, and has developed various techniques for ranking query results and evaluating their effectiveness. Fortunately, recent trends in DB and IR research demonstrate a growing interest in adopting IR techniques in DBs and vice versa [1, 2, 3, 4, 5, 6, 7, 9]. VLDB ULoad: Choosing the Right Storage for Your XML Application. Andrei Arion,Véronique Benzaken,Ioana Manolescu,Ravi Vijay 2005 A key factor for the outstanding success of database management systems is physical data independence: queries, and application programs, are able to refer to the data at the logical level, ignoring the details on how the data is physically stored and accessed by the system. The cornerstone of implementing physical data independence is an access path selection algorithm: whenever a disk-resident data item can be accessed in several ways, the access path selection algorithm, which is part of the query optimizer, will identify the possible alternatives, and choose the one likely to provide the best performance for a given query [13]. VLDB REED: Robust, Efficient Filtering and Event Detection in Sensor Networks. Daniel J.
Abadi,Samuel Madden,Wolfgang Lindner 2005 This paper presents a set of algorithms for efficiently evaluating join queries over static data tables in sensor networks. We describe and evaluate three algorithms that take advantage of distributed join techniques. Our algorithms are capable of running in limited amounts of RAM, can distribute the storage burden over groups of nodes, and are tolerant to dropped packets and node failures. REED is thus suitable for a wide range of event-detection applications that traditional sensor network database and data collection systems cannot be used to implement. VLDB Semantic Overlay Networks. Karl Aberer,Philippe Cudré-Mauroux 2005 In a handful of years only, Peer-to-Peer (P2P) systems have become an integral part of the Internet. After a few key successes related to music-sharing (e.g., Napster or Gnutella), they rapidly developed and are nowadays firmly established in various contexts, ranging from large-scale content distribution (Bit Torrent) to Internet telephony(Skype) or networking platforms (JXTA). The main idea behind P2P is to leverage on the power of end-computers: Instead of relying on central components (e.g., servers), services are powered by decentralized overlay architectures where end-computers connect to each other dynamically. VLDB Indexing Data-oriented Overlay Networks. Karl Aberer,Anwitaman Datta,Manfred Hauswirth,Roman Schmidt 2005 The application of structured overlay networks to implement index structures for data-oriented applications such as peer-to-peer databases or peer-to-peer information retrieval, requires highly efficient approaches for overlay construction, as changing application requirements frequently lead to re-indexing of the data and hence (re)construction of overlay networks. This problem has so far not been addressed in the literature and thus we describe an approach for the efficient construction of data-oriented, structured overlay networks from scratch in a self-organized way. Standard maintenance algorithms for overlay networks cannot accomplish this efficiently, as they are inherently sequential. Our proposed algorithm is completely decentralized, parallel, and can construct a new overlay network with short latency. At the same time it ensures good load-balancing for skewed data key distributions which result from preserving key order relationships as necessitated by data-oriented applications. We provide both a theoretical analysis of the basic algorithms and a complete system implementation that has been tested on PlanetLab. We use this implementation to support peer-to-peer information retrieval and database applications. VLDB Fine-Grained Replication and Scheduling with Freshness and Correctness Guarantees. Fuat Akal,Can Türker,Hans-Jörg Schek,Yuri Breitbart,Torsten Grabs,Lourens Veen 2005 Lazy replication protocols provide good scalability properties by decoupling transaction execution from the propagation of new values to replica sites while guaranteeing a correct and more efficient transaction processing and replica maintenance. However, they impose several restrictions that are often not valid in practical database settings, e.g., they require that each transaction executes at its initiation site and/or are restricted to full replication schemes. Also, the protocols cannot guarantee that the transactions will always see the freshest available replicas. This paper presents a new lazy replication protocol called PDBREP that is free of these restrictions while ensuring one-copy-serializable executions. 
The protocol exploits the distinction between read-only and update transactions and works with arbitrary physical data organizations such as partitioning and striping as well as different replica granularities. It does not require that each read-only transaction executes entirely at its initiation site. Hence, each read-only site need not contain a fully replicated database. PDBREP moreover generalizes the notion of freshness to finer data granules than entire databases. VLDB On k-Anonymity and the Curse of Dimensionality. Charu C. Aggarwal 2005 In recent years, the wide availability of personal data has made the problem of privacy preserving data mining an important one. A number of methods have recently been proposed for privacy preserving data mining of multidimensional data records. One of the methods for privacy preserving data mining is that of anonymization, in which a record is released only if it is indistinguishable from k other entities in the data. We note that methods such as k-anonymity are highly dependent upon spatial locality in order to effectively implement the technique in a statistically robust way. In high dimensional space the data becomes sparse, and the concept of spatial locality is no longer easy to define from an application point of view. In this paper, we view the k-anonymization problem from the perspective of inference attacks over all possible combinations of attributes. We show that when the data contains a large number of attributes which may be considered quasi-identifiers, it becomes difficult to anonymize the data without an unacceptably high amount of information loss. This is because an exponential number of combinations of dimensions can be used to make precise inference attacks, even when individual attributes are partially specified within a range. We provide an analysis of the effect of dimensionality on k-anonymity methods. We conclude that when a data set contains a large number of attributes which are open to inference attacks, we are faced with a choice of either completely suppressing most of the data or losing the desired level of anonymity. Thus, this paper shows that the curse of high dimensionality also applies to the problem of privacy preserving data mining. VLDB NILE-PDT: A Phenomenon Detection and Tracking Framework for Data Stream Management Systems. Mohamed H. Ali,Walid G. Aref,Raja Bose,Ahmed K. Elmagarmid,Abdelsalam Helal,Ibrahim Kamel,Mohamed F. Mokbel 2005 In this demo, we present Nile-PDT, a Phenomenon Detection and Tracking framework using the Nile data stream management system. A phenomenon is characterized by a group of streams showing similar behavior over a period of time. The functionalities of Nile-PDT is split between the Nile server and the Nile-PDT application client. At the server side, Nile detects phenomenon candidate members and tracks their propagation incrementally through specific sensor network operators. Phenomenon candidate members are processed at the client side to detect phenomena of interest to a particular application. Nile-PDT is scalable in the number of sensors, the sensor data rates, and the number of phenomena. Guided by the detected phenomena, Nile-PDT tunes query processing towards sensors that heavily affect the monitoring of phenomenon propagation. VLDB Interactive Schema Translation with Instance-Level Mappings. Philip A. Bernstein,Sergey Melnik,Peter Mork 2005 We demonstrate a prototype that translates schemas from a source metamodel (e.g., OO, relational, XML) to a target metamodel. 
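The k-anonymity condition discussed in the Aggarwal abstract above requires every combination of quasi-identifier values to be shared by at least k records; a minimal check of that condition is sketched below with hypothetical attribute names. As the abstract argues, once many attributes act as quasi-identifiers these groups shrink rapidly, so the check fails unless most of the data is suppressed or generalized.

    from collections import Counter

    # Minimal sketch of the k-anonymity condition itself: every quasi-identifier
    # combination must occur in at least k records.  Attribute names are made up.
    def is_k_anonymous(records, quasi_identifiers, k):
        groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    records = [
        {"age": 34, "zip": "53705", "diagnosis": "flu"},
        {"age": 34, "zip": "53705", "diagnosis": "cold"},
        {"age": 51, "zip": "10027", "diagnosis": "flu"},
    ]
    print(is_k_anonymous(records, ["age", "zip"], k=2))   # False: the (51, 10027) group has one record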
The prototype is integrated with Microsoft Visual Studio 2005 to generate relational schemas from an object-oriented design. It has four novel features. First, it produces instance mappings to round-trip the data between the source schema and the generated target schema. It compiles the instance mappings into SQL views to reassemble the objects stored in relational tables. Second, it offers interactive editing, i.e., incremental modifications of the source schema yield incremental modifications of the target schema. Third, it incorporates a novel mechanism for mapping inheritance hierarchies to relations, which supports all known strategies and their combinations. Fourth, it is integrated with a commercial product featuring a high-quality user interface. The schema translation process is driven by high-level rules that eliminate constructs that are absent from the target metamodel. VLDB Personalizing XML Text Search in PimenT. Sihem Amer-Yahia,Irini Fundulaki,Prateek Jain,Laks V. S. Lakshmanan 2005 "A growing number of text-rich XML repositories are being made available. As a result, more efforts have been deployed to provide XML full-text search that combines querying structure with complex conditions on text ranging from simple keyword search to sophisticated proximity search composed with stemming and thesaurus. However, one of the key challenges in full-text search is to match users' expectations and determine the most relevant answers to a full-text query. In this context, we propose query personalization as a way to take user profiles into account in order to customize query answers based on individual users' needs.We present PIMENT, a system that enables query personalization by query rewriting and answer ranking. PIMENT is composed of a profile repository that stores user profiles, a query customizer that rewrites user queries based on user profiles and, a ranking module to rank query answers." VLDB Structure and Content Scoring for XML. Sihem Amer-Yahia,Nick Koudas,Amélie Marian,Divesh Srivastava,David Toman 2005 XML repositories are usually queried both on structure and content. Due to structural heterogeneity of XML, queries are often interpreted approximately and their answers are returned ranked by scores. Computing answer scores in XML is an active area of research that oscillates between pure content scoring such as the well-known tf*idf and taking structure into account. However, none of the existing proposals fully accounts for structure and combines it with content to score query answers. We propose novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations. Twig scoring, accounts for the most structure and content and is thus used as our reference method. Path scoring is an approximation that loosens correlations between query nodes hence reducing the amount of time required to manipulate scores during top-k query processing. We propose efficient data structures in order to speed up ranked query processing. We run extensive experiments that validate our scoring methods and that show that path scoring provides very high precision while improving score computation time. VLDB Approximate Matching of Hierarchical Data Using pq-Grams. Nikolaus Augsten,Michael H. Böhlen,Johann Gamper 2005 When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. 
Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ.We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach. VLDB Designing Information-Preserving Mapping Schemes for XML. Denilson Barbosa,Juliana Freire,Alberto O. Mendelzon 2005 An XML-to-relational mapping scheme consists of a procedure for shredding documents into relational databases, a procedure for publishing databases back as documents, and a set of constraints the databases must satisfy. In previous work, we defined two notions of information preservation for mapping schemes: losslessness, which guarantees that any document can be reconstructed from its corresponding database; and validation, which requires every legal database to correspond to a valid document. We also described one information-preserving mapping scheme, called Edge++, and showed that, under reasonable assumptions, losslessness and validation are both undecidable. This leads to the question we study in this paper: how to design mapping schemes that are information-preserving. We propose to do it by starting with a scheme known to be information-preserving and applying to it equivalence-preserving transformations written in weakly recursive ILOG. We study an instance of this framework, the LILO algorithm, and show that it provides significant performance improvements over Edge++ and introduces constraints that are efficiently enforced in practice. VLDB Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods. Attila Barta,Mariano P. Consens,Alberto O. Mendelzon 2005 We compare several optimization strategies implemented in an XML query evaluation system. The strategies incorporate the use of path summaries into the query optimizer, and rely on heuristics that exploit data statistics.We present experimental results that demonstrate a wide range of performance improvements for the different strategies supported. In addition, we compare the speedups obtained using path summaries with those reported for index-based methods. The comparison shows that low-cost path summaries combined with optimization strategies achieve essentially the same benefits as more expensive index structures. VLDB Automatic Data Fusion with HumMer. Alexander Bilke,Jens Bleiholder,Christoph Böhm,Karsten Draba,Felix Naumann,Melanie Weis 2005 Automatic Data Fusion with HumMer. VLDB Database Publication Practices. Philip A. Bernstein,David J. DeWitt,Andreas Heuer,Zachary G. Ives,Christian S. Jensen,Holger Meyer,M. Tamer Özsu,Richard T. Snodgrass,Kyu-Young Whang,Jennifer Widom 2005 There has been a growing interest in improving the publication processes for database research papers. 
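For the XML-to-relational mapping schemes discussed above, the classic starting point is an edge-table shredding of the document. The sketch below shows only that generic scheme; it is not the Edge++ or LILO mapping itself, and the sample document is invented.

    import xml.etree.ElementTree as ET

    # Generic edge-table shredding: one row (parent_id, ordinal, tag, child_id, text)
    # per parent-child edge of the document tree.
    def shred(xml_text):
        rows = []
        next_id = [0]

        def new_id():
            next_id[0] += 1
            return next_id[0]

        def walk(node, node_id):
            for ordinal, child in enumerate(node):
                child_id = new_id()
                rows.append((node_id, ordinal, child.tag, child_id,
                             (child.text or "").strip()))
                walk(child, child_id)

        walk(ET.fromstring(xml_text), new_id())
        return rows

    doc = "<dept><emp><name>Ann</name></emp><emp><name>Bo</name></emp></dept>"
    for row in shred(doc):
        print(row)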
This panel reports on recent changes in those processes and presents an initial cut at historical data for the VLDB Journal and ACM Transactions on Database Systems. VLDB MINERVA: Collaborative P2P Search. Matthias Bender,Sebastian Michel,Peter Triantafillou,Gerhard Weikum,Christian Zimmer 2005 This paper proposes the live demonstration of a prototype of MINERVA, a novel P2P Web search engine. The search engine is layered on top of a DHT-based overlay network that connects an a-priori unlimited number of peers, each of which maintains a personal local database and a local search facility. Each peer posts a small amount of metadata to a physically distributed directory that is used to efficiently select promising peers from across the peer population that can best locally execute a query. The proposed demonstration serves as a proof of concept for P2P Web search by deploying the project on standard notebook PCs and also invites everybody to join the network by instantly installing a small piece of software from a USB memory stick. VLDB Content-Based Routing: Different Plans for Different Data. Pedro Bizarro,Shivnath Babu,David J. DeWitt,Jennifer Widom 2005 Query optimizers in current database systems are designed to pick a single efficient plan for a given query based on current statistical properties of the data. However, different subsets of the data can sometimes have very different statistical properties. In such scenarios it can be more efficient to process different subsets of the data for a query using different plans. We propose a new query processing technique called content-based routing (CBR) that eliminates the single-plan restriction in current systems. We present low-overhead adaptive algorithms that partition input data based on statistical properties relevant to query execution strategies, and efficiently route individual tuples through customized plans based on their partition. We have implemented CBR as an extension to the Eddies query processor in the TelegraphCQ system, and we present an extensive experimental evaluation showing the significant performance benefits of CBR. VLDB Automatic Composition of Transition-based Semantic Web Services with Messaging. Daniela Berardi,Diego Calvanese,Giuseppe De Giacomo,Richard Hull,Massimo Mecella 2005 "In this paper we present Colombo, a framework in which web services are characterized in terms of (i) the atomic processes (i.e., operations) they can perform; (ii) their impact on the ""real world"" (modeled as a relational database); (iii) their transition-based behavior; and (iv) the messages they can send and receive (from/to other web services and ""human"" clients). As such, Colombo combines key elements from the standards and research literature on (semantic) web services. Using Colombo, we study the problem of automatic service composition (synthesis) and devise a sound, complete and terminating algorithm for building a composite service. Specifically, the paper develops (i) a technique for handling the data, which ranges over an infinite domain, in a finite, symbolic way, and (ii) a technique to automatically synthesize composite web services, based on Propositional Dynamic Logic." VLDB Efficient Evaluation of XQuery over Streaming Data. Xiaogang Li,Gagan Agrawal 2005 With the growing popularity of XML and emergence of streaming data model, processing queries over streaming XML has become an important topic. This paper presents a new framework and a set of techniques for processing XQuery over streaming data. 
As compared to the existing work on supporting XPath/XQuery over data streams, we make the following three contributions:1. We propose a series of optimizations which transform XQuery queries so that they can be correctly executed with a single pass on the dataset.2. We present a methodology for determining when an XQuery query, possibly after the transformations we introduce, can be correctly executed with only a single pass on the dataset.3. We describe a code generation approach which can handle XQuery queries with user-defined aggregates, including recursive functions. We aggressively use static analysis and generate executable code, i.e., do not require a query plan to be interpreted at runtime.We have evaluated our implementation using several XMark benchmarks and three other XQuery queries driven by real applications. Our experimental results show that as compared to Qizx/Open, Saxon, and Galax, our system: 1) is at least 25% faster on XMark queries with small datasets, 2) is significantly faster on XMark queries with larger datasets, 3) at least one order of magnitude faster on the queries driven by real applications, as unlike other systems, we can transform them to execute with a single pass, and 4) executes queries efficiently on large datasets when other systems often have memory overflows. VLDB Information Preserving XML Schema Embedding. Philip Bohannon,Wenfei Fan,Michael Flaster,P. P. S. Narayan 2005 A fundamental concern of information integration in an XML context is the ability to embed one or more source documents in a target document so that (a) the target document conforms to a target schema and (b) the information in the source document(s) is preserved. In this paper, information preservation for XML is formally studied, and the results of this study guide the definition of a novel notion of schema embedding between two XML DTD schemas represented as graphs. Schema embedding generalizes the conventional notion of graph similarity by allowing an edge in a source DTD schema to be mapped to a path in the target DTD. Instance-level embeddings can be defined from the schema embedding in a straightforward manner, such that conformance to a target schema and information preservation are guaranteed. We show that it is NP-complete to find an embedding between two DTD schemas. We also provide efficient heuristic algorithms to find candidate embeddings, along with experimental results to evaluate and compare the algorithms. These yield the first systematic and effective approach to finding information preserving XML mappings. VLDB Pathfinder: XQuery - The Relational Way. Peter A. Boncz,Torsten Grust,Maurice van Keulen,Stefan Manegold,Jan Rittinger,Jens Teubner 2005 "Relational query processors are probably the best understood (as well as the best engineered) query engines available today. Although carefully tuned to process instances of the relational model (tables of tuples), these processors can also provide a foundation for the evaluation of ""alien"" (non-relational) query languages: if a relational encoding of the alien data model and its associated query language is given, the RDBMS may act like a special-purpose processor for the new language." VLDB HePToX: Marrying XML and Heterogeneity in Your P2P Databases. Angela Bonifati,Elaine Qing Chang,Terence Ho,Laks V. S. Lakshmanan,Rachel Pottinger 2005 HePToX: Marrying XML and Heterogeneity in Your P2P Databases. VLDB On Map-Matching Vehicle Tracking Data. 
Sotiris Brakatsoulas,Dieter Pfoser,Randall Salas,Carola Wenk 2005 "Vehicle tracking data is an essential ""raw"" material for a broad range of applications such as traffic management and control, routing, and navigation. An important issue with this data is its accuracy. The method of sampling vehicular movement using GPS is affected by two error sources and consequently produces inaccurate trajectory data. To become useful, the data has to be related to the underlying road network by means of map matching algorithms. We present three such algorithms that consider especially the trajectory nature of the data rather than simply the current position as in the typical map-matching case. An incremental algorithm is proposed that matches consecutive portions of the trajectory to the road network, effectively trading accuracy for speed of computation. In contrast, the two global algorithms compare the entire trajectory to candidate paths in the road network. The algorithms are evaluated in terms of (i) their running time and (ii) the quality of their matching result. Two novel quality measures utilizing the Fréchet distance are introduced and subsequently used in an experimental evaluation to assess the quality of matching real tracking data to a road network." VLDB OLAP Over Uncertain and Imprecise Data. Douglas Burdick,Prasad Deshpande,T. S. Jayram,Raghu Ramakrishnan,Shivakumar Vaithyanathan 2005 We extend the OLAP data model to represent data ambiguity, specifically imprecision and uncertainty, and introduce an allocation-based approach to the semantics of aggregation queries over such data. We identify three natural query properties and use them to shed light on alternative query semantics. While there is much work on representing and querying ambiguous data, to our knowledge this is the first paper to handle both imprecision and uncertainty in an OLAP setting. VLDB PSYCHO: A Prototype System for Pattern Management. Barbara Catania,Anna Maddalena,Maurizio Mazza 2005 Patterns represent, in a compact and semantically rich way, huge quantities of heterogeneous data. Due to their characteristics, specific systems are required for pattern management, in order to model and manipulate patterns, with a possibly user-defined structure, in an efficient and effective way. In this demonstration we present PSYCHO, a pattern-based management system prototype. PSYCHO allows the user to: (i) use standard pattern types or define new ones; (ii) generate or import patterns, represented according to existing standards; (iii) manipulate possibly heterogeneous patterns under an integrated environment. VLDB Flexible Database Generators. Nicolas Bruno,Surajit Chaudhuri 2005 Evaluation and applicability of many database techniques, ranging from access methods, histograms, and optimization strategies to data normalization and mining, crucially depend on their ability to cope with varying data distributions in a robust way. However, comprehensive real data is often hard to come by, and there is no flexible data generation framework capable of modelling varying rich data distributions. This has led individual researchers to develop their own ad-hoc data generators for specific tasks. As a consequence, the resulting data distributions and query workloads are often hard to reproduce, analyze, and modify, thus preventing their wider usage. In this paper we present a flexible, easy to use, and scalable framework for database generation.
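The map-matching abstract above evaluates matching quality with Fréchet-distance-based measures. As a rough, discretized stand-in, the standard discrete Fréchet distance between two polylines can be computed by dynamic programming; this is a sketch, not that paper's exact continuous measure, and the sample coordinates are made up.

    from math import hypot

    # Discrete Fréchet distance between two polylines P and Q (lists of (x, y) points),
    # using the standard dynamic-programming recurrence.
    def discrete_frechet(P, Q):
        n, m = len(P), len(Q)
        d = lambda i, j: hypot(P[i][0] - Q[j][0], P[i][1] - Q[j][1])
        ca = [[0.0] * m for _ in range(n)]
        ca[0][0] = d(0, 0)
        for i in range(1, n):
            ca[i][0] = max(ca[i - 1][0], d(i, 0))
        for j in range(1, m):
            ca[0][j] = max(ca[0][j - 1], d(0, j))
        for i in range(1, n):
            for j in range(1, m):
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d(i, j))
        return ca[n - 1][m - 1]

    trajectory = [(0, 0), (1, 0.2), (2, 0.1)]
    road_path  = [(0, 0), (1, 0),   (2, 0)]
    print(discrete_frechet(trajectory, road_path))   # 0.2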
We then discuss how to map several proposed synthetic distributions to our framework and report preliminary results. VLDB MDL Summarization with Holes. Shaofeng Bu,Laks V. S. Lakshmanan,Raymond T. Ng 2005 Summarization of query results is an important problem for many OLAP applications. The Minimum Description Length principle has been applied in various studies to provide summaries. In this paper, we consider a new approach of applying the MDL principle. We study the problem of finding summaries of the form S Θ H for k-d cubes with tree hierarchies. The S part generalizes the query results, while the H part describes all the exceptions to the generalizations. The optimization problem is to minimize the combined cardinalities of S and H. We first characterize the problem by showing that solving the 1-d problem can be done in time linear to the size of hierarchy, but solving the 2-d problem is NP-hard. We then develop three different heuristics, based on a greedy approach, a dynamic programming approach and a quadratic programming approach. We conduct a comprehensive experimental evaluation. Both the dynamic programming algorithm and the greedy algorithm can be used for different circumstances. Both produce summaries that are significantly shorter than those generated by state-of-the-art alternatives. VLDB Loadstar: Load Shedding in Data Stream Mining. Yun Chi,Haixun Wang,Philip S. Yu 2005 In this demo, we show that intelligent load shedding is essential in achieving optimum results in mining data streams under various resource constraints. The Loadstar system introduces load shedding techniques to classifying multiple data streams of large volume and high speed. Loadstar uses a novel metric known as the quality of decision (QoD) to measure the level of uncertainty in classification. Resources are then allocated to sources where uncertainty is high. To make optimum classification decisions and accurate QoD measurement, Loadstar relies on feature prediction to model the data dropped by the load shedding mechanism. Furthermore, Loadstar is able to adapt to the changing data characteristics in data streams. The system thus offers a nice solution to data mining with resource constraints. VLDB Adaptive Stream Filters for Entity-based Queries with Non-Value Tolerance. Reynold Cheng,Ben Kao,Sunil Prabhakar,Alan Kwan,Yi-Cheng Tu 2005 We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works focus on value-based queries (e.g., average) with numerical error tolerance. In this paper, we cover entity-based queries (e.g., nearest neighbor) with non-value-based error tolerance. We investigate different non-value-based error tolerance definitions and discuss how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries. Extensive experiments show that our protocols achieve significant savings in both communication overhead and server computation. VLDB An Efficient and Scalable Approach to CNN Queries in a Road Network. Hyung-Ju Cho,Chin-Wan Chung 2005 "A continuous search in a road network retrieves the objects which satisfy a query condition at any point on a path. For example, return the three nearest restaurants from all locations on my route from point s to point e. In this paper, we deal with NN queries as well as continuous NN queries in the context of moving objects databases. 
The performance of existing approaches based on the network distance such as the shortest path length depends largely on the density of objects of interest. To overcome this problem, we propose UNICONS (a unique continuous search algorithm) for NN queries and CNN queries performed on a network. We incorporate the use of precomputed NN lists into Dijkstra's algorithm for NN queries. A mathematical rationale is employed to produce the final results of CNN queries. Experimental results for real-life datasets of various sizes show that UNICONS outperforms its competitors by up to 3.5 times for NN queries and 5 times for CNN queries depending on the density of objects and the number of NNs required." VLDB MIX: A Meta-data Indexing System for XML. SungRan Cho,Nick Koudas,Divesh Srivastava 2005 We present a system for efficient meta-data indexed querying of XML documents. Given the diversity of the information available in XML, it is very useful to annotate XML data with a wide variety of meta-data, such as quality and security assessments. We address the meta-data indexing problem of efficiently identifying the XML elements along a location step in an XPath query, that satisfy meta-data range constraints. Our system, named MIX, incorporates query processing on all XPath axes suitably enhanced with meta-data features offering not only query answering but also dynamic maintenance of meta-data levels for XML documents. VLDB An Efficient SQL-based RDF Querying Scheme. Eugene Inseok Chong,Souripriya Das,George Eadon,Jagannathan Srinivasan 2005 "Devising a scheme for efficient and scalable querying of Resource Description Framework (RDF) data has been an active area of current research. However, most approaches define new languages for querying RDF data, which has the following shortcomings: 1) They are difficult to integrate with SQL queries used in database applications, and 2) They incur inefficiency as data has to be transformed from SQL to the corresponding language data format. This paper proposes a SQL based scheme that avoids these problems. Specifically, it introduces a SQL table function RDF_MATCH to query RDF data. The results of RDF_MATCH table function can be further processed by SQL's rich querying capabilities and seamlessly combined with queries on traditional relational data. Furthermore, the RDF_MATCH table function invocation is rewritten as a SQL query, thereby avoiding run-time table function procedural overheads. It also enables optimization of rewritten query in conjunction with the rest of the query. The resulting query is executed efficiently by making use of B-tree indexes as well as specialized subject-property materialized views. This paper describes the functionality of the RDF_MATCH table function for querying RDF data, which can optionally include user-defined rulebases, and discusses its implementation in Oracle RDBMS. It also presents an experimental study characterizing the overhead eliminated by avoiding procedural code at runtime, characterizing performance under various input conditions, and demonstrating scalability using 80 million RDF triples from UniProt protein and annotation data." VLDB U-DBMS: A Database System for Managing Constantly-Evolving Data. Reynold Cheng,Sarvjeet Singh,Sunil Prabhakar 2005 In many systems, sensors are used to acquire information from external environments such as temperature, pressure and locations. 
Due to continuous changes in these values, and limited resources (e.g., network bandwidth and battery power), it is often infeasible for the database to store the exact values at all times. Queries that use these old values can produce invalid results. In order to manage the uncertainty between the actual sensor value and the database value, we propose a system called U-DBMS. U-DBMS extends the database system with uncertainty management functionalities. In particular, each data value is represented as an interval and a probability distribution function, and it can be processed with probabilistic query operators to produce imprecise (but correct) answers. This demonstration presents a PostgreSQL-based system that handles uncertainty and probabilistic queries for constantly-evolving data. VLDB Inspector Joins. Shimin Chen,Anastassia Ailamaki,Phillip B. Gibbons,Todd C. Mowry 2005 The key idea behind Inspector Joins is that during the I/O partitioning phase of a hash-based join, we have the opportunity to look at the actual data itself and then use this knowledge in two ways: (1) to create specialized indexes, specific to the given query on the given data, for optimizing the CPU cache performance of the subsequent join phase of the algorithm, and (2) to decide which join phase algorithm best suits this specific query. We show how inspector joins, employing novel statistics and specialized indexes, match or exceed the performance of state-of-the-art cache-friendly hash join algorithms. For example, when run on eight or more processors, our experiments show that inspector joins offer 1.1-1.4X speedups over these previous algorithms, with the speedup increasing as the number of processors increases. VLDB Optimistic Intra-Transaction Parallelism on Chip Multiprocessors. Christopher B. Colohan,Anastassia Ailamaki,J. Gregory Steffan,Todd C. Mowry 2005 With the advent of chip multiprocessors, exploiting intra-transaction parallelism is an attractive way of improving transaction performance. However, exploiting intra-transaction parallelism in existing database systems is difficult, for two reasons: first, significant changes are required to avoid races or conflicts within the DBMS, and second, adding threads to transactions requires a high level of sophistication from transaction programmers. In this paper we show how dividing a transaction into speculative threads solves both problems---it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer. Our technique requires a limited number of small, localized changes to a subset of the low-level data structures in the DBMS. Through this method of parallelizing transactions we can dramatically improve performance: on a simulated 4-processor chip-multiprocessor, we improve the response time by 36-74% for three of the five TPC-C transactions. VLDB Prediction Cubes. Bee-Chung Chen,Lei Chen,Yi Lin,Raghu Ramakrishnan 2005 In this paper, we introduce a new family of tools for exploratory data analysis, called prediction cubes. As in standard OLAP data cubes, each cell in a prediction cube contains a value that summarizes the data belonging to that cell, and the granularity of cells can be changed via operations such as roll-up and drill-down.
In contrast to data cubes, in which each cell value is computed by an aggregate function, e.g., SUM or AVG, each cell value in a prediction cube summarizes a predictive model trained on the data corresponding to that cell, and characterizes its decision behavior or predictiveness. In this paper, we propose and motivate prediction cubes, and show that they can be efficiently computed by exploiting the idea of model decomposition. VLDB Stack-based Algorithms for Pattern Matching on DAGs. Li Chen,Amarnath Gupta,M. Erdem Kurul 2005 Existing work for query processing over graph data models often relies on pre-computing the transitive closure or path indexes. In this paper, we propose a family of stack-based algorithms to handle path, twig, and dag pattern queries for directed acyclic graphs (DAGs) in particular. Our algorithms do not precompute the transitive closure nor path indexes for a given graph, however they achieve an optimal runtime complexity quadratic in the average size of the query variable bindings. We prove the soundness and completeness of our algorithms and present the experimental results. VLDB Sketching Streams Through the Net: Distributed Approximate Query Tracking. Graham Cormode,Minos N. Garofalakis 2005 Emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, and the communication cost across the network. They rely on tracking general-purpose randomized sketch summaries of local streams at remote sites along with concise prediction models of local site behavior in order to produce highly communication- and space/time-efficient solutions. The result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. VLDB Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. Graham Cormode,S. Muthukrishnan,Irina Rozenbaum 2005 "Emerging data stream management systems approach the challenge of massive data distributions which arrive at high speeds while there is only small storage by summarizing and mining the distributions using samples or sketches. However, data distributions can be ""viewed"" in different ways. A data stream of integer values can be viewed either as the forward distribution f (x), ie., the number of occurrences of x in the stream, or as its inverse, f-1 (i), which is the number of items that appear i times. While both such ""views"" are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. 
In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet, many applications such as IP traffic monitoring naturally rely on mining inverse distributions. We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs. the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees which can be used for a variety of summarization tasks (building quantiles or equidepth histograms) and mining (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on quality of approximations and time/space used by our streaming methods. We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams." VLDB Rewriting XPath Queries Using Materialized Views. Wanhong Xu,Z. Meral Özsoyoglu 2005 Rewriting XPath Queries Using Materialized Views. VLDB Answering Queries from Statistics and Probabilistic Views. Nilesh N. Dalvi,Dan Suciu 2005 Systems integrating dozens of databases, in the scientific domain or in a large corporation, need to cope with a wide variety of imprecisions, such as: different representations of the same object in different sources; imperfect and noisy schema alignments; contradictory information across sources; constraint violations; or insufficient evidence to answer a given query. If standard query semantics were applied to such data, all but the most trivial queries would return an empty answer. VLDB Bridging the Gap between OLAP and SQL. Jens-Peter Dittrich,Donald Kossmann,Alexander Kreutz 2005 In the last ten years, database vendors have invested heavily in order to extend their products with new features for decision support. Examples of functionality that has been added are top N [2], ranking [13, 7], spreadsheet computations [19], grouping sets [14], data cube [9], and moving sums [15] in order to name just a few. Unfortunately, many modern OLAP systems do not use that functionality or replicate a great deal of it in addition to other database-related functionality. In fact, the gap between the functionality provided by an OLAP system and the functionality used from the underlying database systems has widened in the past, rather than narrowed. The reasons for this trend are that SQL as a data definition and query language, the relational model, and the client/server architecture of the current generation of database products have fundamental shortcomings for OLAP. This paper lists these deficiencies and presents the BTell OLAP engine as an example of how to bridge these shortcomings. In addition, we discuss how to extend current DBMS to better support OLAP in the future. VLDB iMeMex: Escapes from the Personal Information Jungle. Jens-Peter Dittrich,Marcos Antonio Vaz Salles,Donald Kossmann,Lukas Blunschi 2005 Modern computer work stations provide thousands of applications that store data in more than 100,000 files on the file system of the underlying OS. To handle these files, data processing logic is reinvented inside each application. This results in a jungle of data processing solutions and a jungle of data and file formats. For a user, it is extremely hard to manage information in this jungle.
Most of all, it is impossible to use data distributed among different files and formats for combined queries, e.g., join and union operations. To solve the problems arising from file-based data management, we present a software system called iMeMex as a unified solution to personal information management and integration. iMeMex is designed to integrate seamlessly into existing operating systems like Windows, Linux and Mac OS X. Our system enables existing applications to gradually dispose of file-based storage. By using iMeMex, modern operating systems are enabled to make use of sophisticated DBMS, IR and data integration technologies. The seamless integration of iMeMex into existing operating systems enables new applications that provide concepts of data storage and analysis unseen before. VLDB Efficiently Processing Queries on Interval-and-Value Tuples in Relational Databases. Jost Enderle,Nicole Schneider,Thomas Seidl 2005 With the increasing occurrence of temporal and spatial data in present-day database applications, the interval data type is adopted by more and more database systems. For an efficient support of queries that contain selections on interval attributes as well as simple-valued attributes (e.g. numbers, strings) at the same time, special index structures are required supporting both types of predicates in combination. Based on the Relational Interval Tree, we present various indexing schemes that support such combined queries and can be integrated in relational database systems with minimum effort. Experiments on different query types show superior performance for the new techniques in comparison to competing access methods. VLDB The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS. Vuk Ercegovac,David J. DeWitt,Raghu Ramakrishnan 2005 "We introduce a benchmark called TEXTURE (TEXT Under RElations) to measure the relative strengths and weaknesses of combining text processing with a relational workload in an RDBMS. While the well-known TREC benchmarks focus on quality, we focus on efficiency. TEXTURE is a micro-benchmark for query workloads, and considers two central text support issues that previous benchmarks did not: (1) queries with relevance ranking, rather than those that just compute all answers, and (2) a richer mix of text and relational processing, reflecting the trend toward seamless integration. In developing this benchmark, we had to address the problem of generating large text collections that reflected the (performance) characteristics of a given ""seed"" collection; this is essential for a controlled study of specific data characteristics and their effects on performance. In addition to presenting the benchmark, with performance numbers for three commercial DBMSs, we present and validate a synthetic generator for populating text fields." VLDB Efficient Implementation of Large-Scale Multi-Structural Databases. Ronald Fagin,Phokion G. Kolaitis,Ravi Kumar,Jasmine Novak,D. Sivakumar,Andrew Tomkins 2005 Efficient Implementation of Large-Scale Multi-Structural Databases. VLDB Query Translation from XPath to SQL in the Presence of Recursive DTDs. Wenfei Fan,Jeffrey Xu Yu,Hongjun Lu,Jianhua Lu,Rajeev Rastogi 2005 The interaction between recursion in XPATH and recursion in DTDs makes it challenging to answer XPATH queries on XML data that is stored in an RDBMS via schema-based shredding.
We present a new approach to translating XPATH queries into SQL queries with a simple least fixpoint (LFP) operator, which is already supported by most commercial RDBMS. The approach is based on our algorithm for rewriting XPATH queries into regular XPATH expressions, which are capable of capturing both DTD recursion and XPATH queries in a uniform framework. Furthermore, we provide an algorithm for translating regular XPATH queries to SQL queries with LFP, and optimization techniques for minimizing the use of the LFP operator. The novelty of our approach consists in its capability to answer a large class of XPATH queries by means of only low-end RDBMS features already available in most RDBMS. Our experimental results verify the effectiveness of our techniques. VLDB Optimizing Refresh of a Set of Materialized Views. Nathan Folkert,Abhinav Gupta,Andrew Witkowski,Sankar Subramanian,Srikanth Bellamkonda,Shrikanth Shankar,Tolga Bozkaya,Lei Sheng 2005 In many data warehousing environments, it is common to have materialized views (MVs) at different levels of aggregation of one or more dimensions. The extreme case of this is relational OLAP environments, where, for performance reasons, nearly all levels of aggregation across all dimensions may be computed and stored in MVs. Furthermore, base tables and MVs are usually partitioned for ease and speed of maintenance. In these scenarios, updates to the base table are done using Bulk or Partition operations like add, exchange, truncate and drop partition. If changes to base tables can be tracked at the partition level, join dependencies, functional dependencies, and query rewrite can be used to optimize refresh of an individual MV. The refresh optimizer, in the presence of partitioned tables and MVs, may recognize dependencies between the base table and the MV partitions, leading to the generation of very efficient refresh expressions. Additionally, in the presence of multiple MVs, the refresh subsystem can come up with an optimal refresh schedule such that MVs can be refreshed using query rewrite against previously refreshed MVs. This makes the database server more manageable and user-friendly since a single function call can optimally refresh all the MVs in the system. VLDB Semantic Adaptation of Schema Mappings when Schemas Evolve. Cong Yu,Lucian Popa 2005 Schemas evolve over time to accommodate the changes in the information they represent. Such evolution causes invalidation of various artifacts depending on the schemas, such as schema mappings. In a heterogeneous environment, where cooperation among data sources depends essentially upon them, schema mappings must be adapted to reflect schema evolution. In this study, we explore the mapping composition approach for addressing this mapping adaptation problem. We study the semantics of mapping composition in the context of mapping adaptation and compare our approach with the incremental approach of Velegrakis et al [21]. We show that our method is superior in terms of capturing the semantics of both the original mappings and the evolution. We design and implement a mapping adaptation system based on mapping composition as well as additional mapping pruning techniques that significantly speed up the adaptation. We conduct a comprehensive experimental analysis and show that the composition approach is practical in various evolution scenarios. The mapping language that we consider is a nested relational extension of the second-order dependencies of Fagin et al [7].
Our work can also be seen as an implementation of the mapping composition operator of the model management framework. VLDB Cache-conscious Frequent Pattern Mining on a Modern Processor. Amol Ghoting,Gregory Buehrer,Srinivasan Parthasarathy,Daehyun Kim,Anthony D. Nguyen,Yen-Kuang Chen,Pradeep Dubey 2005 In this paper, we examine the performance of frequent pattern mining algorithms on a modern processor. A detailed performance study reveals that even the best frequent pattern mining implementations, with highly efficient memory managers, still grossly under-utilize a modern processor. The primary performance bottlenecks are poor data locality and low instruction level parallelism (ILP). We propose a cache-conscious prefix tree to address this problem. The resulting tree improves spatial locality and also enhances the benefits from hardware cache line prefetching. Furthermore, the design of this data structure allows the use of a novel tiling strategy to improve temporal locality. The result is an overall speedup of up to 3.2 when compared with state-of-the-art implementations. We then show how these algorithms can be improved further by realizing a non-naive thread-based decomposition that targets simultaneously multi-threaded processors. A key aspect of this decomposition is to ensure cache re-use between threads that are co-scheduled at a fine granularity. This optimization affords an additional speedup of 50%, resulting in an overall speedup of up to 4.8. To the best of our knowledge, this effort is the first to target cache-conscious data mining. VLDB Discovering Large Dense Subgraphs in Massive Graphs. David Gibson,Ravi Kumar,Andrew Tomkins 2005 We present a new algorithm for finding large, dense subgraphs in massive graphs. Our algorithm is based on a recursive application of fingerprinting via shingles, and is extremely efficient, capable of handling graphs with tens of billions of edges on a single machine with modest resources. We apply our algorithm to characterize the large, dense subgraphs of a graph showing connections between hosts on the World Wide Web; this graph contains over 50M hosts and 11B edges, gathered from 2.1B web pages. We measure the distribution of these dense subgraphs and their evolution over time. We show that more than half of these hosts participate in some dense subgraph found by the analysis. There are several hundred giant dense subgraphs of at least ten thousand hosts; two thousand dense subgraphs of at least a thousand hosts; and almost 64K dense subgraphs of at least a hundred hosts. Upon examination, many of the dense subgraphs output by our algorithm are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content. We therefore propose dense subgraph extraction as a useful primitive for spam detection, and discuss its incorporation into the workflow of web search engines. VLDB Scaling and Time Warping in Time Series Querying. Ada Wai-Chee Fu,Eamonn J. Keogh,Leo Yung Hang Lau,Chotirat (Ann) Ratanamahatana 2005 The last few years have seen an increasing understanding that Dynamic Time Warping (DTW), a technique that allows local flexibility in aligning time series, is superior to the ubiquitous Euclidean Distance for time series classification, clustering, and indexing. More recently, it has been shown that for some problems, Uniform Scaling (US), a technique that allows global scaling of time series, may be just as important.
In this work, we note that for many real-world problems, it is necessary to combine both DTW and US to achieve meaningful results. This is particularly true in domains where we must account for the natural variability of human action, including biometrics, query by humming, motion-capture/animation, and handwriting recognition. We introduce the first technique which can handle both DTW and US simultaneously, and demonstrate its utility and effectiveness on a wide range of problems in industry, medicine, and entertainment. VLDB Parameter Free Bursty Events Detection in Text Streams. Gabriel Pui Cheong Fung,Jeffrey Xu Yu,Philip S. Yu,Hongjun Lu 2005 Text classification is a major data mining task. An advanced text classification technique is known as partially supervised text classification, which can build a text classifier using a small set of positive examples only. This raises the question of whether it is possible to find a set of features that can be used to describe the positive examples. Therefore, users do not even need to specify a set of positive examples. As the first step, in this paper, we formalize it as a new problem, called hot bursty events detection, to detect bursty events from a text stream which is a sequence of chronologically ordered documents. Here, a bursty event is a set of bursty features, and is considered as a potential category to build a text classifier. It is important to note that the hot bursty events detection problem we study in this paper is different from TDT (topic detection and tracking) which attempts to cluster documents as events using clustering techniques. In other words, our focus is on detecting a set of bursty features for a bursty event. In this paper, we propose a novel parameter-free probabilistic approach, called feature-pivot clustering. Our main technique is to fully utilize the time information to determine a set of bursty features which may occur in different time windows. We detect bursty events based on the feature distributions. There is no need to tune or estimate any parameters. We conduct experiments using real-life data, a major English newspaper in Hong Kong, and show that the parameter-free feature-pivot clustering approach can detect the bursty events with a high success rate. VLDB Maximal Vector Computation in Large Data Sets. Parke Godfrey,Ryan Shipley,Jarek Gryz 2005 Finding the maximals in a collection of vectors is relevant to many applications. The maximal set is related to the convex hull---and hence, linear optimization---and nearest neighbors. The maximal vector problem has resurfaced with the advent of skyline queries for relational databases and skyline algorithms that are external and relationally well behaved. The initial algorithms proposed for maximals are based on divide-and-conquer. These established good average and worst case asymptotic running times, showing it to be O(n) average-case, where n is the number of vectors. However, they are not amenable to externalizing. We prove, furthermore, that their performance is quite bad with respect to the dimensionality, k, of the problem. We demonstrate that the more recent external skyline algorithms are actually better behaved, although they do not have as good an apparent asymptotic complexity. We introduce a new external algorithm, LESS, that combines the best features of these; we experimentally evaluate its effectiveness and improvement over the field, and prove that its average-case running time is O(kn).
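To make the maximal-vector (skyline) problem from the Godfrey, Shipley, and Gryz abstract above concrete, the short Python sketch below computes the maximal vectors of a small in-memory collection with a naive quadratic window scan. It only illustrates the problem definition under the assumption that larger values are better in every dimension; it is not the LESS algorithm itself, which additionally uses external sorting and an elimination-filter window, and the function names and example data are ours.

def dominates(a, b):
    # a dominates b when a is at least as good in every dimension
    # and strictly better in at least one (larger values are better here).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def maximal_vectors(vectors):
    # Naive window scan: keep only vectors not dominated by any vector seen so far.
    window = []
    for v in vectors:
        if any(dominates(w, v) for w in window):
            continue  # v is dominated by something already kept
        window = [w for w in window if not dominates(v, w)]  # evict vectors that v dominates
        window.append(v)
    return window

if __name__ == "__main__":
    # Hypothetical 2-d example: each tuple is (dimension_1, dimension_2).
    points = [(1, 9), (3, 7), (5, 5), (4, 6), (2, 2), (6, 1)]
    print(maximal_vectors(points))  # -> [(1, 9), (3, 7), (5, 5), (4, 6), (6, 1)]

Under this convention a vector is maximal exactly when no other vector dominates it, which matches the skyline-query semantics referenced in the abstract; the quadratic scan is only a baseline against which externalized algorithms such as LESS are compared.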
VLDB ConQuer: A System for Efficient Querying Over Inconsistent Databases. Ariel Fuxman,Diego Fuxman,Renée J. Miller 2005 Although integrity constraints have long been used to maintain data consistency, there are situations in which they may not be enforced or satisfied. In this demo, we showcase ConQuer, a system for efficient and scalable answering of SQL queries on databases that may violate a set of constraints. ConQuer permits users to postulate a set of key constraints together with their queries. The system rewrites the queries to retrieve all (and only) data that is consistent with respect to the constraints. The rewriting is into SQL, so the rewritten queries can be efficiently optimized and executed by commercial database systems. VLDB Database Change Notifications: Primitives for Efficient Database Query Result Caching. César A. Galindo-Legaria,Torsten Grabs,Christian Kleinerman,Florian Waas 2005 Many database applications implement caching of data from a back-end database server to avoid repeated round trips to the back-end and to improve response times for end-user requests. For example, consider a web application that caches dynamic web content in the mid-tier [3, 2]. The content of dynamic web pages is usually assembled from data stored in the underlying database system and subject to modification whenever the data sources are modified. The workload is ideal for caching query results: most queries are read-only (browsing sessions) and only a small portion of the queries are actually modifying data. Caching at the mid-tier helps off-load the back-end database servers and can increase scalability of a distributed system drastically. VLDB The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents. Jens Graupmann,Ralf Schenkel,Gerhard Weikum 2005 This paper presents the novel SphereSearch Engine that provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure conditions, text content conditions, and relevance ranking based on IR statistics and statistically quantified ontological relationships. Web pages in HTML or PDF are automatically converted into XML format, with the option of generating semantic tags by means of linguistic annotation tools. For Web data the XML-oriented query engine is leveraged to provide very rich search options that cannot be expressed in traditional Web search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch engine are demonstrated by experiments with a large and richly tagged but non-schematic open encyclopedia extended with external documents. VLDB Consistency for Web Services Applications. Paul Greenfield,Dean Kuo,Surya Nepal,Alan Fekete 2005 A key challenge facing the designers of service-oriented applications is ensuring that the autonomous services that make up these distributed applications always finish in consistent states despite application-level failures and other exceptional events. This paper addresses this problem by first describing the relationship between internal service states, messages and application protocols and then shows how this relationship transforms the problem of ensuring consistent outcomes into a correctness problem that can be addressed with established protocol verification tools. VLDB Space Efficiency in Synopsis Construction Algorithms. 
Sudipto Guha 2005 "Histograms and Wavelet synopses have been found to be useful in query optimization, approximate query answering and mining. Over the last few years several good synopsis algorithms have been proposed. These have mostly focused on the running time of the synopsis constructions, optimum or approximate, vis-a-vis their quality. However, the space complexity of synopsis construction algorithms has not been investigated as thoroughly. Many of the optimum synopsis construction algorithms (as well as a few of the approximate ones) are expensive in space. In this paper, we propose a general technique that reduces space complexity. We show that the notion of ""working space"" proposed in these contexts is redundant. We believe that our algorithm also generalizes to a broader range of dynamic programs beyond synopsis construction. Our modifications can be easily adapted to existing algorithms. We demonstrate the performance benefits through experiments on real-life and synthetic data." VLDB Offline and Data Stream Algorithms for Efficient Computation of Synopsis Structures. Sudipto Guha,Kyuseok Shim 2005 Synopsis and small space representations are important data analysis tools and have long been used in OLAP/DSS systems, approximate query answering, query optimization and data mining. These techniques represent the input in terms of broader characteristics and improve the efficiency of various applications, e.g., learning, classification, event detection, among many others. In the recent past, synopsis techniques have gained more currency due to emerging areas like data stream management. In this tutorial, we propose to revisit algorithms for Wavelet and Histogram synopsis construction. In recent years, a significant number of papers have appeared which have advanced the state-of-the-art in synopsis construction considerably. In particular, we have seen the development of a large number of efficient algorithms which are also guaranteed to be near optimal. Furthermore, these synopsis construction problems have found deep roots in theory and database systems, and have influenced a wide range of problems. At a different level, a large number of the synopsis construction algorithms use a similar set of techniques. It is extremely valuable to discuss and analyze these techniques, and we expect broader pictures and paradigms to emerge. This would allow us to develop algorithms for newer problems with greater ease. Understanding these recurrent themes and the intuition behind the development of these algorithms is one of the main thrusts of the tutorial. Our goal will be to cover a wide spectrum of these topics and make researchers in the VLDB community aware of the new algorithms, optimum or approximate, offline or streaming. The tutorial will be self-contained and will develop most of the mathematical and database background needed. VLDB "Caching with 'Good Enough' Currency, Consistency, and Completeness." Hongfei Guo,Per-Åke Larson,Raghu Ramakrishnan 2005 SQL extensions that allow queries to explicitly specify data quality requirements in terms of currency and consistency were proposed in an earlier paper. This paper develops a data quality-aware, finer-grained cache model and studies cache design in terms of four fundamental properties: presence, consistency, completeness and currency. The model provides an abstract view of the cache to the query processing layer, and opens the door for adaptive cache management.
We describe an implementation approach that builds on the MTCache framework for partially materialized views. The optimizer checks most consistency constraints and generates a dynamic plan that includes currency checks and inexpensive checks for dynamic consistency constraints that cannot be validated during optimization. Our solution not only supports transparent caching but also provides fine grained data currency and consistency guarantees. VLDB Optimizing Nested Queries with Parameter Sort Orders. Ravindra Guravannavar,H. S. Ramanujam,S. Sudarshan 2005 Nested iteration is an important technique for query evaluation. It is the default way of executing nested subqueries in SQL. Although decorrelation often results in cheaper non-nested plans, decorrelation is not always applicable for nested subqueries. Nested iteration, if implemented properly, can also win over decorrelation for several classes of queries. Decorrelation is also hard to apply to nested iteration in user-defined SQL procedures and functions. Recent research has proposed evaluation techniques to speed up execution of nested iteration, but does not address the optimization issue. In this paper, we address the issue of exploiting the ordering of nested iteration/procedure calls to speed up nested iteration. We propose state retention of operators as an important technique to exploit the sort order of parameters/correlation variables. We then show how to efficiently extend an optimizer to take parameter sort orders into consideration. We implemented our evaluation techniques on PostgreSQL, and present performance results that demonstrate significant benefits. VLDB Link Spam Alliances. Zoltán Gyöngyi,Hector Garcia-Molina 2005 Link spam is used to increase the ranking of certain target web pages by misleading the connectivity-based ranking algorithms in search engines. In this paper we study how web pages can be interconnected in a spam farm in order to optimize rankings. We also study alliances, that is, interconnections of spam farms. Our results identify the optimal structures and quantify the potential gains. In particular, we show that alliances can be synergistic and improve the rankings of all participants. We believe that the insights we gain will be useful in identifying and combating link spam. VLDB Complex Spatio-Temporal Pattern Queries. Marios Hadjieleftheriou,George Kollios,Petko Bakalov,Vassilis J. Tsotras 2005 This paper introduces a novel type of query, what we name Spatio-temporal Pattern Queries (STP). Such a query specifies a spatiotemporal pattern as a sequence of distinct spatial predicates where the predicate temporal ordering (exact or relative) matters. STP queries can use various types of spatial predicates (range search, nearest neighbor, etc.) where each such predicate is associated (1) with an exact temporal constraint (a time-instant or a time-interval), or (2) more generally, with a relative order among the other query predicates. Using traditional spatiotemporal index structures for these types of queries would be either inefficient or not an applicable solution. Alternatively, we propose specialized query evaluation algorithms for STP queries With Time. We also present a novel index structure, suitable for STP queries With Order. Finally, we conduct a comprehensive experimental evaluation to show the merits of our techniques. VLDB Parallel Execution of Test Runs for Database Application Systems. 
Florian Haftmann,Donald Kossmann,Eric Lo 2005 In a recent paper [8], it was shown how tests for database application systems can be executed efficiently. The challenge was to control the state of the database during testing and to order the test runs in such a way that expensive reset operations that bring the database into the right state need to be executed as seldom as possible. This paper extends that work so that test runs can be executed in parallel. The goal is to achieve linear speed-up and/or exploit the available resources as well as possible. This problem is challenging because parallel testing can involve interference between the execution of concurrent test runs. VLDB Getting Priorities Straight: Improving Linux Support for Database I/O. Christoffer Hall,Philippe Bonnet 2005 The Linux 2.6 kernel supports asynchronous I/O as a result of propositions from the database industry. This is a positive evolution but is it a panacea? In the context of the Badger project, a collaboration between MySQL AB and the University of Copenhagen, we evaluate how MySQL/InnoDB can best take advantage of Linux asynchronous I/O and how Linux can help MySQL/InnoDB best take advantage of the underlying I/O bandwidth. This is a crucial problem for the increasing number of MySQL servers deployed for very large database applications. In this paper, we first show that the conservative I/O submission policy used by InnoDB (as well as Oracle 9.2) leads to an under-utilization of the available I/O bandwidth. We then show that introducing prioritized asynchronous I/O in Linux will allow MySQL/InnoDB and other Linux databases to fully utilize the available I/O bandwidth using a more aggressive I/O submission policy. VLDB WISE-Integrator: A System for Extracting and Integrating Complex Web Search Interfaces of the Deep Web. Hai He,Weiyi Meng,Clement T. Yu,Zonghuan Wu 2005 We demonstrate WISE-Integrator - an automatic search interface extraction and integration tool. The basic research issues behind this tool will also be explained. VLDB Supporting RFID-based Item Tracking Applications in Oracle DBMS Using a Bitmap Datatype. Ying Hu,Seema Sundara,Timothy Chorma,Jagannathan Srinivasan 2005 Radio Frequency Identification (RFID) based item-level tracking holds the promise of revolutionizing supply-chain, retail store, and asset management applications. However, the high volume of data generated by item-level tracking poses challenges to the applications as well as to backend databases. This paper addresses the problem of efficiently modeling identifier collections occurring in RFID-based item-tracking applications and databases. Specifically, 1) a bitmap datatype is introduced to compactly represent a collection of identifiers, and 2) a set of bitmap access and manipulation routines is provided. The proposed bitmap datatype can model a collection of generic identifiers, including 64-bit, 96-bit, and 256-bit Electronic Product Codes™ (EPCs), and it can be used to represent both transient and persistent identifier collections. Persistent identifier collections can be stored in a table as a column of bitmap datatype. An efficient primary B+-tree-based storage scheme is proposed for such columns. The bitmap datatype can be easily implemented by leveraging the DBMS bitmap index implementation, which typically manages bitmaps of table row identifiers.
This paper presents the bitmap datatype and related functionality, illustrates its usage in supporting RFID-based item-tracking applications, describes its prototype implementation in Oracle DBMS, and gives a performance study that characterizes the benefits of the bitmap datatype. VLDB Personalized Systems: Models and Methods from an IR and DB Perspective. Yannis E. Ioannidis,Georgia Koutrika 2005 "In today's knowledge-driven society, information abundance and personal electronic device ubiquity have made it difficult for users to find the right information at the right time and at the right level of detail. To solve this problem, researchers have developed systems that adapt their behavior to the goals, tasks, interests, and other characteristics of their users. Based on models that capture important user characteristics, these personalized systems maintain their users' profiles and take them into account to customize the content generated or its presentation to the different individuals." VLDB Customizable Parallel Execution of Scientific Stream Queries. Milena Ivanova,Tore Risch 2005 Scientific applications require processing high-volume on-line streams of numerical data from instruments and simulations. We present an extensible stream database system that allows scalable and flexible continuous queries on such streams. Application dependent streams and query functions are defined through an object-relational model. Distributed execution plans for continuous queries are described as high-level data flow distribution templates. Using a generic template we define two partitioning strategies for scalable parallel execution of expensive stream queries: window split and window distribute. Window split provides operators for parallel execution of query functions by reducing the size of stream data units using application dependent functions as parameters. By contrast, window distribute provides operators for customized distribution of entire data units without reducing their size. We evaluate these strategies for a typical high volume scientific stream application and show that window split is favorable when expensive queries are executed on limited resources, while window distribution is better otherwise. VLDB BATON: A Balanced Tree Structure for Peer-to-Peer Networks. H. V. Jagadish,Beng Chin Ooi,Quang Hieu Vu 2005 We propose a balanced tree structure overlay on a peer-to-peer network capable of supporting both exact queries and range queries efficiently. In spite of the tree structure causing distinctions to be made between nodes at different levels in the tree, we show that the load at each node is approximately equal. In spite of the tree structure providing precisely one path between any pair of nodes, we show that sideways routing tables maintained at each node provide sufficient fault tolerance to permit efficient repair. Specifically, in a network with N nodes, we guarantee that both exact queries and range queries can be answered in O(log N) steps and also that update operations (to both data and network) have an amortized cost of O(log N). An experimental assessment validates the practicality of our proposal. VLDB Online Estimation For Subset-Based SQL Queries. Chris Jermaine,Alin Dobra,Abhijit Pol,Shantanu Joshi 2005 The largest databases in use today are so large that answering a query exactly can take minutes, hours, or even days. One way to address this problem is to make use of approximation algorithms. 
Previous work on online aggregation has considered how to give online estimates with ever-increasing accuracy for aggregate functions over relational join and selection queries. However, no existing work is applicable to online estimation over subset-based SQL queries-those queries with a correlated subquery linked to an outer query via a NOT EXISTS, NOT IN, EXISTS, or IN clause (other queries such as EXCEPT and INTERSECT can also be seen as subset-based queries). In this paper we develop algorithms for online estimation over such queries, and consider the difficult problem of providing probabilistic accuracy guarantees at all times during query execution. VLDB Indexing Mixed Types for Approximate Retrieval. Liang Jin,Nick Koudas,Chen Li,Anthony K. H. Tung 2005 Indexing Mixed Types for Approximate Retrieval. VLDB Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin,Chen Li 2005 "Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as ""name similar to smith"" and ""telephone number similar to 412-0964."" Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called SEPIA, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates." VLDB A Heartbeat Mechanism and Its Application in Gigascope. Theodore Johnson,S. Muthukrishnan,Vladislav Shkapenyuk,Oliver Spatscheck 2005 "Data stream management systems often rely on ordering properties of tuple attributes in order to implement non-blocking operators. However, query operators that work with multiple streams, such as stream merge or join, can often still block if one of the input stream is very slow or bursty. In principle, punctuation and heartbeat mechanisms have been proposed to unblock streaming operators. In practice, it is a challenge to incorporate such mechanisms into a high-performance stream management system that is operational in an industrial application.In this paper, we introduce a system for punctuation-carrying heartbeat generation that we developed for Gigascope, a high-performance streaming database for network monitoring, that is operationally used within AT&T's IP backbone. We show how heartbeats can be regularly generated by low-level nodes in query execution plans and propagated upward unblocking all streaming operators on its way. 
Additionally, our heartbeat mechanism can be used for other applications in distributed settings such as detecting node failures, performance monitoring, and query optimization. A performance evaluation using live data feeds shows that our system is capable of working at multiple Gigabit line speeds in a live, industrial deployment and can significantly decrease the query memory utilization." VLDB Bidirectional Expansion For Keyword Search on Graph Databases. Varun Kacholia,Shashank Pandit,Soumen Chakrabarti,S. Sudarshan,Rushi Desai,Hrishikesh Karambelkar 2005 "Relational, XML and HTML data can be represented as graphs with entities as nodes and relationships as edges. Text is associated with nodes and possibly edges. Keyword search on such graphs has received much attention lately. A central problem in this scenario is to efficiently extract from the data graph a small number of the ""best"" answer trees. A Backward Expanding search, starting at nodes matching keywords and working up toward confluent roots, is commonly used for predominantly text-driven queries. But it can perform poorly if some keywords match many nodes, or some node has very large degree.In this paper we propose a new search algorithm, Bidirectional Search, which improves on Backward Expanding search by allowing forward search from potential roots towards leaves. To exploit this flexibility, we devise a novel search frontier prioritization technique based on spreading activation. We present a performance study on real data, establishing that Bidirectional Search significantly outperforms Backward Expanding search." VLDB One-Pass Wavelet Synopses for Maximum-Error Metrics. Panagiotis Karras,Nikos Mamoulis 2005 We study the problem of computing wavelet-based synopses for massive data sets in static and streaming environments. A compact representation of a data set is obtained after a thresholding process is applied on the coefficients of its wavelet decomposition. Existing polynomial-time thresholding schemes that minimize maximum error metrics are disadvantaged by impracticable time and space complexities and are not applicable in a data stream context. This is a cardinal issue, as the problem at hand in its most practically interesting form involves the time-efficient approximation of huge amounts of data, potentially in a streaming environment. In this paper we fill this gap by developing efficient and practicable wavelet thresholding algorithms for maximum-error metrics, for both a static and a streaming case. Our algorithms achieve near-optimal accuracy and superior runtime performance, as our experiments show, under frugal space requirements in both contexts. VLDB PrediCalc: A Logical Spreadsheet Management System. Michael Kassoff,Lee-Ming Zen,Ankit Garg,Michael R. Genesereth 2005 "Computerized spreadsheets are a great success. They are often touted in newspapers and magazine articles as the first ""killer app"" for personal computers. Over the years, they have proven their worth time and again. Today, they are used for managing enterprises of all sorts - from one-person projects to multi-institutional conglomerates." VLDB n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure. Min-Soo Kim,Kyu-Young Whang,Jae-Gil Lee,Min-Jae Lee 2005 The n-gram inverted index has two major advantages: language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. 
Nevertheless, the n-gram inverted index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GBytes show that the size of the n-gram/2L index is reduced by up to 1.9 ~ 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index. VLDB Database-Inspired Search. David Konopnicki,Oded Shmueli 2005 """W3QL: A Query Language for the WWW"", published in 1995, presented a language with several distinctive features. Employing existing indexes as access paths, it allowed the selection of documents using conditions on semi-structured documents and maintaining dynamic views of navigational queries. W3QL was capable of automatically filling out forms and navigating through them. Finally, in the SQL tradition, it was a declarative query language, that could be the subject of optimization.Ten years later, we examine some current trends in the domain of search, namely the emergence of system-level search services and of the semantic web. In this context, we explore whether W3QL's ideas are still relevant to help improve information search and retrieval. We identify two main environments for searching, the enterprise and the web at large. Both environments could benefit from database-inspired integration language, and an execution system that implements it." VLDB Approximate Joins: Concepts and Techniques. Nick Koudas,Divesh Srivastava 2005 The quality of the data residing in information repositories and databases gets degraded due to a multitude of reasons. Such reasons include typing mistakes during insertion (e.g., character transpositions), lack of standards for recording database fields (e.g., addresses), and various errors introduced by poor database design (e.g., missing integrity constraints). Data of poor quality can result in significant impediments to popular business practices: sending products or bills to incorrect addresses, inability to locate customer records during service calls, inability to correlate customers across multiple services, etc. VLDB StreamGlobe: Processing and Sharing Data Streams in Grid-Based P2P Infrastructures. Richard Kuntschke,Bernhard Stegmaier,Alfons Kemper,Angelika Reiser 2005 Data stream processing is currently gaining importance due to the developments in novel application areas like e-science, e-health, and e-business (considering RFID, for example). 
Focusing on e-science, it can be observed that scientific experiments and observations in many fields, e. g., in physics and astronomy, create huge volumes of data which have to be interchanged and processed. With experimental and observational data coming in particular from sensors, online simulations, etc., the data has an inherently streaming nature. Furthermore, continuing advances will result in even higher data volumes, rendering storing all of the delivered data prior to processing increasingly impractical. Hence, in such e-science scenarios, processing and sharing of data streams will play a decisive role. It will enable new possibilities for researchers, since they will be able to subscribe to interesting data streams of other scientists without having to set up their own devices or experiments. This results in much better utilization of expensive equipment such as telescopes, satellites, etc. Further, processing and sharing data streams on-the-fly in the network helps to reduce network traffic and to avoid network congestion. Thus, even huge streams of data can be handled efficiently by removing unnecessary parts early on, e. g., by early filtering and aggregation, and by sharing previously generated data streams and processing results. VLDB FiST: Scalable XML Document Filtering by Sequencing Twig Patterns. Joonho Kwon,Praveen Rao,Bongki Moon,Sukho Lee 2005 "In recent years, publish-subscribe (pub-sub) systems based on XML document filtering have received much attention. In a typical pub-sub system, subscribed users specify their interest in profiles expressed in the XPath language, and each new content is matched against the user profiles so that the content is delivered to only the interested subscribers. As the number of subscribed users and their profiles can grow very large, the scalability of the system is critical to the success of pub-sub services. In this paper, we propose a novel scalable filtering system called FiST (Filtering by Sequencing Twigs) that transforms twig patterns expressed in XPath and XML documents into sequences using Prüfer's method. As a consequence, instead of matching linear paths of twig patterns individually and merging the matches during post-processing, FiST performs holistic matching of twig patterns with incoming documents. FiST organizes the sequences into a dynamic hash based index for efficient filtering. We demonstrate that our holistic matching approach yields lower filtering cost and good scalability under various situations." VLDB View Matching for Outer-Join Views. Per-Åke Larson,Jingren Zhou 2005 Prior work on computing queries from materialized views has focused on views defined by expressions consisting of selection, projection, and inner joins, with an optional aggregation on top (SPJG views). This paper provides the first view matching algorithm for views that may also contain outer joins (SPOJG views). The algorithm relies on a normal form for SPOJ expressions and does not use bottom-up syntactic matching of expressions. It handles any combination of inner and outer joins, deals correctly with SQL bag semantics and exploits not-null constraints, uniqueness constraints and foreign key constraints. VLDB Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results. Ramon Lawrence 2005 Minimizing both the response time to produce the first few thousand results and the overall execution time is important for interactive querying. 
Current join algorithms either minimize the execution time at the expense of response time or minimize response time by producing results early without optimizing the total time. We present a hash-based join algorithm, called early hash join, which can be dynamically configured at any point during join processing to tradeoff faster production of results for overall execution time. We demonstrate that varying how inputs are read has a major effect on these two factors and provide formulas that allow an optimizer to calculate the expected rate of join output and the number of I/O operations performed using different input reading strategies. Experimental results show that early hash join performs significantly fewer I/O operations and executes faster than other early join algorithms, especially for one-to-many joins. Its overall execution time is comparable to standard hybrid hash join, but its response time is an order of magnitude faster. Thus, early hash join can replace hybrid hash join in any situation where a fast initial response time is beneficial without the penalty in overall execution time exhibited by other early join algorithms. VLDB QoS-based Data Access and Placement for Federated Information Systems. Wen-Syan Li,Vishal S. Batra,Vijayshankar Raman,Wei Han,Inderpal Narang 2005 QoS-based Data Access and Placement for Federated Information Systems. VLDB Hubble: An Advanced Dynamic Folder Technology for XML. Ning Li,Joshua Hui,Hui-I Hsiao,Kevin S. Beyer 2005 A significant amount of information is stored in computer systems today, but people are struggling to manage their documents such that the information is easily found. XML is a de-facto standard for content publishing and data exchange. The proliferation of XML documents has created new challenges and opportunities for managing document collections. Existing technologies for automatically organizing document collections are either imprecise or based on only simple criteria. Since XML documents are self describing, it is now possible to automatically categorize XML documents precisely, according to their content. With the availability of the standard XML query languages, e.g. XQuery, much more powerful folder technologies are now feasible. To address this new challenge and exploit this new opportunity, this paper proposes a new and powerful dynamic folder mechanism, called Hubble. Hubble fully exploits the rich data model and semantic information embedded in the XML documents to build folder hierarchies dynamically and to categorize XML collections precisely. Besides supporting basic folder operations, Hubble also provides advanced features such as multi-path navigation and folder traversal across multiple document collections. Our performance study shows that Hubble is both efficient and scalable. Thus, it is an ideal technology for automating the process of organizing and categorizing XML documents. VLDB RankSQL: Supporting Ranking Queries in Relational Database Management Systems. Chengkai Li,Mohamed A. Soliman,Kevin Chen-Chuan Chang,Ihab F. Ilyas 2005 Ranking queries (or top-k queries) are dominant in many emerging applications, e.g., similarity queries in multimedia databases, searching Web databases, middleware, and data mining. The increasing importance of top-k queries warrants an efficient support of ranking in the relational database management system (RDBMS) and has recently gained the attention of the research community. 
Top-k queries aim at providing only the top k query results, according to a user-specified ranking function, which in many cases is an aggregate of multiple criteria. VLDB CXHist: An On-line Classification-Based Histogram for XML String Selectivity Estimation. Lipyeow Lim,Min Wang,Jeffrey Scott Vitter 2005 "Query optimization in IBM's System RX, the first truly relational-XML hybrid data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For most real XML data, the number of distinct string values at the leaf nodes is orders of magnitude larger than the set of distinct rooted tag paths. Hence, the real challenge lies in accurate selectivity estimation of the string predicates on the leaf values reachable via a given path. In this paper, we present CXHist, a novel workload-aware histogram technique that provides accurate selectivity estimation on a broad class of XML string-based queries. CXHist builds a histogram in an on-line manner by grouping queries into buckets using their true selectivity obtained from query feedback. The set of queries associated with each bucket is summarized into feature distributions. These feature distributions mimic a Bayesian classifier that is used to route a query to its associated bucket during selectivity estimation. We show how CXHist can be used for two general types of queries: exact match queries and substring match queries. Experiments using a prototype show that CXHist provides accurate selectivity estimation for both exact match queries and substring match queries." VLDB Revisiting Pipelined Parallelism in Multi-Join Query Processing. Bin Liu,Elke A. Rundensteiner 2005 Multi-join queries are the core of any integration service that integrates data from multiple distributed data sources. Due to the large number of data sources and possibly high volumes of data, the evaluation of multi-join queries faces increasing scalability concerns. State-of-the-art parallel multi-join query processing commonly assumes that the application of maximal pipelined parallelism leads to superior performance. In this paper, we instead illustrate that this assumption does not generally hold. We investigate how best to combine pipelined parallelism with alternate forms of parallelism to achieve an overall effective processing strategy. A segmented bushy processing strategy is proposed. Experimental studies are conducted on an actual software system over a cluster of high-performance PCs. The experimental results confirm that the proposed solution leads to about 50% improvement in terms of total processing time in comparison to existing state-of-the-art solutions. VLDB A Dynamically Adaptive Distributed System for Processing Complex Continuous Queries. Bin Liu,Yali Zhu,Mariana Jbantova,Bradley Momberger,Elke A. Rundensteiner 2005 Recent years have witnessed rapidly growing research attention on continuous query processing over streams [2, 3]. A continuous query system can easily run out of resources in the case of large amounts of input stream data. Distributed continuous query processing over a shared nothing architecture, i.e., a cluster of machines, has been recognized as a scalable method to solve this problem [2, 8, 9].
Due to the lack of initial cost information and the fluctuating nature of the streaming data, uneven workload among machines may occur and this may impair the benefits of distributed processing. Thus, dynamic adaptation techniques are crucial for a distributed continuous query system. VLDB From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. Jiaheng Lu,Tok Wang Ling,Chee Yong Chan,Ting Chen 2005 Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. A number of algorithms have been proposed to process a twig query based on the region encoding labeling scheme. While region encoding supports efficient determination of the structural relationship between two elements, we observe that the information within a single label is very limited. In this paper, we propose a new labeling scheme, called extended Dewey. This is a powerful labeling scheme, since from the label of an element alone, we can derive all the element names along the path from the root to the element. Based on extended Dewey, we design a novel holistic twig join algorithm, called TJFast. Unlike all previous algorithms based on region encoding, to answer a twig query, TJFast only needs to access the labels of the leaf query nodes. Through this, not only do we reduce disk access, but we also support the efficient evaluation of queries with wildcards in branching nodes, which is very difficult to answer with algorithms based on region encoding. Finally, we report our experimental results to show that our algorithms are superior to previous approaches in terms of the number of elements scanned, the size of intermediate results and query performance. VLDB Query Caching and View Selection for XML Databases. Bhushan Mandhani,Dan Suciu 2005 In this paper, we propose a method for maintaining a semantic cache of materialized XPath views. The cached views include queries that have been previously asked, and additional selected views. The cache can be stored inside or outside the database. We describe a notion of XPath query/view answerability, which allows us to reduce tree operations to string operations for matching a query/view pair. We show how to store and maintain the cached views in relational tables, so that cache lookup is very efficient. We also describe a technique for view selection, given a warm-up workload. We experimentally demonstrate the efficiency of our caching techniques, and performance gains obtained by employing such a cache. VLDB Consistently Estimating the Selectivity of Conjuncts of Predicates. Volker Markl,Nimrod Megiddo,Marcel Kutsch,Tam Minh Tran,Peter J. Haas,Utkarsh Srivastava 2005 "Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics (MVS) to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results.
In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method efficiently exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Our implementation using a prototype version of DB2 UDB shows that ME improves the optimizer's cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times." VLDB The Integrated Microbial Genomes (IMG) System: A Case Study in Biological Data Management. Victor M. Markowitz,Frank Korzeniewski,Krishna Palaniappan,Ernest Szeto,Natalia Ivanova,Nikos Kyrpides 2005 Biological data management includes the traditional areas of data generation, acquisition, modelling, integration, and analysis. Although numerous academic biological data management systems are currently available, employing them effectively remains a significant challenge. We discuss how this challenge was addressed in the course of developing the Integrated Microbial Genomes (IMG) system for comparative analysis of microbial genome data. VLDB Mapping Maintenance for Data Integration Systems. Robert McCann,Bedoor K. AlShebli,Quoc Le,Hoa Nguyen,Long Vu,AnHai Doan 2005 To answer user queries, a data integration system employs a set of semantic mappings between the mediated schema and the schemas of data sources. In dynamic environments sources often undergo changes that invalidate the mappings. Hence, once the system is deployed, the administrator must monitor it over time, to detect and repair broken mappings. Today such continuous monitoring is extremely labor intensive, and poses a key bottleneck to the widespread deployment of data integration systems in practice.We describe MAVERIC, an automatic solution to detecting broken mappings. At the heart of MAVERIC is a set of computationally inexpensive modules called sensors, which capture salient characteristics of data sources (e.g., value distributions, HTML layout properties). We describe how MAVERIC trains and deploys the sensors to detect broken mappings. Next we develop three novel improvements: perturbation (i.e., injecting artificial changes into the sources) and multi-source training to improve detection accuracy, and filtering to further reduce the number of false alarms. Experiments over 114 real-world sources in six domains demonstrate the effectiveness of our sensor-based approach over existing solutions, as well as the utility of our improvements. VLDB Using a Fuzzy Classification Query Language for Customer Relationship Management. Andreas Meier,Nicolas Werro,Martin Albrecht,Miltiadis Sarakinos 2005 A key challenge for companies is to manage customer relationships as an asset. To create an effective toolkit for the analysis of customer relationships, a combination of relational databases and fuzzy logic is proposed. The fuzzy Classification Query Language allows marketers to improve customer equity, launch loyalty programs, automate mass customization, and refine marketing campaigns. VLDB Using Association Rules for Fraud Detection in Web Advertising Networks. Ahmed Metwally,Divyakant Agrawal,Amr El Abbadi 2005 Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. 
We develop an algorithm, Streaming-Rules, to report association rules with tight guarantees on errors, using limited processing per element, and minimal space. The modular design of Streaming-Rules allows for integration with current stream management systems, since it employs existing techniques for finding frequent elements. The presentation emphasizes the applicability of the algorithm to fraud detection in advertising networks. Such fraud instances have not been successfully detected by current techniques. Our experiments on synthetic data demonstrate scalability and efficiency. On real data, potential fraud was discovered. VLDB KLEE: A Framework for Distributed Top-k Query Algorithms. Sebastian Michel,Peter Triantafillou,Gerhard Weikum 2005 This paper addresses the efficient processing of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We present KLEE, a novel algorithmic framework for distributed top-k queries, designed for high performance and flexibility. KLEE makes a strong case for approximate top-k algorithms over widely distributed data sources. It shows how great gains in efficiency can be enjoyed at low result-quality penalties. Further, KLEE affords the query-initiating peer the flexibility to trade off result quality and expected performance and to trade off the number of communication phases engaged during query execution versus network bandwidth performance. We have implemented KLEE and related algorithms and conducted a comprehensive performance evaluation. Our evaluation employed real-world and synthetic large, web-data collections, and query benchmarks. Our experimental results show that KLEE can achieve major performance gains in terms of network bandwidth, query response times, and much lighter peer loads, all with small errors in result precision and other result-quality measures. VLDB SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines. Boriana L. Milenova,Joseph Yarmus,Marcos M. Campos 2005 "Contemporary commercial databases are placing an increased emphasis on analytic capabilities. Data mining technology has become crucial in enabling the analysis of large volumes of data. Modern data mining techniques have been shown to have high accuracy and good generalization to novel data. However, achieving results of good quality often requires high levels of user expertise. Support Vector Machines (SVM) is a powerful state-of-the-art data mining algorithm that can address problems not amenable to traditional statistical analysis. Nevertheless, its adoption remains limited due to methodological complexities, scalability challenges, and scarcity of production quality SVM implementations. This paper describes Oracle's implementation of SVM where the primary focus lies on ease of use and scalability while maintaining high performance accuracy. SVM is fully integrated within the Oracle database framework and thus can be easily leveraged in a variety of deployment scenarios." VLDB Tree-Pattern Queries on a Lightweight XML Processor. Mirella Moura Moro,Zografoula Vagena,Vassilis J. Tsotras 2005 "Popular XML languages, like XPath, use "tree-pattern" queries to select nodes based on their structural characteristics.
While many processing methods have already been proposed for such queries, none of them has found its way to any of the existing "lightweight" XML engines (i.e. engines without optimization modules). The main reason is the lack of a systematic comparison of query methods under a common storage model. In this work, we aim to fill this gap and answer two important questions: what the relative similarities and important differences among the tree-pattern query methods are, and whether there is a prominent method among them in terms of effectiveness and robustness that an XML processor should support. For the first question, we propose a novel classification of the methods according to their matching process. We then describe a common storage model and demonstrate that the access pattern of each class conforms or can be adapted to conform to this model. Finally, we perform an experimental evaluation to compare their relative performance. Based on the evaluation results, we conclude that the family of holistic processing methods, which provides performance guarantees, is the most robust alternative for such an environment." VLDB Answering Imprecise Queries over Web Databases. Ullas Nambiar,Subbarao Kambhampati 2005 "The rapid expansion of the World Wide Web has made a large number of databases, like bibliographies and scientific databases, accessible to lay users demanding "instant gratification". Often, these users may not know how to precisely express their needs and may formulate queries that lead to unsatisfactory results." VLDB Native XML Support in DB2 Universal Database. Matthias Nicola,Bert Van der Linden 2005 "The major relational database systems have been providing XML support for several years, predominantly by mapping XML to existing concepts such as LOBs or (object-)relational tables. The limitations of these approaches are well known in research and industry. Thus, a forthcoming version of DB2 Universal Database® is enhanced with comprehensive native XML support. "Native" means that XML documents are stored on disk pages in tree structures matching the XML data model. This avoids the mapping between XML and relational structures, and the corresponding limitations. The native XML storage is complemented with XML indexes, full XQuery, SQL/XML, and XML Schema support, as well as utilities such as a parallel high-speed XML bulk loader. This makes DB2 a true hybrid database system which places equal weight on XML and relational data management." VLDB Contextual Insight in Search: Enabling Technologies and Applications. Aleksander Øhrn 2005 Contextual Insight in Search: Enabling Technologies and Applications. VLDB Why Search Engines are Used Increasingly to Offload Queries from Databases. Bjørn Olstad 2005 The development of future search engine technology is no longer limited to free text. Rather, the aim is to build core indexing services that focus on extreme performance and scalability for retrieval and analysis across structured and unstructured data sources alike. In addition, binary query evaluation is being replaced with advanced frameworks that provide both fuzzy matching and ranking schemes, to separate value from noise. As another trend, analytical applications are being enabled by the computation of contextual concept relationships across billions of documents/records on-the-fly. Based on these developments in search engine technology, a set of new information retrieval infrastructure patterns are appearing: 1.
the mirroring of DB content into a search engine in order to improve query capacity and user experience, 2. the use of search engine technology as the default access pattern to both structured and unstructured data in applications such as CRM and storage and document management, and 3. a predicted paradigm shift in business intelligence. The presentation will review key trends from search engine development and relate these to concrete user scenarios. VLDB XQuery Implementation in a Relational Database System. Shankar Pal,Istvan Cseri,Oliver Seeliger,Michael Rys,Gideon Schaller,Wei Yu,Dragan Tomic,Adrian Baras,Brandon Berg,Denis Churin,Eugene Kogan 2005 "Many enterprise applications prefer to store XML data as a rich data type, i.e. a sequence of bytes, in a relational database system to avoid the complexity of decomposing the data into a large number of tables and the cost of reassembling the XML data. The upcoming release of Microsoft's SQL Server supports XQuery as the query language over such XML data using its relational infrastructure. XQuery is an emerging W3C recommendation for querying XML data. It provides a set of language constructs (FLWOR), the ability to dynamically shape the query result, and a large set of functions and operators. It includes the emerging W3C recommendation XPath 2.0 for path-based navigational access. XQuery's type system is compatible with that of XML Schema and allows static type checking. This paper describes the experiences and the challenges in implementing XQuery in Microsoft's SQL Server 2005. XQuery language constructs are compiled into an enhanced set of relational operators while preserving the semantics of XQuery. The query tree is optimized using relational optimization techniques, such as cost-based decisions, and rewrite rules based on XML schemas. Novel techniques are used for efficiently managing document order and XML hierarchy." VLDB Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results. Sandeep Pandey,Sourashis Roy,Christopher Olston,Junghoo Cho,Soumen Chakrabarti 2005 "In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is closely correlated with quality, a more elusive concept that is difficult to measure directly. Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have yet to receive many visits and/or in-links. Worse, since discovery of new content is largely done by querying search engines, and because users usually focus their attention on the top few results, newly-created but high-quality pages are effectively "shut out," and it can take a very long time before they become popular. We propose a simple and elegant solution to this problem: the introduction of a controlled amount of randomness into search result ranking methods. Doing so offers new pages a chance to prove their worth, although clearly using too much randomness will degrade result quality and annul any benefits achieved. Hence there is a tradeoff between exploration to estimate the quality of new pages and exploitation of pages already known to be of high quality. We study this tradeoff both analytically and via simulation, in the context of an economic objective function based on aggregate result quality amortized over time. We show that a modest amount of randomness leads to improved search results."
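The partially randomized ranking idea described in the preceding abstract can be illustrated with a small sketch. This is only an illustrative reading of the abstract, not the authors' actual method: the blending parameter epsilon, the scoring scheme, and the function name are assumptions made for the example.

import random

def randomized_rank(pages, epsilon=0.1, k=10, seed=None):
    # Rank pages mostly by popularity, but reserve a fraction of the
    # top-k slots for randomly chosen lower-ranked pages so that new,
    # not-yet-popular pages get a chance to be seen (exploration vs.
    # exploitation). pages: list of (page_id, popularity_score) pairs.
    rng = random.Random(seed)
    by_popularity = sorted(pages, key=lambda p: p[1], reverse=True)
    n_random = int(round(epsilon * k))        # exploration slots
    n_top = k - n_random                      # exploitation slots
    top = by_popularity[:n_top]
    rest = by_popularity[n_top:]
    explore = rng.sample(rest, min(n_random, len(rest)))
    return top + explore

# Example: with epsilon = 0.2 and k = 10, two of the ten result slots
# go to randomly selected lower-popularity pages.
pages = [("page%d" % i, s) for i, s in enumerate([90, 80, 70, 60, 50, 40, 30, 20, 10, 5])]
print(randomized_rank(pages, epsilon=0.2, k=10))

Setting epsilon to zero recovers purely popularity-based ranking, while larger values trade more immediate result quality for faster discovery of new pages, which is the tradeoff the abstract studies.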
VLDB Streaming Pattern Discovery in Multiple Time-Series. Spiros Papadimitriou,Jimeng Sun,Christos Faloutsos 2005 In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is any-time, single pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively. VLDB Pattern Tree Algebras: Sets or Sequences? Stelios Paparizos,H. V. Jagadish 2005 XML and XQuery semantics are very sensitive to the order of the produced output. Although pattern-tree based algebraic approaches are becoming more and more popular for evaluating XML, there is no universally accepted technique which can guarantee both a correct output order and a choice of efficient alternative plans. We address the problem using hybrid collections of trees that can be either sets or sequences or something in between. Each such collection is coupled with an Ordering Specification that describes how the trees are sorted (full, partial or no order). This provides us with a formal basis for developing a query plan having parts that maintain no order and parts with partial or full order. It turns out that duplicate elimination introduces some of the same issues as order maintenance: it is expensive and a single collection type does not always provide all the flexibility required to optimize this properly. To solve this problem we associate with each hybrid collection a Duplicate Specification that describes the presence or absence of duplicate elements in it. We show how to extend an existing bulk tree algebra, TLC [12], to use Ordering and Duplicate specifications and produce correctly ordered results. We also suggest some optimizations enabled by the flexibility of our approach, and experimentally demonstrate the performance increase due to them. VLDB Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. Jian Pei,Wen Jin,Martin Ester,Yufei Tao 2005 The skyline operator is important for multi-criteria decision making applications. Although many recent studies developed efficient methods to compute skyline objects in a specific space, the fundamental problem on the semantics of skylines remains open: Why and in which subspaces is (or is not) an object in the skyline? Practically, users may also be interested in the skylines in any subspaces. Then, what is the relationship between the skylines in the subspaces and those in the super-spaces? How can we effectively analyze the subspace skylines? Can we efficiently compute skylines in various subspaces? In this paper, we investigate the semantics of skylines, propose the subspace skyline analysis, and extend the full-space skyline computation to subspace skyline computation. We introduce a novel notion of skyline group which essentially is a group of objects that are coincidentally in the skylines of some subspaces. We identify the decisive subspaces that qualify skyline groups in the subspace skylines.
The new notions concisely capture the semantics and the structures of skylines in various subspaces. Multidimensional roll-up and drill-down analysis is introduced. We also develop an efficient algorithm, Skyey, to compute the set of skyline groups and, for each subspace, the set of objects that are in the subspace skyline. A performance study is reported to evaluate our approach. VLDB CMS-ToPSS: Efficient Dissemination of RSS Documents. Milenko Petrovic,Haifeng Liu,Hans-Arno Jacobsen 2005 "Recent years have seen a rise in the number of unconventional publishing tools on the Internet. Tools such as wikis, blogs, discussion forums, and web-based content management systems have experienced a tremendous rise in popularity and use; primarily because they provide something traditional tools do not: ease of use for non-computer-oriented users, and they are based on the idea of "collaboration." It is estimated, by pewinternet.org, that 32 million people in the US read blogs (which represents 27% of the estimated 120 million US Internet users) while 8 million people have said that they have created blogs." VLDB Large Scale Data Warehouses on Grid: Oracle Database 10g and HP ProLiant Systems. Meikel Poess,Raghunath Othayoth Nambiar 2005 Grid computing has the potential to drastically change enterprise computing as we know it today. The main concept of grid computing is viewing computing as a utility. It should not matter where data resides, or what computer processes a task. This concept has been applied successfully to academic research. It also has many advantages for commercial data warehouse applications such as virtualization, flexible provisioning, reduced cost due to commodity hardware, high availability and high scale-out. In this paper we show how a large-scale, high-performing and scalable grid-based data warehouse can be implemented using commodity hardware (industry-standard x86-based), Oracle Database 10g, and the Linux operating system. We further demonstrate this architecture in a recently published TPC-H benchmark. VLDB Parallel Querying with Non-Dedicated Computers. Vijayshankar Raman,Wei Han,Inderpal Narang 2005 "We present DITN, a new method of parallel querying based on dynamic outsourcing of join processing tasks to non-dedicated, heterogeneous computers. In DITN, partitioning is not the means of parallelism. Data layout decisions are taken outside the scope of the DBMS, and handled within the storage software; query processors see a "Data In The Network" image. This allows gradual scaleout as the workload grows, by using non-dedicated computers. A typical operator in a parallel query plan is Exchange [7]. We argue that Exchange is unsuitable for non-dedicated machines because it poorly addresses node heterogeneity, and is vulnerable to failures or load spikes during query execution. DITN uses an alternate intra-fragment parallelism where each node executes an independent select-project-join-aggregate-group-by block, with no tuple exchange between nodes. This method cleanly handles heterogeneous nodes, and adapts well during execution to node failures or load spikes. Initial experiments suggest that DITN performs competitively with a traditional configuration of dedicated machines and well-partitioned data for up to 10 processors at least. At the same time, DITN gives significant flexibility in terms of gradual scaleout and handling of heterogeneity, load bursts, and failures." VLDB A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing.
Slobodan Rasetic,Jörg Sander,James Elding,Mario A. Nascimento 2005 This paper addresses the problem of splitting trajectories optimally for the purpose of efficiently supporting spatio-temporal range queries using index structures (e.g., R-trees) that use minimum bounding hyper-rectangles as trajectory approximations. We derive a formal cost model for estimating the number of I/Os required to evaluate a spatio-temporal range query with respect to a given query size and an arbitrary split of a trajectory. Based on the proposed model, we introduce a dynamic programming algorithm for splitting a set of trajectories that minimizes the number of expected disk I/Os with respect to an average query size. In addition, we develop a linear time, near optimal solution for this problem to be used in a dynamic case where trajectory points are continuously updated. Our experimental evaluation confirms the effectiveness and efficiency of our proposed splitting policies when embedded into an R-tree structure. VLDB Analyzing Plan Diagrams of Database Query Optimizers. Naveen Reddy,Jayant R. Haritsa 2005 "A "plan diagram" is a pictorial enumeration of the execution plan choices of a database query optimizer over the relational selectivity space. In this paper, we present and analyze representative plan diagrams on a suite of popular commercial query optimizers for queries based on the TPC-H benchmark. These diagrams, which often appear similar to cubist paintings, provide a variety of interesting insights, including that current optimizers make extremely fine-grained plan choices, which may often be supplanted by less efficient options without substantively affecting the quality; that the plan optimality regions may have highly intricate patterns and irregular boundaries, indicating strongly non-linear cost models; that non-monotonic cost behavior exists where increasing result cardinalities decrease the estimated cost; and, that the basic assumptions underlying the research literature on parametric query optimization often do not hold in practice." VLDB Data Sharing in the Hyperion Peer Database System. Patricia Rodríguez-Gianolli,Maddalena Garzetti,Lei Jiang,Anastasios Kementsietsidis,Iluju Kiringa,Mehedi Masud,Renée J. Miller,John Mylopoulos 2005 Data Sharing in the Hyperion Peer Database System. VLDB Recovery Principles in MySQL Cluster 5.1. Mikael Ronström,Jonas Oreland 2005 Recovery Principles in MySQL Cluster 5.1. VLDB A Faceted Query Engine Applied to Archaeology. Kenneth A. Ross,Angel Janevski,Julia Stoyanovich 2005 In this demonstration, we describe a system for storing and querying faceted hierarchies. We have developed a general faceted domain model and a query language for hierarchically classified data. We present here the use of our system on two real archaeological datasets containing thousands of artifacts. Our system is a sharable, evolvable resource that can provide global access to sizeable datasets in queriable format, and can serve as a valuable tool for data analysis and research in many application domains. VLDB General Purpose Database Summarization. Régis Saint-Paul,Guillaume Raschia,Noureddine Mouaddib 2005 In this paper, a message-oriented architecture for large database summarization is presented. The summarization system takes a database table as input and produces a reduced version of this table through both a rewriting and a generalization process.
The resulting table provides tuples that are less precise than the originals yet very informative about the actual content of the database. This reduced form can be used as input for advanced data mining processes as well as some specific applications presented in other works. We describe the incremental maintenance of the summarized table, the system's capability to directly deal with XML database systems, and finally its scalability, which allows it to handle very large datasets of a million records. VLDB Tuning Schema Matching Software using Synthetic Scenarios. Mayssam Sayyadian,Yoonkyong Lee,AnHai Doan,Arnon Rosenthal 2005 "Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right component to be executed and correctly adjust their numerous "knobs" (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. eTuner produced tuned matching systems that achieve higher accuracy than the same systems tuned with currently available methods, at virtually no cost to the domain user." VLDB Robust Real-time Query Processing with QStream. Sven Schmidt,Thomas Legler,Sebastian Schär,Wolfgang Lehner 2005 Processing data streams with Quality-of-Service (QoS) guarantees is an emerging area in existing streaming applications. Although it is possible to negotiate the result quality and to reserve the required processing resources in advance, it remains a challenge to adapt the DSMS to data stream characteristics which are not known in advance or are difficult to obtain. Within this paper we present the second generation of our QStream DSMS which addresses the above challenge by using a real-time capable operating system environment for resource reservation and by applying an adaptation mechanism if the data stream characteristics change spontaneously. VLDB Client Assignment in Content Dissemination Networks for Dynamic Data. Shetal Shah,Krithi Ramamritham,Chinya V. Ravishankar 2005 Consider a content distribution network consisting of a set of sources, repositories and clients where the sources and the repositories cooperate with each other for efficient dissemination of dynamic data. In this system, necessary changes are pushed from sources to repositories and from repositories to clients so that they are automatically informed about the changes of interest. Clients and repositories associate coherence requirements with a data item d, denoting the maximum permissible deviation of the value of d known to them from the value at the source.
Given a list of data items served by each repository and a set of requests, we address the following problem: How do we assign clients to the repositories, so that the fidelity, that is, the degree to which client coherence requirements are met, is maximized? In this paper, we first prove that the client assignment problem is NP-Hard. Given the closeness of the client-repository assignment problem to the matching problem in combinatorial optimization, we have tailored and studied two available solutions to the matching problem from the literature: (i) max-flow min-cost and (ii) stable-marriages. Our empirical results using real-world dynamic data show that the presence of coherence requirements adds a new dimension to the client-repository assignment problem. An interesting result is that in update-intensive situations a better fidelity can be delivered to the clients by attempting to deliver data to some of the clients at a coherence lower than what they desire. A consequence of this observation is the necessity for quick adaptation of the delivered (vs. desired) data coherence with respect to the changes in the dynamics of the system. We develop techniques for such adaptation and show their impressive performance. VLDB Query Execution Assurance for Outsourced Databases. Radu Sion 2005 In this paper we propose and analyze a method for proofs of actual query execution in an outsourced database framework, in which a client outsources its data management needs to a specialized provider. The solution is not limited to simple selection predicate queries but handles arbitrary query types. While this work focuses mainly on read-only, compute-intensive (e.g. data-mining) queries, it also provides preliminary mechanisms for handling data updates (at additional costs). We introduce query execution proofs; for each executed batch of queries the database service provider is required to provide a strong cryptographic proof that provides assurance that the queries were actually executed correctly over their entire target data set. We implement a proof of concept and present experimental results in a real-world data mining application, proving the deployment feasibility of our solution. We analyze the solution and show that its overheads are reasonable and are far outweighed by the added security benefits. For example, an assurance level of over 95% can be achieved with less than 25% execution time overhead. VLDB C-Store: A Column-oriented DBMS. "Michael Stonebraker,Daniel J. Abadi,Adam Batkin,Xuedong Chen,Mitch Cherniack,Miguel Ferreira,Edmond Lau,Amerson Lin,Samuel Madden,Elizabeth J. O'Neil,Patrick E. O'Neil,Alex Rasin,Nga Tran,Stanley B. Zdonik" 2005 This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures. We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging.
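To make the column-versus-row storage contrast in the preceding abstract concrete, here is a minimal sketch of a column-oriented table scan. It is not C-Store's actual storage engine; the class name, the in-memory layout, and the absence of compression, projections, and indexes are simplifying assumptions.

class ColumnTable:
    # Each column is stored as a separate, positionally aligned list,
    # so a scan touches only the columns a query actually needs.
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def scan(self, needed, equals):
        # Yield tuples of the `needed` columns for rows where every
        # column listed in `equals` matches the given value.
        n_rows = len(next(iter(self.columns.values())))
        for row in range(n_rows):
            if all(self.columns[c][row] == v for c, v in equals.items()):
                yield tuple(self.columns[c][row] for c in needed)

# Example: SELECT orderkey, price FROM orders WHERE status = 'O'
orders = ColumnTable({
    "orderkey": [1, 2, 3, 4],
    "status":   ["F", "O", "F", "O"],
    "price":    [100.0, 250.0, 80.0, 40.0],
})
print(list(orders.scan(["orderkey", "price"], {"status": "O"})))

A row-oriented layout would read entire tuples even for this two-column query; reading columns independently is what makes a read-optimized, scan-heavy design of the kind the abstract describes attractive.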
VLDB Semantic Query Optimization for XQuery over XML Streams. Hong Su,Elke A. Rundensteiner,Murali Mani 2005 We study XML stream-specific schema-based optimization. We assume a widely-adopted automata-based execution model for XQuery evaluation. Criteria are established regarding what schema constraints are useful to a particular query. How to apply multiple optimization techniques on an XQuery is then addressed. Finally we present how to correctly and efficiently execute a plan enhanced with our SQO techniques. Our experimentation on both real and synthetic data illustrates that these techniques bring significant performance improvement with little overhead. VLDB Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions. Yufei Tao,Reynold Cheng,Xiaokui Xiao,Wang Kay Ngai,Ben Kao,Sunil Prabhakar 2005 "In an "uncertain database", an object o is associated with a multi-dimensional probability density function (pdf), which describes the likelihood that o appears at each position in the data space. A fundamental operation is the "probabilistic range search" which, given a value pq and a rectangular area rq, retrieves the objects that appear in rq with probabilities at least pq. In this paper, we propose the U-tree, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data. The new structure is fully dynamic (i.e., objects can be incrementally inserted/deleted in any order), and does not place any constraints on the data pdfs. We verify the query and update efficiency of U-trees with extensive experiments." VLDB An Efficient and Versatile Query Engine for TopX Search. Martin Theobald,Ralf Schenkel,Gerhard Weikum 2005 This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but nonschematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The difficulties in applying the existing top-k algorithms to XML data lie in 1) the need to consider scores for XML elements while aggregating them at the document level, 2) the combination of vague content conditions with XML path conditions, 3) the need to relax query conditions if too few results satisfy all conditions, and 4) the selectivity estimation for both content and structure conditions and their impact on evaluation strategies. TopX addresses these issues by precomputing score and path information in an appropriately designed index structure, by largely avoiding or postponing the evaluation of expensive path conditions so as to preserve the sequential access pattern on index lists, and by selectively scheduling random accesses when they are cost-beneficial. In addition, TopX can compute approximate top-k results using probabilistic score estimators, thus speeding up queries with a small and controllable loss in retrieval precision. VLDB Querying Business Processes with BP-QL. Catriel Beeri,Anat Eyal,Simon Kamenkovich,Tova Milo 2005 We present in this paper BP-QL, a novel query language for querying business processes. The BP-QL language is based on an intuitive model of business processes, an abstraction of the emerging BPEL (business process execution language) standard.
It allows users to query business processes visually, in a manner very analogous to how such processes are typically specified, and can be employed in a distributed setting, where process components may be provided by distinct providers. We describe here the query language as well as its underlying formal model. We consider the properties of the various language components and explain how they influenced the language design. In particular we distinguish features that can be efficiently supported, and those that incur a prohibitively high cost, or cannot be computed at all. We also present our implementation which complies with real life standards for business process specifications, XML, and Web services, and is used in the BP-QL system. VLDB Temporal Management of RFID Data. Fusheng Wang,Peiya Liu 2005 RFID technology can be used to significantly improve the efficiency of business processes by providing the capability of automatic identification and data capture. This technology poses many new challenges on current data management systems. RFID data are time-dependent, dynamically changing, in large volumes, and carry implicit semantics. RFID data management systems need to effectively support such large scale temporal data created by RFID applications. These systems need to have an explicit temporal data model for RFID data to support tracking and monitoring queries. In addition, they need to have an automatic method to transform the primitive observations from RFID readers into derived data used in RFID-enabled applications. In this paper, we present an integrated RFID data management system -- Siemens RFID Middleware -- based on an expressive temporal data model for RFID data. Our system enables semantic RFID data filtering and automatic data transformation based on declarative rules, provides powerful query support of RFID object tracking and monitoring, and can be adapted to different RFID-enabled applications. VLDB Efficient Processing of XML Path Queries Using the Disk-based F&B Index. Wei Wang,Hongzhi Wang,Hongjun Lu,Haifeng Jiang,Xuemin Lin,Jianzhong Li 2005 Efficient Processing of XML Path Queries Using the Disk-based F&B Index. VLDB Query By Excel. Andrew Witkowski,Srikanth Bellamkonda,Tolga Bozkaya,Aman Naimat,Lei Sheng,Sankar Subramanian,Allison Waingold 2005 Spreadsheets, and MS Excel in particular, are established analysis tools. They offer an attractive user interface, provide an easy-to-use computational model, and offer substantial interactivity for what-if analysis. However, as opposed to an RDBMS, spreadsheets do not provide a central repository; hence they do not provide shareability of models built in Excel, and they lead to a proliferation of multiple copies of the same spreadsheet. Furthermore, spreadsheets do not offer scalable computation, for example, they lack parallelization. To address the shareability and scalability problems, we propose to automatically translate Excel computation into SQL. An analyst can import the data from a relational system, define computation over it using familiar Excel formulas and then translate and store it as a relational SQL view over the imported data. The Excel computation is then performed by the relational system. To edit the model, the analyst can bring the model back to Excel, modify it in Excel and store it back as an SQL View. We refer to this system as Query by Excel, QBX in short. VLDB On Computing Top-t Most Influential Spatial Sites.
Tian Xia,Donghui Zhang,Evangelos Kanoulas,Yang Du 2005 Given a set O of weighted objects, a set S of sites, and a query site s, the bichromatic RNN query computes the influence set of s, or the set of objects in O that consider s as the nearest site among all sites in S. The influence of a site s can be defined as the total weight of its RNNs. This paper addresses the new and interesting problem of finding the top-t most influential sites from S, inside a given spatial region Q. A straightforward approach is to find the sites in Q, and compute the RNNs of every such site. This approach is not efficient for two reasons. First, all sites in Q need to be identified whatsoever, and the number may be large. Second, both the site R-tree and the object R-tree need to be browsed a large number of times. For each site in Q, the R-tree of sites is browsed to identify the influence region -- a polygonal region that may contain RNNs, and then the R-tree of objects is browsed to find the RNN set. This paper proposes an algorithm called TopInfluential-Sites, which finds the top-t most influential sites by browsing both trees once systematically. Novel pruning techniques are provided, based on a new metric called minExistDNN. There is no need to compute the influence for all sites in Q, or even to visit all sites in Q. Experimental results verify that our proposed method outperforms the straightforward approach. VLDB Mining Compressed Frequent-Pattern Sets. Dong Xin,Jiawei Han,Xifeng Yan,Hong Cheng 2005 A major challenge in frequent-pattern mining is the sheer size of its mining results. In many cases, a high min_sup threshold may discover only commonsense patterns but a low one may generate an explosive number of output patterns, which severely restricts its usage.In this paper, we study the problem of compressing frequent-pattern sets. Typically, frequent patterns can be clustered with a tightness measure δ (called δ-cluster), and a representative pattern can be selected for each cluster. Unfortunately, finding a minimum set of representative patterns is NP-Hard. We develop two greedy methods, RPglobal and RPlocal. The former has the guaranteed compression bound but higher computational complexity. The latter sacrifices the theoretical bounds but is far more efficient. Our performance study shows that the compression quality using RPlocal is very close to RPglobal, and both can reduce the number of closed frequent patterns by almost two orders of magnitude. Furthermore, RPlocal mines even faster than FPClose[11], a very fast closed frequent-pattern mining method. We also show that RPglobal and RPlocal can be combined together to balance the quality and efficiency. VLDB Checking for k-Anonymity Violation by Views. Chao Yao,Xiaoyang Sean Wang,Sushil Jajodia 2005 When a private relational table is published using views, secrecy or privacy may be violated. This paper uses a formally-defined notion of k-anonymity to measure disclosure by views, where k >1 is a positive integer. Intuitively, violation of k-anonymity occurs when a particular attribute value of an entity can be determined to be among less than k possibilities by using the views together with the schema information of the private table. The paper shows that, in general, whether a set of views violates k-anonymity is a computationally hard problem. Subcases are identified and their computational complexities discussed. Especially interesting are those subcases that yield polynomial checking algorithms (in the number of tuples in the views). 
The paper also provides an efficient conservative algorithm that checks for necessary conditions for k-anonymity violation. VLDB Improving Database Performance on Simultaneous Multithreading Processors. Jingren Zhou,John Cieslewicz,Kenneth A. Ross,Mihir Shah 2005 Simultaneous multithreading (SMT) allows multiple threads to supply instructions to the instruction pipeline of a superscalar processor. Because threads share processor resources, an SMT system is inherently different from a multiprocessor system and, therefore, utilizing multiple threads on an SMT processor creates new challenges for database implementers. We investigate three thread-based techniques to exploit SMT architectures on memory-resident data. First, we consider running independent operations in separate threads, a technique applied to conventional multi-processor systems. Second, we describe a novel implementation strategy in which individual operators are implemented in a multi-threaded fashion. Finally, we introduce a new data structure called a work-ahead set that allows us to use one of the threads to aggressively preload data into the cache. We evaluate each method with respect to its performance, implementation complexity, and other measures. We also provide guidance regarding when and how to best utilize the various threading techniques. Our experimental results show that by taking advantage of SMT technology we achieve a 30% to 70% improvement in throughput over single-threaded implementations on in-memory database operations. VLDB WmXML: A System for Watermarking XML Data. Xuan Zhou,HweeHwa Pang,Kian-Lee Tan,Dhruv Mangla 2005 As an increasing amount of data is published in the form of XML, copyright protection of XML data is becoming an important requirement for many applications. While digital watermarking is a widely used measure to protect digital data from copyright offences, the complex and flexible construction of XML data poses a number of challenges to digital watermarking, such as re-organization and alteration attacks. To overcome these challenges, the watermarking scheme has to be based on the usability of data and the underlying semantics like key attributes and functional dependencies. In this paper, we describe WmXML, a system for watermarking XML documents. It generates queries from essential semantics to identify the available watermarking bandwidth in XML documents, and integrates a query rewriting technique to overcome the threats from data re-organization and alteration. In the demonstration, we will showcase the use of WmXML and its effectiveness in countering various attacks. VLDB Efficient Computation of the Skyline Cube. Yidong Yuan,Xuemin Lin,Qing Liu,Wei Wang,Jeffrey Xu Yu,Qing Zhang 2005 "Skyline has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user-preference queries. In this paper, we consider the problem of efficiently computing a SKYCUBE, which consists of skylines of all possible non-empty subsets of a given set of dimensions. While existing skyline computation algorithms can be immediately extended to computing each skyline query independently, such "shared-nothing" algorithms are inefficient. We develop several computation sharing strategies based on effectively identifying the computation dependencies among multiple related skyline queries. Based on these sharing strategies, two novel algorithms, Bottom-Up and Top-Down, are proposed to compute SKYCUBE efficiently.
Finally, our extensive performance evaluations confirm the effectiveness of the sharing strategies. It is shown that the new algorithms significantly outperform the naïve ones." VLDB Distributed Privacy Preserving Information Sharing. Nan Zhang,Wei Zhao 2005 In this paper, we address issues related to sharing information in a distributed system consisting of autonomous entities, each of which holds a private database. Semi-honest behavior has been widely adopted as the model for adversarial threats. However, it substantially underestimates the capability of adversaries in reality. In this paper, we consider a threat space containing more powerful adversaries that includes not only semi-honest but also malicious adversaries. In particular, we classify malicious adversaries into two widely existing subclasses, called weakly malicious and strongly malicious adversaries, respectively. We define a measure of privacy leakage for information sharing systems and propose protocols that can effectively and efficiently protect privacy against different kinds of malicious adversaries. VLDB AReNA: Adaptive Distributed Catalog Infrastructure Based On Relevance Networks. Vladimir Zadorozhny,Avigdor Gal,Louiqa Raschid,Qiang Ye 2005 Wide area applications (WAAs) utilize a WAN infrastructure (e.g., the Internet) to connect a federation of hundreds of servers with tens of thousands of clients. Earlier generations of WAAs relied on Web-accessible sources and the http protocol for data delivery. Recent developments such as the PlanetLab [8] testbed are now demonstrating an emerging class of data- and compute-intensive wide area applications. VLDB Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly. Zhen Zhang,Bin He,Kevin Chen-Chuan Chang 2005 "The Web has been rapidly ""deepened"" by myriad searchable databases online, where data are hidden behind query forms. Helping users query alternative ""deep Web"" sources in the same domain (e.g., Books, Airfares) is an important task with broad applications. As a core component of those applications, dynamic query translation (i.e., translating a user's query across dynamically selected sources) has not been extensively explored. While existing works focus on isolated subproblems (e.g., schema matching, query rewriting), we target building a complete query translator and thus face new challenges: 1) To complete the translator, we need to solve the predicate mapping problem (i.e., map a source predicate to target predicates), which is largely unexplored by existing works; 2) To satisfy our application requirements, we need to design a customizable system architecture to assemble various components addressing respective subproblems (i.e., schema matching, predicate mapping, query rewriting). Tackling these challenges, we develop a light-weight domain-based form assistant, which can generally handle alternative sources in the same domain and is easily customizable to new domains. Our experiments show the effectiveness of our form assistant in translating queries for real Web sources." VLDB Statistical Learning Techniques for Costing XML Queries. Ning Zhang,Peter J. Haas,Vanja Josifovski,Guy M. Lohman,Chun Zhang 2005 "Developing cost models for query optimization is significantly harder for XML queries than for traditional relational queries. The reason is that XML query operators are much more complex than relational operators such as table scans and joins.
In this paper, we propose a new approach, called COMET, to modeling the cost of XML operators; to our knowledge, COMET is the first method ever proposed for addressing the XML query costing problem. As in relational cost estimation, COMET exploits a set of system catalog statistics that summarizes the XML data; the set of ""simple path"" statistics that we propose is new, and is well suited to the XML setting. Unlike the traditional approach, COMET uses a new statistical learning technique called ""transform regression"" instead of detailed analytical models to predict the overall cost. Besides rendering the cost estimation problem tractable for XML queries, COMET has the further advantage of enabling the query optimizer to be self-tuning, automatically adapting to changes over time in the query workload and in the system environment. We demonstrate COMET's feasibility by developing a cost model for the recently proposed XNAV navigational operator. Empirical studies with synthetic, benchmark, and real-world data sets show that COMET can quickly obtain accurate cost estimates for a variety of XML queries and data sets." SIGMOD Record An apples-to-apples comparison of two database journals. Philip A. Bernstein,Elisa Bertino,Andreas Heuer,Christian S. Jensen,Holger Meyer,M. Tamer Özsu,Richard T. Snodgrass,Kyu-Young Whang 2005 This paper defines a collection of metrics on manuscript reviewing and presents historical data for ACM Transactions on Database Systems and The VLDB Journal. SIGMOD Record In memory of Seymour Ginsburg 1928 - 2004. Serge Abiteboul,Richard Hull,Victor Vianu,Sheila A. Greibach,Michael A. Harrison,Ellis Horowitz,Daniel J. Rosenkrantz,Jeffrey D. Ullman,Moshe Y. Vardi 2005 In memory of Seymour Ginsburg 1928 - 2004. SIGMOD Record Exchange, integration, and consistency of data: report on the ARISE/NISR workshop. Leopoldo E. Bertossi,Jan Chomicki,Parke Godfrey,Phokion G. Kolaitis,Alex Thomo,Calisto Zuzarte 2005 "The ""ARISE/NISR Workshop on Exchange and Integration of Data"" was held at the IBM Center for Advanced Studies, Toronto Lab., between October 7-9, 2004." SIGMOD Record Information source selection for resource constrained environments. Demet Aksoy 2005 "Distributed information retrieval has pressing scalability concerns due to the growing number of independent sources of on-line data and the emerging applications. A promising solution to distributed retrieval is metasearching, which dispatches a user's query to multiple sources and gathers the results into a single result set. An important component of metasearching is selecting the set of information sources most likely to provide relevant documents. Recent research has focused on how to obtain statistics for the selection task. In this paper we discuss different information source selection approaches and their applicability for resource-constrained sensor network applications." SIGMOD Record Report on the DB/IR panel at SIGMOD 2005. Sihem Amer-Yahia,Pat Case,Thomas Rölleke,Jayavel Shanmugasundaram,Gerhard Weikum 2005 "This paper summarizes the salient aspects of the SIGMOD 2005 panel on ""Databases and Information Retrieval: Rethinking the Great Divide"". The goal of the panel was to discuss whether we should rethink data management systems architectures to truly merge Database (DB) and Information Retrieval (IR) technologies. The panel had very high attendance and generated lively discussions." SIGMOD Record Report from the first international workshop on computer vision meets databases (CVDB 2004). 
Laurent Amsaleg,Björn Þór Jónsson,Vincent Oria 2005 This report summarizes the presentations and discussions of the First International Workshop on Computer Vision meets Databases, or CVDB 2004, which was held in Paris, France, on June 13, 2004. The workshop was co-located with the 2004 ACM SIGMOD/PODS conferences and was attended by forty-two participants from all over the world. SIGMOD Record Artemis message exchange framework: semantic interoperability of exchanged messages in the healthcare domain. Veli Bicer,Gokce Laleci,Asuman Dogac,Yildiray Kabak 2005 One of the most challenging problems in the healthcare domain is providing interoperability among healthcare information systems. In order to address this problem, we propose the semantic mediation of exchanged messages. Given that most of the messages exchanged in the healthcare domain are in EDI (Electronic Data Interchange) or XML format, we describe how to transform these messages into OWL (Web Ontology Language) ontology instances. The OWL message instances are then mediated through an ontology mapping tool that we developed, namely, OWLmt. OWLmt uses OWL-QL engine which enables the mapping tool to reason over the source ontology instances while generating the target ontology instances according to the mapping patterns defined through a GUI.Through a prototype implementation, we demonstrate how to mediate between HL7 Version 2 and HL7 Version 3 messages. However, the framework proposed is generic enough to mediate between any incompatible healthcare standards that are currently in use. SIGMOD Record Report on MobiDE 2003: the 3rd international ACM Workshop on Data Engineering for Wireless and Mobile Access. Sujata Banerjee,Mitch Cherniack,Panos K. Chrysanthis,Vijay Kumar,Alexandros Labrinidis 2005 The 3rd International ACM Workshop on Data Engineering for Wireless and Mobile Access (MobiDE 2003 for short) took place on September 19, 2003 at the Westin Horton Plaza Hotel in San Diego, California in conjunction with MobiCom 2003. The MobiDE workshops serve as a bridge between the data management and network research communities, and have a tradition of presenting innovations on mobile as well as wireless data engineering issues (such as those found in sensor networks). This workshop was the third in the MobiDE series, MobiDE 1999 having taken place in Seattle in conjunction with MobiCom 1999, and MobiDE 2001 having taken place in Santa Barbara in conjunction with SIGMOD 2001. SIGMOD Record Analytical processing of XML documents: opportunities and challenges. Rajesh Bordawekar,Christian A. Lang 2005 "Online Analytical Processing (OLAP) has been a valuable tool for analyzing trends in business information. While the multi-dimensional cube model used by OLAP is ideal for analyzing structured business data, it is not suitable for representing and analyzing complex semi-structured data, such as, XML documents. Need for analyzing XML documents is gaining urgency as XML has become the language of choice for data representation across a wide range of application domains. This paper describes a proposal for analyzing XML documents using the abstract XML tree model. We argue that OLAP's multi-dimensional aggregation operators can not express structurally complex analytical operations on XML documents. Hence, we outline new extensions to XQuery for supporting such complex analytical operations. Finally, we discuss various challenges in implementing XML analysis in a real system." 
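The entry above on analytical processing of XML documents (Bordawekar and Lang) argues that OLAP's flat, multi-dimensional cube model cannot express aggregations that depend on document structure. The following Python fragment is a minimal sketch of such a structure-aware roll-up, grouping numeric leaf values by their root-to-leaf element path; the sample document and all element names are invented for illustration and are not taken from the paper.

    # Illustration only: aggregate numeric leaf values of an XML tree,
    # grouped by the root-to-leaf element path. A fixed-dimension OLAP
    # cube cannot express this kind of structure-dependent grouping.
    from collections import defaultdict
    from xml.etree import ElementTree as ET

    doc = ET.fromstring("""
    <orders>
      <region name="east">
        <order><amount>10</amount></order>
        <order><amount>15</amount></order>
      </region>
      <region name="west">
        <order><amount>7</amount></order>
      </region>
    </orders>
    """)

    def rollup_by_path(root):
        totals = defaultdict(float)
        def walk(elem, path):
            path = path + "/" + elem.tag
            text = (elem.text or "").strip()
            if len(elem) == 0 and text:        # a leaf element with a value
                try:
                    totals[path] += float(text)
                except ValueError:
                    pass                       # ignore non-numeric leaves
            for child in elem:
                walk(child, path)
        walk(root, "")
        return dict(totals)

    print(rollup_by_path(doc))
    # {'/orders/region/order/amount': 32.0}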
SIGMOD Record Data management research at Technische Universität Darmstadt. Alejandro P. Buchmann,Mariano Cilia 2005 Data management research at Technische Universität Darmstadt. SIGMOD Record Containment of aggregate queries. Sara Cohen 2005 It is now common for databases to contain many gigabytes, or even many terabytes, of data. Scientific experiments in areas such as high energy physics produce data sets of enormous size, while in the business sector the emergence of decision-support systems and data warehouses has led organizations to build up gigantic collections of data. Aggregate queries allow one to retrieve concise information from such a database, since they can cover many data items while returning a small result. OLAP queries, used extensively in data warehousing, are based almost entirely on aggregation [4, 16]. Aggregate queries have also been studied in a variety of settings beyond relational databases, such as mobile computing [1], global information systems [21], stream data analysis [12], sensor networks [22] and constraint databases [2]. SIGMOD Record Data management research at the Middle East Technical University. Nihan Kesim Cicekli,Ahmet Cosar,Asuman Dogac,Faruk Polat,Pinar Senkul,Ismail Hakki Toroslu,Adnan Yazici 2005 The Middle East Technical University (METU) (http://www.metu.edu.tr) is the leading technical university in Turkey. The department of Computer Engineering (http://www.ceng.metu.edu.tr) has twenty seven faculty members with PhDs, 550 undergraduate students and 165 graduate-students. The major research funding sources include the Scientific and Technical Research Council of Turkey (TÜBÍTAK), the European Commission, and the internal research funds of METU. Data management research conducted in the department is summarized in this article. SIGMOD Record A tribute to Professor Hongjun Lu. Michael J. Carey,Jiawei Han 2005 Dr. Hongjun Lu, Professor of Computer Science, Hong Kong University of Science and Technology, lost his brave fight against cancer and left us in the evening of March 3, 2005. The world lost a dedicated and brilliant computer scientist. The database research community lost a respected and prolific researcher, an effortless organizer and promoter of database research in the world, a friend cherished by many colleagues and researchers, and a wonderful teacher who deeply affected and was revered by his students. SIGMOD Record Stonebraker receives IEEE John von Neumann Medal. David J. DeWitt,Michael J. Carey,Joseph M. Hellerstein 2005 "In December 2004, Michael Stonebraker was selected to receive the 2005 IEEE John von Neumann Medal for his ""contributions to the design, implementation, and commercialization of relational and object-relational database systems."" Mike is the first person from the database field selected to receive this award. He joins an illustrious group of former recipients, including Barbara Liskov (2004), Alfred Aho (2003), Ole-Johan Dahl and Kristen Nygaard (2002), Butler Lampson (2001), John Hennessy and David Patterson (2000), Douglas Engelbart (1999), Ivan Sutherland (1998), Maurice Wilkes (1997), Carver Mead (1996), Donald Knuth (1995), John Cocke (1994), Fred Brooks (1993), and Gordon Bell (1992)." SIGMOD Record XQuery 1.0 is nearing completion. Andrew Eisenberg,Jim Melton 2005 XQuery is a query language designed for querying real and virtual XML documents and collections of these documents. Its development began in the second half of 1999. We provided an early look at XQuery in Dec. 2002 [1]. 
XQuery 1.0 is now approaching its publication as a W3C Recommendation, and we would like to update you on its progress. We can speak to this area with even more authority than we did last time, as we both became co-chairs of the W3C XML Query Working Group [2] in summer 2004. SIGMOD Record Candidates for the upcoming ACM SIGMOD elections. Michael J. Franklin 2005 A clear sign of a healthy organization is the willingness of its members to volunteer their time to support and guide it. SIGMOD is particularly fortunate in this regard. As evidence is the slate of candidates listed on the following pages. These six people have agreed to run for positions as SIGMOD officers, and if elected, are committed to overseeing and improving the wide-ranging activities of SIGMOD and continuing the progress that has made us one of the leading SIGs in ACM. SIGMOD Record Tips on giving a good demo. Mary F. Fernández 2005 "For the first time this year, a ""Best Demonstrations"" session was included in the SIGMOD program. The first two days of the program included 24 demonstrations, each of which was presented during two of six interactive demo sessions. During the first two days, panels of three judges visited each demo group, each of whom was allotted 15 minutes to present their system to the judges. The friendly competition made for very exciting and noisy demo sessions!" SIGMOD Record On six degrees of separation in DBLP-DB and more. Ergin Elmacioglu,Dongwon Lee 2005 "An extensive bibliometric study on the db community using the collaboration network constructed from DBLP data is presented. Among many, we have found that (1) the average distance of all db scholars in the network has been stabilized to about 6 for the last 15 years, coinciding with the so-called six degrees of separation phenomenon; (2) In sync with Lotka's law on the frequency of publications, the db community also shows that a few number of scholars publish a large number of papers, while the majority of authors publish a small number of papers (i.e., following the power-law with exponent about -2); and (3) with the increasing demand to publish more, scholars collaborate more often than before (i.e., 3.93 collaborators per scholar and with steadily increasing clustering coefficients)." SIGMOD Record From databases to dataspaces: a new abstraction for information management. Michael J. Franklin,Alon Y. Halevy,David Maier 2005 "The development of relational database management systems served to focus the data management community for decades, with spectacular results. In recent years, however, the rapidly-expanding demands of ""data everywhere"" have led to a field comprised of interesting and productive efforts, but without a central focus or coordinated agenda. The most acute information management challenges today stem from organizations (e.g., enterprises, government agencies, libraries, ""smart"" homes) relying on a large number of diverse, interrelated data sources, but having no way to manage their dataspaces in a convenient, integrated, or principled fashion. This paper proposes dataspaces and their support systems as a new agenda for data management. This agenda encompasses much of the work going on in data management today, while posing additional research objectives." SIGMOD Record A snapshot of public web services. Jianchun Fan,Subbarao Kambhampati 2005 Web Service Technology has been developing rapidly as it provides a flexible application-to-application interaction mechanism. 
Several ongoing research efforts focus on various aspects of web service technology, including the modeling, specification, discovery, composition and verification of web services. The approaches advocated are often conflicting---based as they are on differing expectations about the current status of web services as well as differing models of their future evolution. One way of deciding the relative relevance of the various research directions is to look at their applicability to the currently available web services. To this end, we took a snapshot of the currently publicly available web services. Our aim is to get an idea of the number, type, complexity and composability of these web services and see if this analysis provides useful information about the near-term fruitful research directions. SIGMOD Record Mining data streams: a review. Mohamed Medhat Gaber,Arkady B. Zaslavsky,Shonali Krishnaswamy 2005 The recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields. These measurements are generated continuously and at very high, fluctuating data rates. Examples include sensor networks, web logs, and computer network traffic. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns from never-ending streams of information. Research in data stream mining has attracted much attention due to the importance of its applications and the increasing generation of streaming information. Applications of data stream analysis can vary from critical scientific and astronomical applications to important business and financial ones. Algorithms, systems and frameworks that address streaming challenges have been developed over the past three years. In this review paper, we present the state of the art in this growing, vital field. SIGMOD Record XML database support for distributed execution of data-intensive scientific workflows. Shannon Hastings,Matheus Ribeiro,Stephen Langella,Scott Oster,Ümit V. Çatalyürek,Tony Pan,Kun Huang,Renato Ferreira,Joel H. Saltz,Tahsin M. Kurç 2005 In this paper we look at the application of XML data management support in scientific data analysis workflows. We describe a software infrastructure that aims to address issues associated with metadata management, data storage and management, and execution of data analysis workflows on distributed storage and compute platforms. This system couples a distributed, filter-stream based dataflow engine with a distributed XML-based data and metadata management system. We present experimental results from a biomedical image analysis use case that involves processing of digitized microscopy images for feature segmentation. SIGMOD Record Report on the ninth conference on Software Engineering and Databases (JISBD 2004). Juan Hernández,Ernesto Pimentel,José Ambrosio Toval Álvarez 2005 Report on the ninth conference on Software Engineering and Databases (JISBD 2004). SIGMOD Record LiXQuery: a formal foundation for XQuery research. Jan Hidders,Philippe Michiels,Jan Paredaens,Roel Vercammen 2005 XQuery is expected to become the standard query language for XML documents. However, the complete XQuery syntax and semantics seem too complicated for research and educational purposes. By defining a concise, backwards-compatible subset of XQuery with a complete formal description, we provide a practical foundation for XQuery research.
We pay special attention to usability by supporting the most typical XQuery expressions. SIGMOD Record Tools for composite web services: a short overview. Richard Hull,Jianwen Su 2005 "Web services technologies enable flexible and dynamic interoperation of autonomous software and information systems. A central challenge is the development of modeling techniques and tools for enabling the (semi-)automatic composition and analysis of these services, taking into account their semantic and behavioral properties. This paper presents an overview of the fundamental assumptions and concepts underlying current work on service composition, and provides a sampling of key results in the area. It also provides a brief tour of several composition models including semantic web services, the ""Roman"" model, and the Mealy / conversation model." SIGMOD Record Scientific data management in the coming decade. Jim Gray,David T. Liu,María A. Nieto-Santisteban,Alexander S. Szalay,David J. DeWitt,Gerd Heber 2005 Scientific instruments and computer simulations are creating vast data stores that require new scientific methods to analyze and organize the data. Data volumes are approximately doubling each year. Since these new instruments have extraordinary precision, the data quality is also rapidly improving. Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements. SIGMOD Record RedBD: the database research community in Spain. Arantza Illarramendi,Esperanza Marcos,Carmen Costilla 2005 During the last decade, the database research community in Spain has grown significantly in the number of groups interested in the area, and especially in the quality of those groups. These database research groups are in general not very large; they usually comprise five or six full-time researchers. The funding for research comes from public bodies such as the Spanish government, the European Union and local governments, and to a lesser extent from industry. Research is carried out mainly at the informatics departments of the universities. As the different groups usually share research topics and also fields of application, in recent years there has been an important movement to join research efforts. SIGMOD Record Research issues in automatic database clustering. Sylvain Guinepain,Le Gruenwald 2005 While a lot of work has been published on clustering of data on storage media, little has been done about automating this process. This is an important area because, with data proliferation, human attention has become a precious and expensive resource. Our goal is to develop an automatic and dynamic database clustering technique that will dynamically re-cluster a database with little intervention from a database administrator (DBA) and maintain an acceptable query response time at all times. In this paper we describe the issues that need to be solved when developing such a technique. SIGMOD Record The atomic manifesto: a story in four quarks. Cliff B. Jones,David B. Lomet,Alexander B. Romanovsky,Gerhard Weikum,Alan Fekete,Marie-Claude Gaudel,Henry F. Korth,Rogério de Lemos,J. Eliot B.
Moss,Ravi Rajwar,Krithi Ramamritham,Brian Randell,Luís Rodrigues 2005 "This paper is based on a five-day workshop on ""Atomicity in System Design and Execution"" that took place in Schloss Dagstuhl in Germany [5] in April 2004 and was attended by 32 people from different scientific communities. The participants included researchers from four areas: database and transaction processing systems; fault tolerance and dependable systems; formal methods for system design and correctness reasoning; and, to a smaller extent, hardware architecture and programming languages." SIGMOD Record No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. Jin Li,David Maier,Kristin Tufte,Vassilis Papadimos,Peter A. Tucker 2005 "Window queries are proving essential to data-stream processing. In this paper, we present an approach for evaluating sliding-window aggregate queries that reduces both space and computation time for query execution. Our approach divides overlapping windows into disjoint panes, computes sub-aggregates over each pane, and ""rolls up"" the pane-aggregates to compute window-aggregates. Our experimental study shows that using panes has significant performance benefits." SIGMOD Record Report on the 19th Brazilian symposium on databases (SBBD 2004). Sérgio Lifschitz,Alberto H. F. Laender 2005 "The Brazilian Symposium on Databases (SBBD) is an annual event promoted by the Brazilian Computer Society (SBC) through its Database Special Committee. In 2004, the 19th edition of SBBD was held in Brasília, Brazil's capital, on 18-20 October, organized by the Computer Science Department of the University of Brasília (UnB). As in previous years, SBBD 2004 received the in-cooperation status from ACM SIGMOD and was partially supported by the VLDB Endowment, thus confirming the recognition by the international community of SBBD as the most important database event in Latin America." SIGMOD Record Peer-to-peer management of XML data: issues and research challenges. Georgia Koloniari,Evaggelia Pitoura 2005 Peer-to-peer (p2p) systems are attracting increasing attention as an efficient means of sharing data among large, diverse and dynamic sets of users. The widespread use of XML as a standard for representing and exchanging data on the Internet suggests using XML for describing data shared in a p2p system. However, sharing XML data imposes new challenges in p2p systems related to supporting advanced querying beyond simple keyword-based retrieval. In this paper, we focus on data management issues for processing XML data in a p2p setting, namely indexing, replication, clustering and query routing and processing. For each of these topics, we present the issues that arise, survey related research and highlight open research problems. SIGMOD Record "Report on the 1st International Symposium on the Applications of Constraint Databases (CDB'04)." Bart Kuijpers,Peter Z. Revesz 2005 "The 1st International Symposium on the Applications of Constraint Databases (CDB'04) was held on June 12-13, 2004, just before the ACM SIGMOD and PODS conferences, in the Amphithéatre de Chimie of the Université Pierre et Marie Curie in Paris, France. We acted as program committee chairs and Irène Guessarian and Patrick Cégielski as local organization chairs." SIGMOD Record "Guest editors' introduction to the special section on scientific workflows." Bertram Ludäscher,Carole A.
Goble 2005 "Business-oriented workflows have been studied since the 70's under various names (office automation, workflow management, business process management) and by different communities, including the database community. Much basic and applied research has been conducted over the years, e.g. theoretical studies of workflow languages and models (based on Petri-nets or process calculi), their properties, transactional behavior, etc." SIGMOD Record Simplifying construction of complex workflows for non-expert users of the Southern California Earthquake Center Community Modeling Environment. Philip Maechling,Hans Chalupsky,Maureen Dougherty,Ewa Deelman,Yolanda Gil,Sridhar Gullapalli,Vipin Gupta,Carl Kesselman,Jihie Kim,Gaurang Mehta,Brian Mendenhall,Thomas A. Russ,Gurmeet Singh,Marc Spraragen,Garrick Staples,Karan Vahi 2005 Workflow systems often present the user with rich interfaces that express all the capabilities and complexities of the application programs and the computing environments that they support. However, non-expert users are better served with simple interfaces that abstract away system complexities and still enable them to construct and execute complex workflows. To explore this idea, we have created a set of tools and interfaces that simplify the construction of workflows. Implemented as part of the Community Modeling Environment developed by the Southern California Earthquake Center, these tools, are integrated into a comprehensive workflow system that supports both domain experts as well as non expert users. SIGMOD Record Semantic characterizations of navigational XPath. Maarten Marx,Maarten de Rijke 2005 We give semantic characterizations of the expressive power of navigational XPath (a.k.a. Core XPath) in terms of first order logic. XPath can be used to specify sets of nodes and sets of paths in an XML document tree. We consider both uses. For sets of nodes, XPath is equally expressive as first order logic in two variables. For paths, XPath can be defined using four simple connectives, which together yield the class of first order definable relations which are safe for bisimulation. Furthermore, we give a characterization of the XPath expressible paths in terms of conjunctive queries. SIGMOD Record An approach for pipelining nested collections in scientific workflows. Timothy M. McPhillips,Shawn Bowers 2005 We describe an approach for pipelining nested data collections in scientific workflows. Our approach logically delimits arbitrarily nested collections of data tokens using special, paired control tokens inserted into token streams, and provides workflow components with high-level operations for managing these collections. Our framework provides new capabilities for: (1) concurrent operation on collections; (2) on-the-fly customization of workflow component behavior; (3) improved handling of exceptions and faults; and (4) transparent passing of provenance and metadata within token streams. We demonstrate our approach using a workflow for inferring phylogenetic trees. We also describe future extensions to support richer typing mechanisms for facilitating sharing and reuse of workflow components between disciplines. This work represents a step towards our larger goal of exploiting collection-oriented dataflow programming as a new paradigm for scientific workflow systems, an approach we believe will significantly reduce the complexity of creating and reusing workflows and workflow components. 
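The pipelining entry above (McPhillips and Bowers) delimits arbitrarily nested data collections in a token stream with paired control tokens. The sketch below is our own minimal reconstruction of that idea in Python, not the authors' framework; the OPEN/CLOSE markers and the sample stream are invented for illustration.

    # Minimal sketch: paired control tokens delimit nested collections
    # inside a flat token stream, so downstream components can rebuild
    # (or operate on) the nesting without special-case wiring.
    OPEN, CLOSE = object(), object()

    def parse_collections(stream):
        """Rebuild the nested structure encoded by OPEN/CLOSE tokens."""
        stack = [[]]
        for tok in stream:
            if tok is OPEN:
                stack.append([])
            elif tok is CLOSE:
                finished = stack.pop()
                stack[-1].append(finished)
            else:
                stack[-1].append(tok)
        return stack[0]

    # A stream encoding the nested collection [[1, 2], [3, [4]]]:
    stream = [OPEN, 1, 2, CLOSE, OPEN, 3, OPEN, 4, CLOSE, CLOSE]
    print(parse_collections(stream))   # [[1, 2], [3, [4]]]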
SIGMOD Record WOODSS and the Web: annotating and reusing scientific workflows. Claudia Bauzer Medeiros,José de Jesús Pérez Alcázar,Luciano A. Digiampietri,Gilberto Zonta Pastorello Jr.,André Santanchè,Ricardo da Silva Torres,Edmundo Roberto Mauro Madeira,Evandro Bacarin 2005 "This paper discusses ongoing research on scientific workflows at the Institute of Computing, University of Campinas (IC - UNICAMP) Brazil. Our projects with bio-scientists have led us to develop a scientific workflow infrastructure named WOODSS. This framework has two main objectives in mind: to help scientists to specify and annotate their models and experiments; and to document collaborative efforts in scientific activities. In both contexts, workflows are annotated and stored in a database. This ""annotated scientific workflow"" database is treated as a repository of (sometimes incomplete) approaches to solving scientific problems. Thus, it serves two purposes: allows comparison of distinct solutions to a problem, and their designs; and provides reusable and executable building blocks to construct new scientific workflows, to meet specific needs. Annotations, moreover, allow further insight into methodology, success rates, underlying hypotheses and other issues in experimental activities.The many research challenges faced by us at the moment include: the extension of this framework to the Web, following Semantic Web standards; providing means of discovering workflow components on the Web for reuse; and taking advantage of planning in Artificial Intelligence to support composition mechanisms. This paper describes our efforts in these directions, tested over two domains - agro-environmental planning and bioinformatics." SIGMOD Record In memoriam Alberto Oscar Mendelzon. Renée J. Miller 2005 Alberto Oscar Mendelzon passed away on June 16, 2005 after a two-year battle with cancer. This tribute to Alberto and his achievements is written in recognition of his great intellect and his generous friendship. Both have influenced and inspired many in the database research community. SIGMOD Record The Indiana Center for Database Systems at Purdue University. Mourad Ouzzani,Walid G. Aref,Elisa Bertino,Ann Christine Catlin,Christopher W. Clifton,Wing-Kai Hon,Ahmed K. Elmagarmid,Arif Ghafoor,Susanne E. Hambrusch,Sunil Prabhakar,Jeffrey Scott Vitter,Xiang Zhang 2005 The Indiana Center for Database Systems (ICDS) at Purdue University has embarked in an ambitious endeavor to become a premiere world-class database research center. This goal is substantiated by the diversity of its research topics, the large and diverse funding base, and the steady publication trend in top conferences and journals. ICDS was founded with an initial grant from the State of Indiana Corporation of Science and Technology in 1990. Since then it has grown to now have 9 faculty members and about 30 total researchers. This report describes the major research projects underway at ICDS as well as efforts to move research toward practice. SIGMOD Record Citation analysis of database publications. Erhard Rahm,Andreas Thor 2005 We analyze citation frequencies for two main database conferences (SIGMOD, VLDB) and three database journals (TODS, VLDB Journal, Sigmod Record) over 10 years. The citation data is obtained by integrating and cleaning data from DBLP and Google Scholar. 
Our analysis considers different comparative metrics per publication venue, in particular the total and average number of citations as well as the impact factor, which has so far only been considered for journals. We also determine the most cited papers, authors, author institutions and their countries. SIGMOD Record Reminiscences on influential papers. Kenneth A. Ross 2005 "Unfortunately, this will be my last influential papers column. I've been editor for about five years now (how time flies!) and have enjoyed it immensely. I've always found it rewarding to step back and look at why we do the research we do, and this column makes a big contribution to the process of self-examination. Further, I feel that there's a strong need for ways to publicly and explicitly highlight ""quality"" in papers. Criticism is easy, and is the more common experience given the amount of reviewing (and being reviewed) we typically engage in. I look forward to seeing this column in future issues." SIGMOD Record Reminiscences on influential papers. Kenneth A. Ross 2005 Reminiscences on influential papers. SIGMOD Record Reminiscences on influential papers. Kenneth A. Ross,Rada Chirkova,Dimitrios Gunopulos,Rachel Pottinger,Jun Yang,Jingren Zhou 2005 Reminiscences on influential papers. SIGMOD Record Query answering exploiting structural properties. Francesco Scarcello 2005 We review the notion of hypertree width, a measure of the degree of cyclicity of hypergraphs that is useful for identifying and solving efficiently easy instances of hard problems, by exploiting their structural properties. Indeed, a number of relevant problems from different areas, such as database theory, artificial intelligence, and game theory, are tractable when their underlying hypergraphs have small (i.e., bounded by some fixed constant) hypertree width. In particular, we describe how this notion may be used for identifying tractable classes of database queries and answering such queries in an efficient way. SIGMOD Record "Report on the first IEEE international workshop on networking meets databases (NetDB'05)." Cyrus Shahabi,Ramesh Govindan,Karl Aberer 2005 In this report, to the best of our ability, we try to summarize the presentations and discussions that occurred within the First IEEE International Workshop on Networking Meets Databases (NetDB), which was held in Tokyo, Japan on April 8th and 9th, 2005. NetDB was one of the many (11 to be exact) satellite workshops of the IEEE ICDE (International Conference on Data Engineering) 2005 conference. This workshop is one of the very few initiatives bringing the networking and database communities together. The focus research areas of NetDB 2005 were sensor and peer-to-peer networks. SIGMOD Record Integrating databases and workflow systems. Srinath Shankar,Ameet Kini,David J. DeWitt,Jeffrey F.
Naughton 2005 There has been an information explosion in fields of science such as high energy physics, astronomy, environmental sciences and biology. There is a critical need for automated systems to manage scientific applications and data. Database technology is well-suited to handle several aspects of workflow management. Contemporary workflow systems are built from multiple, separately developed components and do not exploit the full power of DBMSs in handling data of large magnitudes. We advocate a holistic view of a WFMS that includes not only workflow modeling but planning, scheduling, data management and cluster management. Thus, it is worthwhile to explore the ways in which databases can be augmented to manage workflows in addition to data. We present a language for modeling workflows that is tightly integrated with SQL. Each scientific program in a workflow is associated with an active table or view. The definition of data products is in relational format, and invocation of programs and querying is done in SQL. The tight coupling between workflow management and data manipulation is an advantage for data-intensive scientific programs. SIGMOD Record A citation-based system to assist prize awarding. Antonis Sidiropoulos,Yannis Manolopoulos 2005 "Citation analysis is performed to evaluate the impact of scientific collections (journals and conferences), publications and scholarly authors. In this paper we investigate alternative methods to provide a generalized approach to rank scientific publications. We use the SCEAS system [12] as a base platform to introduce new methods that can be used for ranking scientific publications. Moreover, we tune our approach along the reasoning of the prizes 'VLDB 10 Year Award' and 'SIGMOD Test of Time Award', which are awarded at the top two database conferences. Our approach can be used to objectively suggest the publications and the respective authors that are more likely to be awarded in the near future at these conferences." SIGMOD Record A survey of data provenance in e-science. Yogesh Simmhan,Beth Plale,Dennis Gannon 2005 Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field. SIGMOD Record The 8 requirements of real-time stream processing. Michael Stonebraker,Ugur Çetintemel,Stanley B. Zdonik 2005 "Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. These stream-based applications include market feed processing and electronic trading on Wall Street, network and infrastructure monitoring, fraud detection, and command and control in military environments.
Furthermore, as the ""sea change"" caused by cheap micro-sensor technology takes hold, we expect to see everything of material significance on the planet get ""sensor-tagged"" and report its state or location in real time. This sensorization of the real world will lead to a ""green field"" of novel monitoring and control applications with high-volume and low-latency processing requirements. Recently, several technologies have emerged---including off-the-shelf stream processing engines---specifically to address the challenges of processing high-volume, real-time data without requiring the use of custom code. At the same time, some existing software technologies, such as main memory DBMSs and rule engines, are also being ""repurposed"" by marketing departments to address these applications. In this paper, we outline eight requirements that system software should meet to excel at a variety of real-time stream processing applications. Our goal is to provide high-level guidance to information technologists so that they will know what to look for when evaluating alternative stream processing solutions. As such, this paper serves a purpose comparable to the requirements papers on relational DBMSs and on-line analytical processing. We also briefly review alternative system software technologies in the context of our requirements. The paper attempts to be vendor neutral, so no specific commercial products are mentioned." SIGMOD Record Developments at ACM. Richard T. Snodgrass 2005 Developments at ACM. SIGMOD Record CMM and TODS. Richard T. Snodgrass 2005 The Capability Maturity Model [4] is an orderly way for organizations to determine the capabilities of their current processes for developing software and to establish priorities for improvement [2]. It defines five levels of progressively more mature process capability [3]. SIGMOD Record Building data mining solutions with OLE DB for DM and XML for analysis. ZhaoHui Tang,Jamie Maclennan,Pyungchul (Peter) Kim 2005 A data mining component is included in Microsoft SQL Server 2000 and SQL Server 2005, one of the most popular DBMSs. This gives data mining technologies a push to move from a niche towards the mainstream. Apart from a few algorithms, the main contribution of SQL Server Data Mining is the implementation of OLE DB for Data Mining. OLE DB for Data Mining is an industry standard led by Microsoft and supported by a number of ISVs. It leverages two existing relational technologies: SQL and OLE DB. It defines a SQL language for data mining based on a relational concept. More recently, Microsoft, Hyperion, SAS and a few other BI vendors formed the XML for Analysis Council. XML for Analysis covers both OLAP and Data Mining. The goal is to allow consumer applications to query various BI packages from different platforms. This paper gives an overview of OLE DB for Data Mining and XML for Analysis. It also shows how to build data mining applications using these APIs. SIGMOD Record Changes to the editorial board. Richard T. Snodgrass 2005 "The December issue should be out soon, if it isn't already. It is a special issue of papers invited from the SIGMOD'04 and PODS'04 conferences. I thank the authors, associate editors, and reviewers (some from the SIGMOD and PODS program committees) for writing, reviewing, revising, reviewing again, and then getting into final form these papers in the amazingly short time of eighteen months. This is a very nice issue, with seven substantive papers."
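The OLE DB for DM entry above (Tang et al.) notes that the standard defines a SQL-style language for creating, training, and querying mining models. The Python sketch below only assembles statements of that general shape so they can be inspected; the model, data source, and column names are invented, the exact dialect accepted by a particular provider may differ, and the step of submitting the statements through an OLE DB for DM or XML for Analysis client is not shown.

    # Sketch of the SQL-style mining statements described in the entry
    # above. All identifiers are invented; a real client would send these
    # strings to an OLE DB for DM / XML for Analysis provider.

    CREATE_MODEL = """
    CREATE MINING MODEL CollegePlans (
        StudentId     LONG  KEY,
        Gender        TEXT  DISCRETE,
        ParentIncome  LONG  CONTINUOUS,
        CollegePlans  TEXT  DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees
    """

    TRAIN_MODEL = """
    INSERT INTO CollegePlans (StudentId, Gender, ParentIncome, CollegePlans)
    OPENQUERY(StudentSource,
              'SELECT StudentId, Gender, ParentIncome, CollegePlans FROM Students')
    """

    PREDICT = """
    SELECT t.StudentId, CollegePlans.CollegePlans
    FROM CollegePlans
    PREDICTION JOIN
      OPENQUERY(StudentSource, 'SELECT * FROM NewStudents') AS t
    ON  CollegePlans.Gender       = t.Gender
    AND CollegePlans.ParentIncome = t.ParentIncome
    """

    if __name__ == "__main__":
        for name, statement in (("create", CREATE_MODEL),
                                ("train", TRAIN_MODEL),
                                ("predict", PREDICT)):
            print(f"-- {name} --{statement}")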
SIGMOD Record "Report on the International Workshop on Pattern Representation and Management (PaRMa'04)." Yannis Theodoridis,Panos Vassiliadis 2005 The increasing ability to quickly collect and cheaply store large volumes of data, and the need for extracting concise information to be efficiently manipulated and intuitively analyzed, are posing new requirements for Database Management Systems (DBMS) in both industrial and scientific applications. A common approach to deal with huge data volumes is to reduce the available information to knowledge artifacts (i.e., clusters, rules, etc.), hereafter called patterns, through data processing methods (pattern recognition, data mining, knowledge extraction). Patterns reduce the number and size of the original information to manageable size while preserving as much as possible its hidden / interesting content. In order to efficiently and effectively deal with patterns, academic groups and industrial consortiums have recently devoted efforts towards modeling, storage, retrieval, analysis and manipulation of patterns with results mainly in the areas of Inductive Databases and Pattern Base Management Systems (PBMS). SIGMOD Record Report on the workshop on wrapper techniques for legacy data systems. Philippe Thiran,Tore Risch,Carmen Costilla,Jean Henrard,Thomas Kabisch,Johan Petrini,Willem-Jan van den Heuvel,Jean-Luc Hainaut 2005 This report summarizes the presentations and discussions of the first workshop on Wrapper Techniques for Legacy Data Systems which was held in Delft on November 12 2004. This workshop was co-located with the 2004 WCRE conference. This workshop entails to our best knowledge the first in its kind, concentrating on challenging research issues regarding the development of wrappers for legacy data systems. SIGMOD Record Nested intervals tree encoding in SQL. Vadim Tropashko 2005 Nested Intervals generalize Nested Sets. They are immune to hierarchy reorganization problem. They allow answering ancestor path hierarchical queries algorithmically - without accessing the stored hierarchy relation. SIGMOD Record Database research at Bilkent University. Özgür Ulusoy 2005 Database research at Bilkent University. SIGMOD Record Efficient calendar based temporal association rule. Keshri Verma,Om Prakash Vyas 2005 Associationship is an important component of data mining. In real world data the knowledge used for mining rule is almost time varying. The item have the dynamic characteristic in terms of transaction, which have seasonal selling rate and it hold time-based associationship with another item. It is also important that in database, some items which are infrequent in whole dataset but those may be frequent in a particular time period. If these items are ignored then associationship WVW200R3100221-398 result will no longer be accurate. To restrict the time based associationship calendar based pattern can be used [YPXS03]. A calendar unit such as months and days, clock units, such as hours and seconds & specialized units, such as business days and academic years, play a major role in a wide range of information system applications[BX00].Most of the popular associationship rule mining methods are having performance bottleneck for database with different characteristics. Some of the methods are efficient for sparse dataset where as some are good for a dense dataset. 
Our focus is an effective time-sensitive algorithm based on the H-struct data structure, called temporal H-mine, which takes advantage of this data structure and dynamically adjusts links during the mining process [PHNTY01]. It is faster in traversal and has the advantage of a precisely predictable space overhead. It can be scaled up to large databases by database partitioning, and when the dataset becomes dense, a conditional temporal FP-tree can be constructed dynamically as part of mining. SIGMOD Record Scheduling of scientific workflows in the ASKALON grid environment. Marek Wieczorek,Radu Prodan,Thomas Fahringer 2005 "Scheduling is a key concern for the execution of performance-driven Grid applications. In this paper we comparatively examine different existing approaches for scheduling of scientific workflow applications in a Grid environment. We evaluate three algorithms, namely genetic, HEFT, and a simple ""myopic"" one, and compare incremental workflow partitioning against the full-graph scheduling strategy. We demonstrate experiments using real-world scientific applications covering both balanced (symmetric) and unbalanced (asymmetric) workflows. Our results demonstrate that full-graph scheduling with the HEFT algorithm performs best compared to the other strategies examined in this paper." SIGMOD Record Databases in Virtual Organizations: a collective interview and call for researchers. Marianne Winslett 2005 "When the Databases in Virtual Organizations (DIVO) workshop convened after SIGMOD 2004 in Paris, many of us attending weren't sure what a virtual organization was, much less what relevance it could have to database research. Five hours later, as the lights snapped off in the rest of the building and the maintenance crew hovered patiently outside our meeting room, we had become a group with a mission: to let the database research community know what an incredible idea generator and testbed virtual organizations could be for research on information integration and data security." SIGMOD Record Bruce Lindsay speaks out: on System R, benchmarking, life as an IBM fellow, the power of DBAs in the old days, why performance still matters, Heisenbugs, why he still writes code, singing pigs, and more. Marianne Winslett 2005 "Welcome to ACM SIGMOD Record's series of interviews with distinguished members of the database community. I'm Marianne Winslett, and today we're in San Diego at the 2003 SIGMOD and PODS conferences. I have here with me Bruce Lindsay, a member of the research staff at IBM Almaden Research Center. Bruce is well-known for his work on relational databases, which has been very influential both inside and outside of IBM, starting with his work on the System R project. Bruce is an IBM Fellow and his PhD is from Berkeley. So Bruce, welcome!" SIGMOD Record John Wilkes speaks out: on what the DB community needs to know about storage, how the DB and storage communities can join forces and change the world, and more. Marianne Winslett 2005 "Welcome to this installment of ACM SIGMOD Record's series of interviews with distinguished members of the database community. I'm Marianne Winslett, and today [February 2004] we are at the Department of Computer Science at the University of Illinois at Urbana-Champaign. I have here with me John Wilkes, who is an HP Fellow in the Internet Systems and Storage Laboratory at Hewlett Packard Laboratories in Palo Alto, California, where his research focuses on the design and management of storage systems.
John is a member of the editorial board of ACM Transactions on Computer Systems, and until recently he was a member of the Technical Council of the Storage Network Industry Association. John is an ACM Fellow, and his PhD is from the University of Cambridge. So, John, welcome!" SIGMOD Record Christos Faloutsos speaks out: on power laws, fractals, the future of data mining, sabbaticals, and more. Marianne Winslett 2005 "Welcome to this installment of ACM SIGMOD Record's series of interviews with distinguished members of the database community. I'm Marianne Winslett, and today I have here with me Christos Faloutsos, who is a professor of computer science at Carnegie Mellon University. Christos received the Presidential Young Investigator Award from the National Science Foundation in 1989. He received the 1997 VLDB Ten Year Paper Award for his paper on R+ trees, and the SIGMOD 1994 Best Paper Award for a paper on fast subsequence matching in time series databases. Christos is a member of the SIGKDD Executive Committee, and he has wide-ranging interests in data mining, database performance, and spatial and multimedia databases. His PhD is from the University of Toronto. So, Christos, welcome!" SIGMOD Record A unified spatiotemporal schema for representing and querying moving features. Rong Xie,Ryosuke Shibasaki 2005 "A conceptual schema is essential for effectively and efficiently managing and manipulating the dynamically and continuously changing data and information of moving features. In this paper, a spatiotemporal schema (STS) is proposed to describe the characteristics of moving features and to efficiently manage moving-feature data, covering the necessary aspects: abstract data types, dynamic attributes, spatiotemporal topological relationships and a minimum set of spatiotemporal operations. On the basis of the proposed schema, a spatiotemporal object-based class library (STOCL) is further developed for the implementation of STS, which allows development of various spatiotemporal queries and simulations. The conceptual schema and implemented object library are then applied to the development of passengers' movement simulation and pattern analysis in railway stations in Tokyo." SIGMOD Record A taxonomy of scientific workflow systems for grid computing. Jia Yu,Rajkumar Buyya 2005 With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets, and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. The taxonomy not only highlights the design and engineering similarities and differences of the state of the art in Grid workflow systems, but also identifies the areas that need further research. SIGMOD Record A notation and system for expressing and executing cleanly typed workflows on messy scientific data. Yong Zhao,James E. Dobson,Ian T. Foster,Luc Moreau,Michael Wilde 2005 "The description, composition, and execution of even logically simple scientific workflows are often complicated by the need to deal with ""messy"" issues like heterogeneous storage formats and ad-hoc file system structures.
We show how these difficulties can be overcome via a typed, compositional workflow notation within which issues of physical representation are cleanly separated from logical typing, and by the implementation of this notation within the context of a powerful runtime system that supports distributed execution. The resulting notation and system are capable both of expressing complex workflows in a simple, compact form, and of enacting those workflows in distributed environments. We apply our technique to cognitive neuroscience workflows that analyze functional MRI image data, and demonstrate significant reductions in code size relative to other approaches."
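The final entry above (Zhao et al.) keeps a workflow's logical dataset types separate from the physical layout of the data on disk. As a rough illustration of that separation only (our own sketch with invented names and directory layout, not the notation from the paper), a mapper object can resolve a logically typed collection to concrete files at run time:

    # Sketch: a workflow step is written against a logical type (volumes
    # of a study), while a mapper resolves that type to physical files.
    # Class, directory, and function names are invented for illustration.
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Iterator

    @dataclass(frozen=True)
    class Volume:
        """Logical type: one image volume belonging to a subject."""
        subject: str
        path: Path                      # physical location, filled in by the mapper

    class SubjectMapper:
        """Resolves the logical collection 'all volumes of a study' to files."""
        def __init__(self, root: Path, pattern: str = "*.img"):
            self.root, self.pattern = root, pattern

        def volumes(self) -> Iterator[Volume]:
            if not self.root.exists():
                return
            for subject_dir in sorted(p for p in self.root.iterdir() if p.is_dir()):
                for f in sorted(subject_dir.glob(self.pattern)):
                    yield Volume(subject=subject_dir.name, path=f)

    def align(volume: Volume) -> Volume:
        # Stand-in for a real processing step; a workflow engine would
        # schedule many such calls on distributed resources.
        print(f"aligning {volume.subject}: {volume.path.name}")
        return volume

    if __name__ == "__main__":
        study = SubjectMapper(Path("study_data"))   # hypothetical data root
        aligned = [align(v) for v in study.volumes()]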