Perm (Provenance Extension of the Relational Model) - Efficient Provenance Support for Relational Databases
TRAMP (TRAnsformation Mapping Provenance) - Understanding the Behavior of Schema Mappings through Provenance and Meta-querying
Vagabond - Automatic Generation of Explanations for Data Exchange Errors
Ariadne - Computing fine-grained Provenance for Data Streams using Operator Instrumentation
Professional Service
Program Committee Member
2012
Alberto Mendelzon Workshop on Foundations of Data Management (AMW)
2011
ACM SIGMOD International Conference on Management of Data (SIGMOD)
Journal Reviews
2011
ACM Transactions on Database Systems (TODS)
IEEE Transactions on Knowledge and Data Engineering (TKDE)
External Reviewer
2010
IEEE International Conference on Data Engineering (ICDE)
2009
Database Systems for Business, Technology, and Web (BTW)
2008
IEEE International Conference on Data Engineering (ICDE)
2007
International Conference on Very Large Databases (VLDB)
ACM SIGMOD International Conference on Management of Data (SIGMOD)
International Conference on Objects, Models, Components, Patterns (TOOLS)
International Conference on Business Information Systems (BIS)
2006
International Conference on Very Large Databases (VLDB)
Publications
2011 (5)
Reexamining Some Holy Grails of Data Provenance. Glavic, B., and Miller, R. J. 2011. In 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP). Abstract:
We reconsider some of the explicit and implicit properties that underlie well-established definitions of data provenance semantics. Previous work on comparing provenance semantics has mostly focused on expressive power (does the provenance generated by a certain semantics subsume the provenance generated by other semantics) and on understanding whether a semantics is insensitive to query rewrite (i.e., do equivalent queries have the same provenance). In contrast, we try to investigate why certain semantics possess specific properties (like insensitivity) and whether these properties are always desirable. We present a new property, stability with respect to query language extension, that, to the best of our knowledge, has not been isolated and studied on its own.
In this paper, we present Vagabond, a system that uses a novel holistic approach to help users understand and debug data exchange scenarios. Developing such a scenario is a complex and labor-intensive process where errors are often only revealed in the target instance produced as the result of this process. This makes it very hard to debug such scenarios, especially for non-power users. Vagabond aids a user in debugging by automatically generating possible explanations for target instance errors identified by the user.
Declarative Serializable Snapshot Isolation. Tilgner, C.; Glavic, B.; Böhlen, M. H.; and Kanne, C.-C. 2011. In Proceedings of the 15th International Conference on Advances in Database and Information Systems (ADBIS), 170-184. Abstract:
Snapshot isolation (SI) is a popular concurrency control protocol, but it permits non-serializable schedules that violate database integrity. The Serializable Snapshot Isolation (SSI) protocol ensures (view) serializability by preventing pivot structures in SI schedules. In this paper, we leverage the SSI approach and develop the Declarative Serializable Snapshot Isolation (DSSI) protocol, an SI protocol that guarantees serializable schedules. Our approach requires no analysis of application programs or changes to the underlying DBMS. We present an implementation and prove that it ensures serializability.
The current state of the art for provenance in data stream management systems (DSMS) is to provide provenance at a high level of abstraction (such as which sensors in a sensor network an aggregated value is derived from). This limitation was imposed by high-throughput requirements and an anticipated lack of application demand for more detailed provenance information. In this work, we first demonstrate by means of well-chosen use cases that this is a misconception, i.e., coarse-grained provenance is in fact insufficient for many application domains. We then analyze the requirements and challenges involved in integrating support for fine-grained provenance into a streaming system and outline a scalable solution for supporting tuple-level provenance in DSMS.
Smile: Enabling Easy and Fast Development of Domain-Specific Scheduling Protocols. Tilgner, C.; Glavic, B.; Böhlen, M. H.; and Kanne, C.-C. 2011. In Proceedings of the 28th British National Conference on Databases (BNCOD), 128-131. Abstract:
Modern server systems schedule large amounts of concurrent requests constrained by, e.g., correctness criteria and service-level agreements. Since standard database management systems provide only limited consistency levels, the state of the art is to develop schedulers imperatively, which is time-consuming and error-prone. In this poster, we present Smile (declarative Scheduling MIddLEware), a tool for developing domain-specific scheduling protocols declaratively. Smile decreases the effort to implement and adapt such protocols because it abstracts from low-level scheduling details, allowing developers to focus on the protocol implementation. We demonstrate the advantages of our approach by implementing a domain-specific use case protocol.
2010 (4)
Perm: Efficient Provenance Support for Relational Databases. Glavic, B. 2010. Ph.D. Thesis, University of Zurich. Abstract:
In many application areas like scientific computing, data-warehousing, and data integration, detailed information about the origin of data is required. This kind of information is often referred to as data provenance. The provenance of a piece of data, a so-called data item, includes information about the source data from which it is derived and the transformations that lead to its creation and current representation. In the context of relational databases, provenance has been studied both from a theoretical and algorithmic perspective. Yet, in spite of the advances made, there are very few practical systems available that support generating, querying and storing provenance information (we refer to such systems as provenance management systems, or PMS). These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested sub-queries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. Furthermore, existing approaches use different data models to represent provenance and the data for which provenance is computed (normal data). This has the intrinsic disadvantage that a new query language has to be developed for querying provenance information. Naturally, such a query language is not as powerful and mature as, e.g., SQL. In this thesis we present Perm, a novel relational provenance management system that addresses the shortcomings of existing approaches discussed above. The underlying idea of Perm is to represent provenance information as standard relations and to generate and query it using standard SQL queries: "Use SQL to compute and query the provenance of SQL queries". Perm is implemented on top of PostgreSQL, extending its SQL dialect with provenance features that are implemented as query rewrites.
This approach enables the system to take full advantage of the advanced query optimizer of PostgreSQL and provide full SQL query support for provenance information. Several important steps were necessary to realize our vision of a "purely relational" provenance management system that is capable of generating provenance information for complex SQL queries. We developed new notions of provenance that handle SQL constructs not covered by the standard definitions of provenance. Based on these provenance definitions, rewrite rules for relational algebra expressions are defined for transforming an algebra expression q into an algebra expression that computes the provenance of q (these rewrite rules are proven to produce correct and complete results). The implementation of Perm, based on this solid theoretical foundation, applies a variety of novel optimization techniques that reduce the cost of some intrinsically expensive provenance operations. By applying the Perm system to schema mapping debugging - a prominent use case for provenance - and extensive performance measurements we confirm the feasibility of our approach and the superiority of Perm over alternative approaches.
TRAMP: Understanding the Behavior of Schema Mappings through Provenance. Glavic, B.; Alonso, G.; Miller, R. J.; and Haas, L. M. 2010. Proceedings of the Very Large Data Bases Endowment (PVLDB), 3(1):1314-1325. Abstract:
Though partially automated, developing schema mappings remains a complex and potentially error-prone task. In this paper, we present TRAMP (TRAnsformation Mapping Provenance), an extensive suite of tools supporting the debugging and tracing of schema mappings and transformation queries. TRAMP combines and extends data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and those transformations and mappings that produced that data. In addition we provide query support for transformations, data, and all forms of provenance. We formally define transformation and mapping provenance, present an efficient implementation of both forms of provenance, and evaluate the resulting system through extensive experiments.
Formal Foundation of Contribution Semantics and Provenance Computation through Query Rewrite in TRAMP. Glavic, B. 2010. University of Zurich.
2009 (3)
The Perm Provenance Management System in Action. Glavic, B., and Alonso, G. 2009. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD) (Demonstration Track), 1055-1058. Abstract:
In this demonstration we present the Perm provenance management system (PMS). Perm is capable of computing, storing and querying provenance information for the relational data model. Provenance is computed by using query rewriting techniques to annotate tuples with provenance information. Thus, provenance data and provenance computations are represented as relational data and queries and, hence, can be queried, stored and optimized using standard relational database techniques. This demo shows the complete Perm system and lets attendees examine in detail the process of query rewriting and provenance retrieval in Perm, the most complete data provenance system available today. For example, Perm supports lazy and eager provenance computation, external provenance and various contribution semantics.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting. Glavic, B., and Alonso, G. 2009. In Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), 174-185. Abstract:
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information inducing only a small overhead on normal operations.
Provenance for Nested Subqueries. Glavic, B., and Alonso, G. 2009. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT), 982-993. Abstract:
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
2008 (1)
Clustering Multidimensional Sequences in Spatial and Temporal Databases. Assent, I.; Krieger, R.; Glavic, B.; and Seidl, T. 2008. International Journal on Knowledge and Information Systems (KAIS), 16(1):29-51. Abstract:
Many environmental, scientific, technical or medical database applications require effective and efficient mining of time series, sequences or trajectories of measurements taken at different time points and positions forming large temporal or spatial databases. Particularly the analysis of concurrent and multidimensional sequences poses new challenges in finding clusters of arbitrary length and varying number of attributes. We present a novel algorithm capable of finding parallel clusters in different subspaces and demonstrate our results for temporal and spatial applications. Our analysis of structural quality parameters in rivers is successfully used by hydrologists to develop measures for river quality improvements.
2007 (1)
Data Provenance: A Categorization of Existing Approaches. Glavic, B., and Dittrich, K. R. 2007. In Proceedings of Datenbanksysteme in Business, Technologie und Web (BTW), 227-241. Abstract:
In many application areas like e-science and data-warehousing, detailed information about the origin of data is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item includes information about the processes and source data items that lead to its creation and current representation. The diversity of data representation models and application domains has led to a number of more or less formal definitions of provenance. Most of them are limited to a special application domain, data representation model or data processing facility. Not surprisingly, the associated implementations are also restricted to some application domain and depend on a special data model. In this paper we give a survey of data provenance models and prototypes, present a general categorization scheme for provenance models and use this categorization scheme to study the properties of the existing approaches. This categorization enables us to distinguish between different kinds of provenance information and could lead to a better understanding of provenance in general. Besides the categorization of provenance types, it is important to include the storage, transformation and query requirements for the different kinds of provenance information and application domains in our considerations. The analysis of existing approaches will assist us in revealing open research problems in the area of data provenance.
2006 (2)
Spatial Multidimensional Sequence Clustering. Assent, I.; Krieger, R.; Glavic, B.; and Seidl, T. 2006. In 1st International Workshop on Spatial and Spatio-temporal Data Mining (SSTDM) collocated with ICDM, 343-348. Abstract:
Measurements at different time points and positions in large temporal or spatial databases require effective and efficient data mining techniques. For several parallel measurements, finding clusters of arbitrary length and number of attributes poses additional challenges. We present a novel algorithm capable of finding parallel clusters in different structural quality parameter values for river sequences used by hydrologists to develop measures for river quality improvements.
sesam: Ensuring Privacy for an Interdisciplinary Longitudinal Study. Glavic, B., and Dittrich, K. 2006. In Workshop Elektronische Datentreuhänderschaft - Anwendungen, Verfahren, Grundlagen collocated with GI Jahrestagung, 736-743, Dresden. Abstract:
Most medical, biological and social studies face the problem of storing information about subjects for research purposes without violating the subject's privacy. In most cases it is not possible to remove all information that could be linked to a subject, because some of this information is needed for the research itself. This fact holds especially for longitudinal studies, which collect data about a subject at different times and places. Longitudinal studies need to link different data about a specific subject, collected at different times for research and administration use. In this paper we present the security concept proposed for sesam, a longitudinal interdisciplinary study that analyses the social, biological and psychological risk factors for the development of psychological diseases. Our security concept is based on pseudonymisation, encrypted data transfer and an electronic data custodianship. This paper is mainly a case study, and some of the security problems that emerged in the context of sesam may not occur in other studies. Nevertheless, we believe that an adapted version of our approach could be used in other application scenarios as well.
2005 (1)
Subspace Sequence Clustering - Datamining zur Entscheidungsunterstützung in der Hydrologie. Glavic, B. 2005. In Proceedings of Database Systems for Business, Technology, and Web (BTW) (Student Track), 15-17.