2012: Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems

The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applications that scale without sacrificing ACID guarantees [7, 9]. The performance of these DBMSs is predicated on the existence of an optimal database design that is tailored for the unique characteristics of OLTP workloads. Deriving such designs for modern DBMSs is difficult, especially for enterprise-class OLTP systems, since they impose extra challenges: the use of stored procedures, the need for load balancing in the presence of time-varying skew, complex schemas, and deployments with a larger number of partitions.

To this purpose, we present a novel approach to automatically partitioning databases for enterprise-class OLTP systems that significantly extends the state of the art by: (1) minimizing the number of distributed transactions, while concurrently mitigating the effects of temporal skew in both the data distribution and accesses, (2) extending the design space to include replicated secondary indexes, (3) organically handling stored procedure routing, and (4) scaling in schema complexity, data size, and number of partitions. This effort builds on two key technical contributions: an analytical cost model that can be used to quickly estimate the relative coordination cost and skew for a given workload and a candidate database design, and an informed exploration of the huge solution space based on large neighborhood search. To evaluate our methods, we integrated our database design tool with a high-performance parallel, main-memory DBMS and compared our methods against both popular heuristics and a state-of-the-art research prototype [17]. Using a diverse set of benchmarks, we show that our approach improves throughput by up to a factor of 16× over these other approaches.
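
To make the search concrete, here is a toy Python sketch of the large-neighborhood-search idea with a deliberately simplified cost function (fraction of distributed transactions plus a skew penalty). It illustrates the overall approach only; it is not the paper's actual cost model or algorithm, and the 0.5 skew weight and neighborhood size are invented for the example.

```python
import random

# Toy setup (not the paper's model): each transaction is a set of tuple ids,
# and a design assigns every tuple id to a partition.

def cost(design, workload, n_parts):
    """Fraction of distributed transactions plus a simple skew penalty."""
    distributed = 0
    load = [0] * n_parts
    for txn in workload:
        parts = {design[t] for t in txn}
        if len(parts) > 1:
            distributed += 1
        for p in parts:
            load[p] += 1
    frac_dist = distributed / len(workload)
    skew = max(load) / (sum(load) / n_parts) - 1 if sum(load) else 0.0
    return frac_dist + 0.5 * skew

def lns(workload, n_tuples, n_parts, iters=500, seed=0):
    """Large neighborhood search: repeatedly relax ~1/4 of the assignments."""
    rng = random.Random(seed)
    design = [t % n_parts for t in range(n_tuples)]  # initial round-robin design
    best = cost(design, workload, n_parts)
    for _ in range(iters):
        cand = design[:]
        for t in rng.sample(range(n_tuples), max(1, n_tuples // 4)):
            cand[t] = rng.randrange(n_parts)         # re-assign the relaxed tuples
        c = cost(cand, workload, n_parts)
        if c <= best:                                # keep improving (or equal) designs
            design, best = cand, c
    return design, best
```

On a workload where transactions always co-access fixed pairs of tuples, the search tends to colocate each pair while the skew term keeps the pairs spread across partitions.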


@inproceedings{pavlo2012skew,
  author    = {Andrew Pavlo and Carlo Curino and Stanley B. Zdonik},
  title     = {Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems},
  booktitle = {SIGMOD Conference},
  year      = {2012},
  pages     = {61-72},
  ee        = {http://doi.acm.org/10.1145/2213836.2213844},
}

2011: No bits left behind

This is a fun vision paper I wrote with Eugene Wu and Sam Madden on how we should be more careful about leaving no bits behind (in the memory hierarchy); we suggest a few ideas/improvements we can make to DBMSs today.




One of the key tenets of database system design is making efficient use of storage and memory resources. However, existing database system implementations are actually extremely wasteful of such resources; for example, most systems leave a great deal of empty space in tuples, index pages, and data pages, and spend many CPU cycles reading cold records from disk that are never used. In this paper, we identify a number of such sources of waste, and present a series of techniques that limit this waste (e.g., forcing better memory locality for hot data and using empty space in index pages to cache popular tuples) without substantially complicating interfaces or system design. We show that these techniques effectively reduce memory requirements for real scenarios from the Wikipedia database (by up to 17.8×) while increasing query performance (by up to 8×).

You can find the paper here: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper23.pdf

@inproceedings{wu2011nobits,
  author    = {Eugene Wu and Carlo Curino and Samuel Madden},
  title     = {No bits left behind},
  booktitle = {CIDR},
  year      = {2011},
  pages     = {187-190},
  ee        = {http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper23.pdf},
}



An extensible testbed for benchmarking relational databases

Together with Andy Pavlo, Djellel Difallah, and Phil Cudré-Mauroux, I just completed an interesting project: an extensible testbed for benchmarking relational databases.

This was motivated by the huge amount of time each one of us has wasted in our research building workloads (gathering real or realistic data and query loads, implementing driver infrastructures, collecting statistics, plotting statistics).

To reduce this wasted effort for others, and to promote repeatability and ease of comparison of scientific results, we pulled together our independent efforts and created an extensible driving infrastructure for OLTP/Web workloads, capable of:

  • Targeting all major relational DBMSs via JDBC (tested on MySQL, Postgres, Oracle, SQLServer, DB2, HSQLDB)
  • Precise rate control (lets you define, and change over time, the rate at which requests are submitted)
  • Precise transactional mixture control (lets you define, and change over time, the percentage of each transaction type)
  • Access distribution control (lets you emulate evolving hot-spots, temporal skew, etc.)
  • Trace-based execution support (ideal for handling real data)
  • Extensible design
  • Elegant management of SQL dialect translations (to target various DBMSs)
  • Stored-procedure-friendly architecture
  • 10 included workloads:
    • AuctionMark
    • Epinions
    • JPAB
    • Resource Stresser
    • SEATS
    • TATP
    • TPCC
    • Twitter
    • Wikipedia
    • YCSB
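
As a rough illustration of the rate- and mixture-control features above, here is a small Python sketch (hypothetical code, not the testbed's actual implementation) that samples the next transaction type from a target percentage mix and lays out submission timestamps for a fixed request rate; the TPC-C-style weights in the usage below are only an example.

```python
import random

def next_txn(weights, rng):
    """Sample a transaction type according to a percentage mix."""
    r = rng.uniform(0, sum(weights.values()))
    acc = 0.0
    for name, w in weights.items():
        acc += w
        if r <= acc:
            return name
    return name  # guard against the floating-point edge at the upper boundary

def schedule(rate_per_sec, n):
    """Submission timestamps (in seconds) for n requests at a fixed rate."""
    return [i / rate_per_sec for i in range(n)]

# Changing the mix or the rate over time then amounts to swapping `weights`
# or `rate_per_sec` between benchmark phases.
```

For example, `schedule(100, 5)` yields `[0.0, 0.01, 0.02, 0.03, 0.04]`, and drawing from `{"NewOrder": 45, "Payment": 43, "OrderStatus": 4, "Delivery": 4, "StockLevel": 4}` reproduces a TPC-C-like mix in the long run.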

If you are interested in using our testbed, or in contributing to it, visit: http://oltpbenchmark.com/

A relational rant about HBase/Pig/Hadoop

Disclaimer: my background is heavily skewed in favor of relational databases, and I am fairly new to the entire Hadoop/Pig/HBase world.

For a project I am involved in, I am in the process of setting up a 40-node Hadoop/Pig/HBase cluster. I am learning the tools as I go, and I feel like sharing the pains and excitements of this phase.

While I am 100% sure that someone with more experience in this field would have avoided many of the dumb errors I made, I represent an ideal example of how easy/hard it is to pick up Hadoop/Pig/HBase: zero background in Hadoop/HBase/Pig, strong background in related fields (DB and Java programming).

I am running HBase 0.90.3 on Hadoop 0.20.3, and Pig 0.9.1.

RANT 1: No documentation

My first rant is about the painful lack of documentation, examples, and tutorials. I was expecting to join a “modern” project run by a hip community, so I was expecting amazing tools and great documentation. Instead, the code is mostly bare-bones, there are only a few, partial examples around, and most of the knowledge is trapped in mailing lists and bug-tracking software. Not a very pleasant experience, especially because the rate of evolution of the project is high enough that most of what you find is already obsolete.

RANT 2: Classpath, what a nightmare

One of the most horrible things to deal with is, surprise surprise: the classpath! Using Pig and Hadoop, the actual active classpath is born as a funky combination of many variables, registrations from the pig scripts, and scanning of directories. This would not be a problem in itself; what made my life miserable was the following: you load the wrong .jar file (e.g., a slightly different version of hadoop or hbase or pig or guava or commons, etc.) and the error that comes out is not something reasonable like “wrong class version” but rather various bizarre things like “IndexOutOfBoundsException”, “EOFException”, and many others… this is annoying and rather hard to debug (if you don’t expect it). Especially when it is some obscure Maven script that is sneaking the wrong jar into a directory you don’t even know exists, much less suspect that every jar in that directory is part of the classpath. Another interesting thing I observed is that Pig 0.9.1’s “register” does not always work; sometimes you have to put the jar both in the register statement and in the -cp you pass to Pig for it to work. Oh joy…

At least in my experience, having a more systematic way to load and check the jars in the classpath would save lots of time, especially when you first start (I am now very careful to always check the classpath any time there is an error).
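
In that spirit, here is a tiny Python helper that sketches the check I have in mind: scan the directories that feed the classpath and flag any jar present in more than one version. The version-parsing regex is a simplifying assumption and will not cover every jar naming scheme.

```python
import os
import re
from collections import defaultdict

def find_conflicting_jars(dirs):
    """Map jar base name -> sorted versions, for jars seen in more than one version."""
    versions = defaultdict(set)
    for d in dirs:
        if not os.path.isdir(d):
            continue  # skip directories that do not exist on this box
        for fname in os.listdir(d):
            # assumes "name-1.2.3.jar" style naming; other schemes are ignored
            m = re.match(r"(.+?)-(\d[\w.]*)\.jar$", fname)
            if m:
                versions[m.group(1)].add(m.group(2))
    return {name: sorted(v) for name, v in versions.items() if len(v) > 1}
```

Pointing this at the lib directories of Hadoop, Pig, and HBase before running a job would have saved me several of the bizarre exceptions above.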

RANT 3: Poor error propagation

I run in a very privileged setting: I am root on every box in the cluster, and I debugged many of my errors by looking at the logs of ZooKeeper, the JobTracker, Pig, the HBase master nodes, and so on. But the propagation of errors seems rather bad; in a normal “shared grid” environment, where you don’t necessarily have access to all these logs, I would have had an even harder time debugging my code (e.g., the pig script completes successfully, and the error is lost in the JobTracker log).

RANT 4: HBase is (not) easy to configure/run

I am doing the installation with the precious help of a skilled sysadmin, so I was expecting everything to run rather smoothly. After all, with HBase I am giving up almost everything I love in life (transactions, secondary indexes, joins) in order to get something super-scalable, robust, and zero-admin. Well, not really… Getting HBase to work decently at scale seems to require rather manual work of pampering and convincing the thing that “it’s ok” to go fast.

I am loading data into HBase via Pig’s HBaseStorage and/or importtsv (after spending enough time to find the combination of versions that work properly together), and I was expecting the thing to scale linearly and trivially (and to load at some 10k rows per box per second). At my first attempt the pig script was failing because of too many waits on a region, or something like that. Even after I pre-split my table across many regions and generated keys as MD5(id)+id, there is a huge amount of skew (50% of requests hit a single node); but after some more parameter tuning, the thing at least completes the loading. I will do more performance debugging in the future. (I plan to generate HFiles directly and bulk load.)
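
For the curious, the key scheme mentioned above can be sketched in a few lines of Python. The two-hex-character prefix length and the split-point computation are my illustrative choices, not HBase defaults.

```python
import hashlib

def salted_key(row_id):
    """Prefix the row key with a hash of the id so sequential ids spread out."""
    prefix = hashlib.md5(str(row_id).encode()).hexdigest()[:2]
    return f"{prefix}-{row_id}"

def split_points(n_regions):
    """Evenly spaced region boundaries over the 00..ff prefix space."""
    step = 256 // n_regions
    return [format(i * step, "02x") for i in range(1, n_regions)]
```

For example, `split_points(4)` gives `['40', '80', 'c0']`, so a table pre-split at those boundaries should receive salted keys roughly uniformly (modulo the residual skew I complained about above).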

Bottom line: I am OK with giving up all the fancy relational transactional guarantees for a worry-free, super-scalable, zero-admin, self-balancing something… I am just not sure HBase is quite there yet. I feel there is a lot of hype associated with it.

Altogether I am a bit underwhelmed by HBase; I haven’t seen anything I could not do about 5X faster with a set of MySQL boxes… OK, I know MySQL performance tuning much better than anything HBase… but I will keep working on tuning the performance, and at some point I will run some comparisons. I am still skeptical.


Now that I have released some steam with my rants, let me say a few words about what’s nice… What is very nice for me is Pig, because it allows me to write very few lines of code that automagically get distributed and run like crazy. Let me reformulate: it is Pig+Hadoop… fine, it is Pig+Hadoop+HBase. I have to admit that whenever you get the entire software stack to play nice, it is very exciting to be able to hack up some code in a few minutes and have it parallelized across the entire cluster (and even more exciting on our main grid, where you can spawn thousands of mappers in parallel).

Altogether, I am probably spoiled by having worked in the relational world for a long time, where the systems are 30+ years old and thus many of these small issues have been solved (while plenty of worse problems are lurking in the darkness of an unpredictable optimizer or a hard-to-scale data model, etc.). But this brings me to the moral of this post:


My word of advice to people who plan to start using HBase/Pig/Hadoop: by all means get into it, it is an exciting world, but be ready to deal with a hackish, unpolished, untamed, kicking-and-screaming system… be ready to open the Java source code of the system to figure out why things are not working, or how to use an API… be ready to have interfaces that are as stable as butterflies… If you are ready for that, you will not be disappointed.


End of a Post-Doc code release spree..

As an end-of-post-doc activity, I released the source code of several pieces of software:







6.830 Material

In Fall 2010 I taught the database class 6.830 at MIT in collaboration with Mike Stonebraker. You can find the material for the class at: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/



Schema Evolution: Datasets…

With the help of various students of Prof. Carlo Zaniolo at UCLA, I have gathered a long list of datasets for testing schema evolution systems. You can find a brief summary of the systems and links to their schemas at the following link: Schema Evolution Benchmark Datasets

Enjoy! And if you use this in a paper, don’t forget to acknowledge/cite us…



2011: “Relational Cloud: a Database Service for the Cloud”

Conference: CIDR 2011


Carlo Curino, Evan P. C. Jones, Raluca Ada Popa, Nirmesh Malviya, Eugene Wu, Sam Madden, Hari Balakrishnan, Nickolai Zeldovich.


ABSTRACT: This paper introduces a new transactional “database-as-a-service” (DBaaS) called Relational Cloud. A DBaaS promises to move much of the operational burden of provisioning, configuration, scaling, performance tuning, backup, privacy, and access control from the database users to the service operator, offering lower overall costs to users. Early DBaaS efforts include Amazon RDS and Microsoft SQL Azure, which are promising in terms of establishing the market need for such a service, but which do not address three important challenges: efficient multi-tenancy, elastic scalability, and database privacy. We argue that these three challenges must be overcome before outsourcing database software and management becomes attractive to many users, and cost-effective for service providers. The key technical features of Relational Cloud include: (1) a workload-aware approach to multi-tenancy that identifies the workloads that can be co-located on a database server, achieving higher consolidation and better performance than existing approaches; (2) the use of a graph-based data partitioning algorithm to achieve near-linear elastic scale-out even for complex transactional workloads; and (3) an adjustable security scheme that enables SQL queries to run over encrypted data, including ordering operations, aggregates, and joins. An underlying theme in the design of the components of Relational Cloud is the notion of workload awareness: by monitoring query patterns and data accesses, the system obtains information useful for various optimization and security functions, reducing the configuration effort for users and operators.
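
The graph-based partitioning in point (2) starts from a workload-derived graph. The toy Python sketch below builds that input (nodes are tuples; edge weights count how often two tuples are accessed by the same transaction, so a min-cut partitioning colocates frequent co-accesses); the actual partitioning step, e.g. via an external graph partitioner, is out of scope here. This is an illustration of the idea, not Relational Cloud's code.

```python
from collections import Counter
from itertools import combinations

def coaccess_graph(workload):
    """Edge weights: number of transactions that touch both endpoints."""
    edges = Counter()
    for txn in workload:
        # one undirected edge per pair of distinct tuples in the transaction
        for a, b in combinations(sorted(set(txn)), 2):
            edges[(a, b)] += 1
    return edges
```

Feeding these weighted edges to a min-cut partitioner then yields partitions in which heavily co-accessed tuples land together, which is what keeps most transactions single-sited.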



2011: “Workload-aware Database Monitoring and Consolidation”

Conference: SIGMOD 2011


Carlo Curino, Evan P. C. Jones, Sam Madden, Hari Balakrishnan



In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation.

Our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12× higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and our institution, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.
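
The gain over VM-based consolidation comes from packing against the combined time-varying utilization of the workloads rather than per-VM peak reservations. The Python sketch below illustrates that idea with a greedy first-fit over CPU-utilization traces; Kairos itself solves a non-linear optimization with richer resource models, so this is only a simplified illustration with invented names.

```python
def consolidate(traces, capacity=1.0):
    """Greedy first-fit: pack workloads (CPU-utilization time series) onto as
    few servers as possible, keeping each server's combined series <= capacity
    at every time step. Returns the number of servers used."""
    servers = []  # each server holds the summed utilization time series
    for trace in sorted(traces, key=lambda t: -max(t)):  # place peaky workloads first
        for srv in servers:
            combined = [a + b for a, b in zip(srv, trace)]
            if max(combined) <= capacity:
                srv[:] = combined  # workload fits alongside the existing ones
                break
        else:
            servers.append(list(trace))  # open a new server
    return len(servers)
```

For instance, ten workloads idling at 4% CPU (the average observed in the traces above) pack onto a single server, and two workloads with complementary peaks can share a server even though each would need its own if sized by peak.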


2011: Update Rewriting and Integrity Constraint Maintenance in a Schema Evolution Support System

Supporting legacy applications when the database schema evolves represents a long-standing challenge of practical and theoretical importance. Recent work has produced algorithms and systems that automate the process of data migration and query adaptation; however, the problems of evolving integrity constraints and supporting legacy updates under schema and integrity constraints evolution are significantly more difficult and have thus far remained unsolved. In this paper, we address this issue by introducing a formal evolution model for the database schema structure and its integrity constraints, and use it to derive update mapping techniques akin to the rewriting techniques used for queries. Thus, we (i) propose a new set of Integrity Constraints Modification Operators (ICMOs), (ii) characterize the impact on integrity constraints of structural schema changes, (iii) devise representations that enable the rewriting of updates, and (iv) develop a unified approach for query and update rewriting under constraints. We then describe the efficient implementation of these techniques provided by our PRISM++ system. The effectiveness of PRISM++ and its enabling technology has been verified on a testbed containing the evolution histories of several scientific databases and web information systems, including the Genetic DB Ensembl (410+ schema versions in 9 years), and Wikipedia (240+ schema versions in 6 years).