A Relational rant about HBase/Pig/Hadoop

Disclaimer: my background is heavily skewed in favor of relational databases, and I am fairly new to the entire Hadoop/Pig/HBase world.

For a project I am involved in, I am in the process of setting up a 40-node Hadoop/Pig/HBase cluster. I am learning the tools as I go, and I feel like sharing the pain and excitement of this phase.

While I am 100% sure that someone with more experience in this field would have avoided many of the dumb errors I committed, I am an ideal example of how easy or hard it is to pick up Hadoop/Pig/HBase: zero background in Hadoop/HBase/Pig, strong background in related fields (databases and Java programming).

I am running HBase 0.90.3 on Hadoop 0.20.3 and Pig 0.9.1.

RANT 1: No documentation

My first rant is about the painful lack of documentation, examples, and tutorials. I was expecting to join a “modern” project run by a hip community, so I was expecting amazing tools and great documentation. Instead, the code is mostly barebones, the examples around are few and only partial, and most of the knowledge is trapped in mailing lists and bug trackers. Not a very pleasant experience, especially because the project evolves fast enough that most of what you find is already obsolete.

RANT 2: Classpath, what a nightmare

One of the most horrible things to deal with is, surprise surprise, the classpath! With Pig and Hadoop, the actual active classpath is born as a funky combination of many variables, jars registered from the Pig scripts, and scanning of directories. This would not be a problem in itself. What made my life miserable was the following: you load the wrong .jar file (e.g., a slightly different version of hadoop or hbase or pig or guava or commons etc.) and the error that comes out is not something reasonable like “Wrong class version” but rather various bizarre things like “IndexOutOfBoundsException”, “EOFException”, and many others… Now this is annoying and rather hard to debug (if you don’t expect it), especially when it is some obscure Maven script that is sneaking the wrong jar into a directory you don’t even know exists, much less suspect that every jar in it ends up on the classpath. Another interesting thing I observed is that the Pig 0.9.1 “register” does not always work: sometimes you have to put the jar both in the register statement and in the -cp you pass to pig for it to work. Oh joy…

At least from my experience, having a more systematic way to load and check the jars in the classpath would save lots of time, especially when you first start (I am now very careful to always check the classpath any time there is an error).
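To make "check the jars in the classpath" a bit more concrete, here is a minimal sketch of one possible check: list the classes that are shipped by more than one jar, which is a common symptom of two versions of the same library being loaded together. It is written in Python rather than Java only to keep it short. The function names (expand_classpath, classes_by_jar, conflicts) are made up for this illustration, and treating every directory entry as "all the jars inside it" is an assumption that mimics lib-style directories, not what a plain JVM does with a directory entry.

```python
import os
import sys
import zipfile
from collections import defaultdict


def expand_classpath(classpath):
    """Turn a classpath string into a sorted list of jar files.

    Accepts plain jar entries, directories, and the Java 'dir/*' wildcard.
    Assumption: a directory entry means "every jar in that directory".
    """
    jars = set()
    for entry in classpath.split(os.pathsep):
        if entry.endswith("*"):
            entry = entry[:-1]
        if os.path.isdir(entry):
            for name in os.listdir(entry):
                if name.endswith(".jar"):
                    jars.add(os.path.join(entry, name))
        elif entry.endswith(".jar") and os.path.isfile(entry):
            jars.add(entry)
    return sorted(jars)


def classes_by_jar(jars):
    """Map each .class entry to the list of jars that contain it."""
    owners = defaultdict(list)
    for jar in jars:
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                if name.endswith(".class"):
                    owners[name].append(jar)
    return owners


def conflicts(jars):
    """Return {class entry: [jars]} for classes found in more than one jar."""
    return {cls: found
            for cls, found in classes_by_jar(jars).items()
            if len(found) > 1}


if __name__ == "__main__":
    # Classpath comes from the first argument, else from $CLASSPATH.
    cp = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("CLASSPATH", "")
    groups = defaultdict(int)
    for found in conflicts(expand_classpath(cp)).values():
        groups[tuple(found)] += 1
    # One line per group of jars that overlap, to keep the report short.
    for group, count in sorted(groups.items()):
        print(count, "duplicated classes in:", ", ".join(group))
```

Run against the classpath you believe is active, a report with two differently versioned jars of the same library in one group is a good place to start looking. It only sees the jars it is given, so anything added later by a register statement or a launcher script has to be passed in explicitly.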

RANT 3: Poor error propagation

I run in a very privileged setting: I am root on every box in the cluster, and I debugged many of my errors by looking at the logs of ZooKeeper, the JobTracker, Pig, the HBase master nodes, and whatnot. But the propagation of errors seems rather bad. In a normal “shared grid” environment, where you don’t necessarily have access to all these logs, I would have had an even harder time debugging my code (e.g., the Pig script completes successfully, and the error is lost in the JobTracker log).

RANT 4: HBase is (not) easy to configure/run

I am doing the installation with the precious help of a skilled sysadmin, so I was expecting everything to run rather smoothly. After all, HBase is giving up almost everything I love in life (transactions, secondary indexes, joins) in order to be super-scalable, robust, and zero-admin. Well, not really… Getting HBase to work decently at scale seems to be rather manual work of pampering the thing and convincing it that “it’s ok” to go fast.

I am loading data into HBase via Pig’s HBaseStorage and/or importtsv (after spending enough time to find the combination of versions that work properly together), and I was expecting the thing to scale linearly and trivially (and to load at some 10k rows per box per second). At my first attempt the Pig script was failing because of too many waits on a region or something. And even after I pre-split my table across many regions and generated keys as MD5(id)+id, there is a huge amount of skew (50% of requests hit a single node), but after some more parameter tuning the thing at least completes the loading. I will do more performance debugging in the future (I plan to generate HFiles directly and bulk load).
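For reference, the MD5(id)+id key scheme and the pre-split can be sanity-checked offline before blaming the cluster. The sketch below is a minimal Python illustration under stated assumptions: the digest is written as lowercase hex, the split points are evenly spaced over an 8-hex-digit prefix, and a key that equals a split point belongs to the region starting at that point. None of the names here (salted_key, split_points, region_of, histogram) come from HBase or Pig, and the real setup may encode keys differently.

```python
import bisect
import hashlib
from collections import Counter

# Assumption: split keys are taken from the first 8 hex digits of the digest.
PREFIX_HEX_DIGITS = 8


def salted_key(record_id):
    """Row key of the form MD5(id) + id, with the digest written as hex."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return digest + record_id


def split_points(num_regions):
    """Evenly spaced hex boundaries: num_regions - 1 split keys."""
    space = 16 ** PREFIX_HEX_DIGITS
    return [format(space * i // num_regions, "0%dx" % PREFIX_HEX_DIGITS)
            for i in range(1, num_regions)]


def region_of(key, splits):
    """Index of the region whose [start, end) key range contains the key."""
    return bisect.bisect_right(splits, key)


def histogram(ids, num_regions):
    """Count how many of the given ids would land in each region."""
    splits = split_points(num_regions)
    return Counter(region_of(salted_key(i), splits) for i in ids)


if __name__ == "__main__":
    counts = histogram((str(i) for i in range(100000)), 40)
    for region in sorted(counts):
        print(region, counts[region])
```

If the histogram comes out flat for a sample of the real ids, the key scheme itself is probably not what produces the skew, and the place to look is elsewhere (which this sketch cannot tell you).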

Bottom line: I am ok with giving up all the fancy relational transactional guarantees for a worry-free, super-scalable, zero-admin, self-balancing something… I am just not sure HBase is quite there yet. I feel there is a lot of hype associated with it.

Altogether I am a bit underwhelmed by HBase: I haven’t seen anything I could not do about 5X faster with a set of MySQL boxes… Ok, I know MySQL performance tuning much better than HBase anything… but I will work on tuning the performance more, and at some point I will run some comparisons. I am still skeptical.

PROS:

Now that I have let off some steam with my rants, let me say a few words about what’s nice… What is very nice for me is Pig, because it allows me to write very few lines of code that automagically get distributed and run like crazy. Let me reformulate: it is Pig+Hadoop… fine, it is Pig+Hadoop+HBase. I have to admit that whenever you get the entire software stack to play nice, it is very exciting to be able to hack up some code in a few minutes and have it parallelized across the entire cluster (and even more exciting on our main grid, where you can spawn thousands of mappers in parallel).

Altogether, I am probably spoiled by having worked in the relational world for a long time, where the systems are 30+ years old and thus many of these small issues have been solved (while plenty of worse problems are lurking in the darkness of an unpredictable optimizer or a hard-to-scale data model etc.). But this brings me to the moral of this post:

MORAL

My word of advice to people who plan to start using HBase/Pig/Hadoop… by all means get into it, it is an exciting world, but be ready to deal with a hackish, unpolished, untamed, kicking and screaming system… be ready to open the Java source code of the system to figure out why things are not working, or how to use an API… be ready to have interfaces that are as stable as butterflies… If you are ready for that, you will not be disappointed.