In one of our many research projects here at rev.ng, we are dealing with Big Data (is a 1..10 TB compressed database dump big? Well, probably not, but it is for us).
Our first approach was to extract the data and store it in an SQL database, then run a bunch of queries and finally export the processed tables for other purposes.
See the problem there? We were using the database as a, err... data processing tool?
Unfortunately, this wasn't working very well: we hit all kinds of performance bottlenecks, since our workload consisted almost entirely of bulk inserts and bulk selects.
We then thought of using Spark or some other fancy stuff like that in order to stream process everything and just use text files.
But, you know, we are a binary analysis company, so most of the people here don't like garbage collectors (except me, the author of this blog post, who likes them very much).
Anyway, we went from using MySQL to MongoDB+MySQL to MongoDB+PostgreSQL to, you guessed it, text files + good ol' Bash.
In this article I will try to persuade you, CEO of a brand-new Spark'ing startup, that sometimes Bash'ing is all you need.
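To make the thesis concrete, here is a minimal sketch of the kind of pipeline we mean: joining two tab-separated text dumps with nothing but coreutils, streaming, no database in sight. The file names and columns are invented for illustration, not taken from our actual dataset:

```shell
#!/bin/bash
set -euo pipefail

# Two hypothetical tab-separated dumps:
#   functions.tsv: function name, hash
#   hashes.tsv:    hash, library it came from
printf 'f1\thash_a\nf2\thash_b\n' > functions.tsv
printf 'hash_a\tlibfoo\nhash_b\tlibbar\n' > hashes.tsv

# join(1) requires its inputs to be sorted on the join field.
sort -t$'\t' -k2,2 functions.tsv > functions.sorted.tsv
sort -t$'\t' -k1,1 hashes.tsv    > hashes.sorted.tsv

# Join on the hash column: field 2 of the first file, field 1 of the second.
# Output: hash, function name, library.
join -t$'\t' -1 2 -2 1 functions.sorted.tsv hashes.sorted.tsv > joined.tsv

cat joined.tsv
```

Nothing here ever loads the whole dataset into memory: `sort` spills to temporary files when needed, and `join` streams both inputs line by line, which is exactly why this scales to dumps that choke a bulk-insert workflow.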
Note: If you haven't checked out our Big Match post, go read it.