In one of our many research projects here at, we deal with Big Data (is a 1..10 TB compressed database dump big? Well, probably not, but it is for us). Our first approach was to extract the data into an SQL database, run a bunch of queries, and finally export the processed tables for other purposes. See the problem there? We were using the database as a, err... data processing tool. Unfortunately this didn't work very well: we hit all kinds of performance bottlenecks, since everything boiled down to bulk inserts followed by bulk selects.

We then thought of using Spark or some other fancy stuff like that to stream-process everything and just use text files. But, you know, we are a binary analysis company, so most people here don't like garbage collectors (except me, the author of this blog post, who likes them very much). Anyway, we went from MySQL to MongoDB+MySQL to MongoDB+PostgreSQL to, you guessed it, text files + good ol' Bash.

In this article I will try to persuade you, CEO of a brand-new Spark'ing startup, that sometimes Bash'ing is all you need.
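To give a flavor of what "text files + Bash" means in practice, here is a minimal sketch of the kind of streaming pipeline we have in mind. The file name and column layout are hypothetical, purely for illustration: a gzipped TSV standing in for a compressed dump, from which we count the most frequent values of one column without any database in the loop.

```shell
#!/bin/sh
# Hypothetical stand-in for a compressed dump: a tiny gzipped TSV
# with two columns (id, symbol).
printf 'a\tfoo\nb\tbar\nc\tfoo\n' | gzip > dump.tsv.gz

# Stream-process it: decompress, extract column 2, count distinct
# values, and sort by frequency. No bulk inserts, no bulk selects.
zcat dump.tsv.gz | cut -f2 | sort | uniq -c | sort -rn
```

Every stage runs concurrently and in constant memory (except `sort`, which spills to disk on large inputs), which is exactly why this scales to inputs that choked our bulk-insert approach.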

Note: If you haven't checked out our Big Match post, go read it.

