
bashML: Why Spark when you can Bash?

In one of our many research projects here at rev.ng, we are dealing with Big Data (is a 1..10 TB compressed database dump big? Well, probably not, but it is for us). Our first approach was to extract the data and store it in an SQL database, then run a bunch of queries and finally export the processed tables for other purposes. See the problem there? We used to use the database just like a, err... data processing tool? Unfortunately this wasn't working very well: we were having all kinds of performance bottlenecks since we were doing bulk inserts and bulk selects.

We then thought of using Spark or some other fancy stuff like that in order to stream process everything and just use text files. But, you know, we are a binary analysis company, so most of the people here don't like garbage collectors (except me, the author of this blogpost, who likes them very much). Anyway, we went from using MySQL to MongoDB+MySQL to MongoDB+PostgreSQL to, you guessed it, text files + good ol' Bash.

In this article I will persuade you, CEO of a brand-new Spark'ing startup, that sometimes Bash'ing is all you need.


Note: If you haven't checked out our Big Match post, go read it.

An example dataset

For this tutorial we will be using a simplified version of our internal Big Match dataset of GitHub repositories. It consists of a set of text files, each containing a newline-separated list of sha256 hashes of the strings found in a repo.

So, for example, the file torvalds,linux.hashes.txt contains the hashes of all the strings found in the Linux kernel. The hashes are also sorted for ease of use. Note that we use commas in place of '/' as separators between the username and the repo name: in this case, torvalds,linux.hashes.txt refers to the GitHub repo torvalds/linux.
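If you ever need to go back from a file name to a repo name, a tiny helper like this does the trick (repo_name is not part of the dataset tooling, just a hypothetical illustration):

# Hypothetical helper: recover the GitHub repo name from a dataset file name
repo_name() {
    basename "$1" .hashes.txt | tr ',' '/'
}
repo_name ./dataset/torvalds,linux.hashes.txt
torvalds/linux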

Let's take a quick look at one of our files:

head -n 5 ./dataset/torvalds,linux.hashes.txt
0000130323884123bd36b3460e2311191fb0663dc7765e2781b62e1bb4fb1694
000020f8aa6016e534263f726778ea7ed6f8bdc6eaad4db703200b37ae6cf00b
000050a6f4869b1ccb3dc2f76f857561d3bae7c1d01e7153c1a6ef543abbd3ff
000052d246cfb78ed0a80bd74071664dc6cb76e3b5586dfed18d8613251fdeba
00006711c3893c6716caeb147e8894ed5bd9a02d1b8743ac5b207ffcf4508494
wc -l ./dataset/torvalds,linux.hashes.txt
540297 ./dataset/torvalds,linux.hashes.txt

But before delving into the fun part, let's fix locale issues between tools. Trust me, you do want to do this, otherwise you will get all sorts of errors when piping together different commands (e.g., sort and join don't like each other very much).

export LC_ALL=C

Building blocks

Bash tools have all kinds of different options and there seems to be little to no consistency among them. We all know that: I know it, you know it, everybody knows it. So before starting we are going to create some helper functions for the most useful bashML operations out there.

We will start with a haskell-inspired helper, fst:

# a b => a
fst() {
    cut -d ' ' -f 1 "$@"
}

It's pretty simple: it takes the first column of a given file. By using $@, we can either give it a file argument or use it without arguments (think pipes) and it will Just Work™:

echo -e 'ciao mondo\nhello world' | fst
ciao
hello
fst <(echo -e 'ciao mondo\nhello world')
ciao
hello

If you are wondering what bash does on <(...), look no further:

echo <(ls)
/dev/fd/63

Long story short, bash replaces the argument with a file descriptor that's connected to the output of the command inside <(...). In fact, you can also cat it:

cat <(echo Hi)
Hi

With that out of the way, let's create some additional helpers:

# Like `fst`, but returns the second column:
# a b => b
snd() {
    cut -d ' ' -f 2 "$@"
}

# Same as `snd`, but can handle multi-character separators:
# a     b => b
snd_awk() {
    awk '{ print $2 }' "$@"
}

# Turn whitespaces into newlines:
# a b
# c
# =>
# a
# b
# c
flat() {
    # `tr` only reads from stdin, so we `cat` any file arguments into it
    cat "$@" | tr ' ' '\n'
}

# Similar to python `enumerate()`:
# a
# b
# =>
# 0 a
# 1 b
enumerate() {
    nl -v0 -w1 -s' ' "$@"
}

# Swap the first two columns in a file (and ignore the other columns):
# a b => b a
swap() {
    awk '{ print $2 " " $1 }' "$@"
}

For the few of you unfamiliar with tr, it's a nice little utility that can translate single characters (i.e., map one character to another) or delete characters from a stream.
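For instance, here is tr translating and then deleting characters on a toy input (nothing dataset-specific):

echo 'ciao mondo' | tr 'o' '0'
cia0 m0nd0
echo 'ciao mondo' | tr -d ' '
ciaomondo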

Another nice utility is nl, which numbers the lines of its input. Don't worry too much about the options: they are just there to produce the format we want (start counting from 0, minimal width, a single space as separator).
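To see what those options do (this is exactly what our enumerate helper runs under the hood), try it on a toy input:

echo -e 'a\nb\nc' | nl -v0 -w1 -s' '
0 a
1 b
2 c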

Example pipelines

It's now time to test our crazy bash skills to actually do something useful.

Repository similarity

Let's say we want to query our repos to find Linux forks just using common strings. We can use a scoring formula called Jaccard index (also called Jaccard similarity):

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

So, if A and B are our repositories, J(A, B) is a measure of how many strings they share, divided by the total number of distinct strings across both. How can you do this in bash? Using comm and some basic awk.

If you are unfamiliar with comm, it's a simple command that is able to compare sorted files line by line. If you run it without any specific option it will output 3 columns formatted like this:

<lines unique to file 1> <lines unique to file 2> <common lines>

Remember to sort its input files, otherwise comm will complain!
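If your files aren't sorted yet, process substitution (which we met earlier) lets you sort them on the fly; the file names here are just placeholders:

comm <(sort ./tmp/unsorted_a.txt) <(sort ./tmp/unsorted_b.txt)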

Anyway, if you run it with --total, it will also write a final line with the total count of lines for each column:

mkdir -p ./tmp/

# a.txt
cat > ./tmp/a.txt <<EOF
a
b
c
EOF

# b.txt
cat > ./tmp/b.txt <<EOF
c
d
e
EOF

# Compare the lines in a.txt and b.txt (they are already sorted!)
comm --total --check-order ./tmp/a.txt ./tmp/b.txt
a
b
		c
	d
	e
2	2	1	total

See that last line? It's the count we were talking about. We can throw away everything except that last line, do some simple math in awk, and get the Jaccard index of the input files:

jaccard() {
    # We don't need to sort because our input files are already sorted
    comm --total --check-order "$@" | tail -n 1 | awk '{ print ($3 / ($3 + $2 + $1)) }'
}
jaccard ./tmp/a.txt ./tmp/b.txt
0.2

Same thing, but with a different file:

# c.txt
cat > ./tmp/c.txt <<EOF
b
c
d
EOF

jaccard ./tmp/a.txt ./tmp/c.txt
0.5

In these examples, the first two files have only one line in common out of 5 unique lines, so J(A, B) = \frac{1}{5} = 0.2. In the second example, a.txt and c.txt share 2 lines out of 4 unique lines, so the index is \frac{2}{4} = 0.5.

Yes, this is what data scientists get paid to do by the way.

Now let's say we want to find the Jaccard similarity between Linux and the biggest repos in our dataset. We need to start by getting the list of such big files:

# Get list of (size, filename)
biggest_files() {
    find ./dataset/ -name "*.hashes.txt" -print0 \
    | xargs -0 du -a \
    | sort -t ' ' -nr \
    | head -n "$@"
}

echo '== Biggest files =='
biggest_files 5
echo '...'
== Biggest files ==
103288	./dataset/mirror,dd-wrt.hashes.txt
66824	./dataset/kishikawakatsumi,Mozc-for-iOS.hashes.txt
57564	./dataset/mirek190,x86-android-5.0.hashes.txt
51008	./dataset/CyberGrandChallenge,samples.hashes.txt
47068	./dataset/AndreyPopovNew,asuswrt-merlin-rt-n.hashes.txt
...

We now want to use the same jaccard() function we defined above, but run it faster by exploiting parallel processing. We can do that with GNU parallel.

GNU parallel has a metric ton of options. Yes, that is 1000-kilograms-many options. Unless you voted for Brexit or play football with your hands: in that case it's ~2205 pounds. Since we are Europeans and are not here to start a debate about our (superior) measurement system, we will just use 2 options and ignore the others:

cat ./tmp/a.txt | parallel -j$(nproc) -k -- echo {}
a
b
c

What's happening here? Quite a bit actually: parallel just ran echo {} 3 times, every time substituting {} with a line from the input. Remember: the calls to echo happen in parallel using multiple processes!

GNU parallel allows you to use {} multiple times if you want:

cat ./tmp/a.txt | parallel -j$(nproc) -k -- echo {} {} {}
a a a
b b b
c c c

The options we give to parallel are very important: -j$(nproc) uses as many processes as there are processing units on the system (as reported by the nproc command), while -k tells parallel to honor the input order when printing the output (so we don't get, e.g., c c c then b b b then a a a).
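As an aside, xargs can also run jobs in parallel with -P, but it has no equivalent of -k, so the output order is not guaranteed; a rough equivalent of the command above would be:

cat ./tmp/a.txt | xargs -P$(nproc) -I{} echo {}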

We can now use our previous jaccard function inside parallel to get the similarity between Linux and the other repos. However, we also want a nice output so we throw in a call to printf. The command gets a little messy due to string formatting, but just focus on the jaccard part:

similarity() {
    # Export the `jaccard` function so that it's available to `parallel`
    export -f jaccard

    echo -e '\n== Similarity to torvalds/linux =='

    # Find most linux-like repos
    biggest_files 30 \
        | snd_awk \
        | parallel -j$(nproc) -k -- printf \'%0.5f %s\\n\' '$(jaccard ./dataset/torvalds,linux.hashes.txt {})' '{}'
}
similarity
== Similarity to torvalds/linux ==
0.27157 ./dataset/mirror,dd-wrt.hashes.txt
0.00068 ./dataset/kishikawakatsumi,Mozc-for-iOS.hashes.txt
0.22947 ./dataset/mirek190,x86-android-5.0.hashes.txt
0.00447 ./dataset/CyberGrandChallenge,samples.hashes.txt
0.16675 ./dataset/AndreyPopovNew,asuswrt-merlin-rt-n.hashes.txt
0.21503 ./dataset/ChrisP-Android,BananaPi-Android-4.2.2-Liab.hashes.txt
0.01760 ./dataset/scs,uclinux.hashes.txt
0.22131 ./dataset/xdtianyu,android_04.01.01_msm7627a.hashes.txt
0.03125 ./dataset/IIJ-NetBSD,netbsd-src.hashes.txt
0.17406 ./dataset/Mazout360,asuswrt-maz.hashes.txt
0.14198 ./dataset/labx-technologies-llc,mb-linux-labx.hashes.txt
0.03072 ./dataset/freebsd,freebsd.hashes.txt
0.03224 ./dataset/opnsense,src.hashes.txt
0.02743 ./dataset/Stichting-MINIX-Research-Foundation,netbsd.hashes.txt
0.25569 ./dataset/andy-padavan,rt-n56u.hashes.txt
0.25605 ./dataset/moonman,rt-n56u.hashes.txt
0.03061 ./dataset/CTSRD-CHERI,cheribsd.hashes.txt
0.64101 ./dataset/sonyxperiadev,kernel.hashes.txt
0.39307 ./dataset/beastix,beastix.hashes.txt
0.99817 ./dataset/srikard,linux.hashes.txt
0.99851 ./dataset/RobertCNelson,linux-stable-rcn-ee.hashes.txt
0.99705 ./dataset/lzto,linux.hashes.txt
0.99705 ./dataset/lovewilliam,linux.hashes.txt
0.99703 ./dataset/linuxppc,linux.hashes.txt
1.00000 ./dataset/torvalds,linux.hashes.txt
0.99978 ./dataset/mirrors,linux.hashes.txt
0.99999 ./dataset/mirrors,linux-2.6.hashes.txt
0.98244 ./dataset/android,kernel_common.hashes.txt
0.74318 ./dataset/kholk,kernel.hashes.txt
0.98396 ./dataset/tiwai,sound.hashes.txt

We can see that repos that are just forks of the Linux kernel get almost a perfect score. Then we have some repos that include the linux kernel but also have other stuff in them. Finally, we have some random repos like kishikawakatsumi/Mozc-for-iOS that have nothing to do with linux.

With this information we can, e.g., do a simple similarity-based repository deduplication. In fact, at rev.ng we used this technique in one of our projects.
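As a minimal sketch of what such a deduplication could look like (the 0.95 threshold is arbitrary, and the awk filter assumes the two-column output of the similarity function above), we could flag near-duplicates of torvalds/linux like this:

# Hypothetical: print repos that are near-duplicates of torvalds/linux
similarity | awk 'NF == 2 && $1 + 0 > 0.95 { print $2 }'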

Repository vector encoding

What if we want to use some fancy deep learning algorithms on our repos downstream? It would be convenient to have them represented as sparse vectors.

In other words, we want this

repoA.txt
XXX
YYY
ZZZ

repoB.txt
XXX
ZZZ

to turn into this

repoA 1 2 3
repoB 1 3

Where the numbers are just unique IDs for XXX, YYY and ZZZ.

The first thing we need to do is to get the complete set of hashes of strings in our repos. That's easy enough:

# You'll see some `2>/dev/null` around, don't worry about them
all_hashes() {
    find ./dataset/ -name "*hashes.txt" -print0 \
    | xargs -0 cat \
    | sort -u 2>/dev/null
}

all_hashes | head -n 5
0000034e7c281278a72c526d845adf0987b0b8b95e58313aa4b40d70f34803d4
00000519febcda5e69c89268bdfcb81aa834a283cc5a78c9686e32a230e9ae9c
0000061336096c960e8ef4e6aa7b5bc54632f31c36ab951be63b3965666f0e84
000010be716d68cf986cdbbeca26a87927b0807a00a0e0e92f9a9a8a9adfe937
000010e36024656201b4a433629bf9867e91a9b1a77c217e2ce93e18ffd829dc

We then want to assign a unique number to each hash. This is how we can do it:

to_id() {
    enumerate | swap
}

echo -e 'XXX\nYYY\nZZZ' | to_id
XXX 0
YYY 1
ZZZ 2

Let's generate the full (hash, hash_id) map:

mkdir -p ./workdir
all_hashes | to_id > ./workdir/hash-to-id.txt
head -n 5 ./workdir/hash-to-id.txt
0000034e7c281278a72c526d845adf0987b0b8b95e58313aa4b40d70f34803d4 0
00000519febcda5e69c89268bdfcb81aa834a283cc5a78c9686e32a230e9ae9c 1
0000061336096c960e8ef4e6aa7b5bc54632f31c36ab951be63b3965666f0e84 2
000010be716d68cf986cdbbeca26a87927b0807a00a0e0e92f9a9a8a9adfe937 3
000010e36024656201b4a433629bf9867e91a9b1a77c217e2ce93e18ffd829dc 4

We now need to apply this mapping to every file in the dataset. For the sake of simplicity, we are going to focus on a single synthetic file called small/repo, which we created by taking a few string hashes from the linux kernel. How do we use the mapping in hash-to-id.txt to get our vectorized version of the repo?

Enter the join command.

join is another nice Unix utility and it does exactly what it says: an SQL-like join. Let's see an example:

# join.a.txt
cat > ./tmp/join.a.txt <<EOF
a A
b B
c C
EOF

# join.b.txt
cat > ./tmp/join.b.txt <<EOF
a
c
d
EOF

join ./tmp/join.a.txt ./tmp/join.b.txt
a A
c C

This is your usual INNER JOIN, but in bash (note that, like comm, join expects its inputs to be sorted on the join field). Using a similar command we can get our vectorized small/repo:

join ./dataset/small,repo.txt ./workdir/hash-to-id.txt
0000130323884123bd36b3460e2311191fb0663dc7765e2781b62e1bb4fb1694 6
000020f8aa6016e534263f726778ea7ed6f8bdc6eaad4db703200b37ae6cf00b 8
000050a6f4869b1ccb3dc2f76f857561d3bae7c1d01e7153c1a6ef543abbd3ff 25
000052d246cfb78ed0a80bd74071664dc6cb76e3b5586dfed18d8613251fdeba 28
00006711c3893c6716caeb147e8894ed5bd9a02d1b8743ac5b207ffcf4508494 38
0000b59a1fc1ad8cb80d9bb5f1ced09d1cc2dc0fcb4dffd6d052b53a41cbadd0 58
0000be5ac5f5b35c96bf00b2f316c6b8117060bdbb70739d6c9291bd6c14c750 64
0000d5532ff89dfbc379fa46bd96437bc75b8657f52e58b4a9ccf25ed5dfe1c2 69
00014a214af48578cc89734b0b347a1d30ec5678ca30b8721aedcde420829c6b 112
00015270780aa670736e5d8c830fadc5da8a8ea4ba00d938c9a2d710c5ffe2e6 118
00015d5df162d2f56901adab0fad1a6e68273c3dae4a7b6fa427ba0f09a76152 127
00019347bddcfe0cd6b54f6751d9928518ad1acff8c8489f14fb834da3795f64 143
0001b01cb2ac3d1d56426eaec1330aae75ba6c39a36e6ae3bace485187c75346 157
0001b55ac99431620a9fd98ba395491b095d9cf5ef7bcd9aaacd87caf00df943 159
0001cb7944b04d2de5c0369dfd1e901cc05b35cfa16679f262c705cf9219c600 167

We now strip away the hashes to keep only the indices:

echo -n 'small,repo '
join ./dataset/small,repo.txt ./workdir/hash-to-id.txt \
| snd \
| tr '\n' ' '
small,repo 6 8 25 28 38 58 64 69 112 118 127 143 157 159 167

And there it is. It's trivial to load that up in a scipy sparse matrix, or similar, and do some proper machine learning using a proper garbage-collected language.
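To apply the same transformation to the whole dataset, one possible sketch reuses the snd helper and parallel; note that vectorize and the embeddings.txt output path are hypothetical names, not part of the dataset tooling:

# Hypothetical: write one embedding line per repo to ./workdir/embeddings.txt
vectorize() {
    printf '%s ' "$(basename "$1" .hashes.txt)"
    join "$1" ./workdir/hash-to-id.txt | snd | tr '\n' ' '
    echo
}
export -f vectorize snd

find ./dataset/ -name '*.hashes.txt' -print0 \
| parallel -0 -j$(nproc) -k -- vectorize {} \
> ./workdir/embeddings.txt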

Bonus point: bash visualization

Ok, CEO of a brand-new Spark'ing startup, you are ready to jump on the bashML bandwagon after what I've shown you, but you still feel like you'd be missing all of your web-based scheduling dashboards?

We've got you covered.

At rev.ng we developed a simple tool that uses strace to track the flow of data between different processes: pipedream. It works by tracking file-related syscalls, along with pipes, and then trying to match I/O flows using file descriptors. It's not perfect, but it works in most cases.

Let's put everything we've seen in the last part of this blogpost in a single shell file and use pipedream to get a nice pipeline visualization:

cat > ./workdir/everything.sh <<EOF
#!/bin/sh

snd() {
    cut -d ' ' -f 2 "\$@"
}

echo 'small,repo '

join ./dataset/small,repo.txt ./workdir/hash-to-id.txt \
| snd \
| tr '\n' ' ' \
> ./workdir/small,repo.embedding.txt
EOF

chmod +x ./workdir/everything.sh

# Trace and convert to dotfile
pipedream trace ./workdir/everything.sh  2>/dev/null \
| pipedream print --exclude '^/' --include '(workdir|dataset)' \
> ./workdir/out.dot

# Call graphviz
dot -Tsvg -o ./workdir/out.svg ./workdir/out.dot
dot -Tpng -o ./workdir/out.png ./workdir/out.dot
dot -Tpdf -o ./workdir/out.pdf ./workdir/out.dot

And this is the pipedream we get:

Some considerations

Before leaving you to your very important CEO duties, I'm going to gift you some totally unbiased considerations about bashML.

Pros:

  • This is the Unix way (in fact, we just chained several programs from coreutils)
  • Pipe'ing is efficient enough for a lot of use-cases
  • Pipes don't require much memory
  • Pipes define a directed graph (even though it's better to save intermediate results if you use them more than once)
    • Scheduling is built-in
  • Easy parallelization
    • Commands are independent from one another. When enough data is available to read from a pipe, they just run free.
    • parallel
    • xargs
    • sort

Most importantly: no garbage collector.

Cons:

  • Limited in what you can do (e.g.: you wouldn't make a neural network in bash)
  • Limited in the kind of data you can handle
    • Think quoted columns in a CSV file (unless you use some workaround like csvtool)
  • No types
  • Doing things in RAM is faster (if you are doing complex operations)
  • Doesn't scale well to multiple machines
  • Things can get pretty messy pretty fast
  • Sad data scientists

Conclusion

There, you heard it here, folks: you don't need Spark anymore. You hold the power of bash now. It has worked for humanity for 30+ years and it's here to stay. The next time you need to process some data, ask yourself: why Spark when you can Bash?

On a more serious note, at rev.ng we actually ended up using this Unix-style approach because it is a sweet spot for our use case. Databases were way too slow, and setting up a compute cluster seemed overkill for the amount of data we are working with, especially since we can actually do everything in bash on a single machine.

Now turn off your computer, go to sleep, and tell your friends about how bash people were doing machine learning before it was cool.

Cheers,

babush