We recently discussed filters in bash scripting. In this guide, we look at the practical use of a filter program, taking grep as the example.
grep, in its simplest form, is a filter program that displays the lines of text from its input that contain a certain pattern.
It reads its input (standard input or one or more files), tests each line against the regular expression supplied by the user, and prints every matching line to standard output (the screen by default).
Looked at from another perspective, grep filters out the lines of its input that do not contain the pattern you supplied.
Its usage is as follows:
grep pattern [files or input]
As you can see, we start with the program name "grep", followed by the pattern we wish to search for, and then the file (or files) to search. If no file is given, grep reads standard input, which can be piped from another program.
For example, to look for "classicpress" in all the files whose names end in .doc (grep is case-sensitive by default, so this will not match "Classicpress"; add the -i option to ignore case), we do:
grep classicpress *.doc
Alternatively, you can take the output of a program, e.g. "who", and pipe it to grep to search for whatever username you are looking for, e.g.:
devsrealm@blog:~$ who | grep devsrealm
devsrealm pts/0 2020-09-20 00:33 (192.168.43.205)
In the above example, I piped the output of who into grep and searched for "devsrealm"; grep then shows every line where devsrealm is logged in.
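To see that grep behaves the same whether it reads a file or a pipe, here is a small sketch. The sample lines and the user names (alice, bob) are made up for illustration; they are not real who output.

```shell
# Hypothetical sample data standing in for the output of a program.
sample=$(mktemp)
printf '%s\n' 'alice pts/0' 'bob pts/1' 'alice pts/2' > "$sample"

# Search a file named on the command line.
from_file=$(grep alice "$sample")

# Search standard input fed through a pipe.
from_pipe=$(cat "$sample" | grep alice)

printf '%s\n' "$from_file"   # prints the two alice lines
rm -f "$sample"
```

Both variables end up holding the same two lines, since grep only cares about the text it receives, not where it came from.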
Another useful feature of grep is the ability to count the matching lines instead of displaying them. With the -c option, grep prints only the number of lines that matched. For example, if the word "devsrealmer" appears on 5 lines of a data.txt file, you can count them as follows:
user@blog:~$ grep -c devsrealmer data.txt
5
This is just the tip of the iceberg; the real power of grep comes from its use of regular expressions, so...
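One detail worth checking for yourself is that -c counts matching lines, not individual occurrences. A minimal sketch, using a throwaway file with made-up contents:

```shell
# Hypothetical data: the word appears three times, but on only two lines.
data=$(mktemp)
printf '%s\n' \
  'devsrealmer one' \
  'nothing here' \
  'devsrealmer and devsrealmer again' > "$data"

count=$(grep -c devsrealmer "$data")
echo "$count"   # prints 2
rm -f "$data"
```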
What is a Regular Expression?
The name grep comes from the ed editor command g/re/p (globally search for a regular expression and print the matching lines), and it makes heavy use of regular expressions. grep is not the only program that uses them; they are used in many other *nix programs, which is a guide for another day.
An understanding of regular expressions adds a powerful tool to your toolbox, but what is a regular expression by itself?
A regular expression is a pattern, written with a set of special characters, that describes a set of strings; you use it to match character combinations in text. For example:
^
- The caret matches the beginning of a line, so ^abc matches any line of text that begins with abc
An example:
root@blog:~# lastlog | grep '^sys'
sys **Never logged in**
systemd-network **Never logged in**
systemd-resolve **Never logged in**
syslog **Never logged in**
In the above example, I piped the output of lastlog into grep and searched for the lines that begin with sys. If you have a file you want to search, you can do something as follows:
grep '^abc' file
I surrounded the caret and the text I want to search for (abc) with single quotes. This is optional here, but as soon as the pattern contains a space or a character that is special to the shell (such as *, $ or |), you need to quote it. Single quotes are the safest choice because the shell passes everything inside them to grep unchanged.
To search for "abc" anywhere in the file, you do:
grep abc file
But as soon as you add the caret, you restrict the match to lines where "abc" appears at the very beginning.
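The difference between the anchored and unanchored pattern can be sketched with a throwaway file (the three lines below are invented for the demonstration):

```shell
# Hypothetical file with "abc" at the start, middle, and end of a line.
file=$(mktemp)
printf '%s\n' \
  'abc at the start' \
  'has abc in the middle' \
  'ends with abc' > "$file"

anywhere=$(grep -c 'abc' "$file")   # counts every line containing abc
anchored=$(grep '^abc' "$file")     # keeps only lines that begin with abc

echo "$anywhere"   # prints 3
echo "$anchored"   # prints: abc at the start
rm -f "$file"
```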
Let's see another example:
abc$
- This matches any line of text that ends with abc, so it is the counterpart of ^abc
To find the users whose login shell is /bin/bash on my server, I do:
root@blog:~# grep '/bin/bash$' /etc/passwd
root:x:0:0:root:/root:/bin/bash
thisisme:x:1002:1003:Mr. Chicken,,,:/home/thisisme:/bin/bash
dumm:x:1004:1006:,,,:/home/dumm:/bin/bash
james:x:1005:1007:,,,:/home/james:/bin/bash
I was able to get that output because the login shell is the last field of each line in /etc/passwd, so /bin/bash sits at the end of the line.
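To try this without touching the real /etc/passwd, here is a sketch on a small file in the same format. The entries are invented, and the "odd" user with a home directory of /bin/bash is deliberately contrived to show why the $ anchor matters.

```shell
# Hypothetical passwd-style entries; "odd" has /bin/bash as its home
# directory, not as its login shell.
passwd_sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'odd:x:1001:1001::/bin/bash:/usr/sbin/nologin' > "$passwd_sample"

unanchored=$(grep -c '/bin/bash' "$passwd_sample")   # matches root and odd
bash_users=$(grep '/bin/bash$' "$passwd_sample")     # matches root only

echo "$unanchored"   # prints 2
echo "$bash_users"   # prints: root:x:0:0:root:/root:/bin/bash
rm -f "$passwd_sample"
```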
What if we combine both together:
^$
- This matches a line where the beginning is immediately followed by the end, with nothing in between; in other words, a blank line.
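A common use of ^$ is counting or stripping blank lines. The sketch below assumes a small made-up file and combines ^$ with the -c and -v options of grep:

```shell
# Hypothetical file: three text lines separated by three blank lines.
file=$(mktemp)
printf 'first\n\nsecond\n\n\nthird\n' > "$file"

blank=$(grep -c '^$' "$file")      # count the blank lines
nonblank=$(grep -v '^$' "$file")   # keep everything except blank lines

echo "$blank"                      # prints 3
printf '%s\n' "$nonblank"          # prints first, second, third on separate lines
rm -f "$file"
```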
a*
- This matches any sequence of zero or more a's
For example, if we do this:
grep 'ca*r' file
In the above example, I am searching for "cr", "car", "caaaaar", or "caaaaaaaaaaar". The asterisk means zero or more of the character preceding it, so the pattern matches whether there is no "a" at all between the "c" and the "r" or thousands of them.
If you want to match at least one "a", use a plus (+) instead of the asterisk. However, plus is not part of basic regular expressions; it belongs to extended regular expressions, which you enable by passing the -E option to grep, e.g.:
grep -E 'ca+t' file
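To compare the asterisk and the plus side by side, here is a small sketch on an invented word list (the same c...r pattern is used for both so that only the quantifier differs):

```shell
# Hypothetical word list.
words=$(mktemp)
printf '%s\n' 'cr' 'car' 'caaar' 'cor' > "$words"

star=$(grep 'ca*r' "$words")      # zero or more a's: cr, car, caaar
plus=$(grep -E 'ca+r' "$words")   # one or more a's: car, caaar

printf '%s\n' "$star"
printf '%s\n' "$plus"
rm -f "$words"
```

Note that "cr" shows up only in the first result, and "cor" in neither.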
What if we have the following:
b[ieo]d
- This matches bid, bed, or bod. Whatever is enclosed in the square brackets matches exactly one character from that set, so the pattern is "b", then one of "i", "e" or "o", then "d". It won't match anything else; e.g., it won't match bieod.
It doesn't stop there; you can combine it with the asterisk, e.g.:
grep 'b[ieo]*d' file
The above matches the letter b, followed by zero or more of i, e, or o in any mix, followed by d, e.g.:
- bd
- biiiiiiid
- beeed
- boooooooooood
- biiieeeeeooood
If you wonder why it matches "bd", it is because the asterisk matches zero or more occurrences of the item preceding it, and here that item is the whole bracket expression.
Another common regular expression character is the dot (.), e.g.:
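The bracket expression with and without the asterisk can be compared on a few invented strings:

```shell
# Hypothetical test strings, one per line.
file=$(mktemp)
printf '%s\n' 'bd' 'bid' 'biiieeeood' 'bad' 'bieo' > "$file"

single=$(grep 'b[ieo]d' "$file")    # exactly one of i, e, o between b and d
many=$(grep 'b[ieo]*d' "$file")     # zero or more of i, e, o between b and d

printf '%s\n' "$single"   # prints: bid
printf '%s\n' "$many"     # prints bd, bid, biiieeeood on separate lines
rm -f "$file"
```

"bad" and "bieo" match neither pattern: the first has a character outside the set, the second has no d.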
b.d
matches a "b", followed by any one character, followed by "d", so, it could match:
- bad
- bbd
- bcd
- bdd
- bed
- b?d
- b#d
It matches any single character; it doesn't have to be a letter.
Of course, you can combine it with other operators, e.g.:
grep 'b.*d' file
This matches "b", followed by zero or more of any characters (they do not have to be the same character), followed by "d". You get:
- bed
- beeeeeeed
- b??????d
- b00000000d
- b5d
You get the idea...
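Here is a quick sketch of the dot on its own versus the dot followed by an asterisk, run against a handful of made-up strings:

```shell
# Hypothetical test strings, one per line.
file=$(mktemp)
printf '%s\n' 'bd' 'bed' 'b#d' 'beard' 'dab' > "$file"

dot=$(grep 'b.d' "$file")        # exactly one character between b and d
dotstar=$(grep 'b.*d' "$file")   # any run of characters, including none

printf '%s\n' "$dot"       # prints bed and b#d
printf '%s\n' "$dotstar"   # prints bd, bed, b#d, beard
rm -f "$file"
```

"beard" matches only the second pattern because the characters between b and d are different from one another, and "dab" matches neither because no d follows its b.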
Let's take our last example:
B[a-zA-Z0-9]*B
- This matches a capital B, followed by zero or more letters or digits (lowercase a to z, uppercase A to Z, or 0 to 9), followed by another capital B.
Another fun thing we can do with regular expressions is match a fixed number of repetitions. We've seen repetition in the examples above, but there is another way to express it. For example, if you have a dictionary file (one is located at /usr/share/dict/words) and you are asked to return the words with four consecutive vowels, how do you do that?
Here is a way to do it with grep and a regular expression:
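The range pattern can be tried out on a few invented strings. This sketch assumes ordinary ASCII input:

```shell
# Hypothetical test strings, one per line.
file=$(mktemp)
printf '%s\n' 'BB' 'Bx9B' 'B-B' 'Bob' > "$file"

between=$(grep 'B[a-zA-Z0-9]*B' "$file")

printf '%s\n' "$between"   # prints BB and Bx9B
rm -f "$file"
```

"B-B" is rejected because the hyphen is not in the bracket expression, and "Bob" because its final letter is a lowercase b.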
grep -E '[aeiou]{4}' /usr/share/dict/words
First, we have the square brackets containing the vowels we want to match, followed by curly braces, which act as a repetition count. So {4} indicates that we are looking for four vowels in a row. Here is the output:
Hawaiian
Hawaiian's
Iroquoian
Kauai
Kauai's
Kilauea
Kilauea's
Louie
Louie's
Montesquieu
Montesquieu's
Quaoar
Quaoar's
Rouault
Rouault's
aqueous
gooier
gooiest
obsequious
obsequiously
obsequiousness
obsequiousness's
onomatopoeia
onomatopoeia's
queue
queue's
...
You can see how it returned the words that contain four consecutive vowels. If you want to eliminate the 's at the end of the words, you can do the following:
grep -E '[aeiou]{4}' /usr/share/dict/words | grep -v "'s"
I piped the output of the first grep to grep -v, which inverts the match and so filters out every line containing 's
That is all for now. You can check the man page for more, but the patterns above already cover a wide array of use cases.
To conclude this guide, here are some of the programs that use regular expressions:
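Not every system ships /usr/share/dict/words, so here is the same pipeline sketched against a tiny hand-made word list (the five words are chosen for illustration):

```shell
# Hypothetical mini dictionary.
words=$(mktemp)
printf '%s\n' 'Hawaiian' 'audio' 'aqueous' 'queue' "queue's" > "$words"

# Words containing four vowels in a row.
four=$(grep -E '[aeiou]{4}' "$words")

# Same search, then drop every line containing 's.
no_possessive=$(grep -E '[aeiou]{4}' "$words" | grep -v "'s")

printf '%s\n' "$four"            # prints Hawaiian, aqueous, queue, queue's
printf '%s\n' "$no_possessive"   # prints Hawaiian, aqueous, queue
rm -f "$words"
```

"audio" is left out because its vowels come in two separate pairs, never four in a row.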
- grep
- sed
- awk
- and more...