stevepedwards.com/DebianAdmin linux mint IT admin tips info

Exploring Find Cmd Options – Continuing With Metacharacters

The regular expression metacharacters consist of the following:

^ $ . [ ] { } - ? * + ( ) | \

You cannot look these symbols up individually in the man pages, except for the [ “test” command, e.g.:

man ^

No manual entry for ^

These metacharacters can be read and expanded by the shell as special characters, or “escaped” by prefixing with the backslash “\” or quoted in “” to prevent expansion.

The first example shows the difference between an unescaped metacharacter and a literal one, starting with probably the most commonly used metacharacter: the *.

Using the find command, here is an erroneous attempt to search for files whose names start with ANY character:

find Videos/ -name *

find: paths must precede expression: chapter1.txt

Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]

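This fails with an error on stderr because the shell expands the * into the names of all the files in the current directory (the one ABOVE Videos) before find runs. The first name becomes the -name pattern and the rest look to find like search paths placed after the expression.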

But if the asterisk is quoted, it reaches find intact and all files in the Videos directory are listed, as they all start with a character of some sort.

find Videos/ -name '*'

Videos/Martial Law 9-11- Rise Of The Police State.mp4

Videos/What Happened on the Moon.mp4

Videos/The Untold Secrets of NASA - Unbelievable Mars - Space Documentary(2015).mp4

Videos/Peter Joseph's 'Where are we going'.mp4…..etc.

So, escaping does the same:

find Videos/ -name \*

This gives the same result as above: all files in Videos are found and listed.
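One way to see what the shell actually hands to find is to put printf in its place and look at the arguments. This is only a minimal sketch in a scratch directory made with mktemp; the file names are made up for illustration and are not the ones from the Videos directory above.

```shell
# Scratch area with hypothetical file names.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir Videos
touch chapter1.txt chapter2.txt Videos/a.mp4

# Unquoted: the shell replaces * with the names in the current directory
# before the command runs, so several stray arguments appear.
unquoted=$(printf '<%s> ' -name *)

# Quoted or escaped: the * reaches the command untouched.
quoted=$(printf '<%s> ' -name '*')
escaped=$(printf '<%s> ' -name \*)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "escaped:  $escaped"
```

The unquoted line shows the directory's file names where the * used to be, which is exactly what find complained about.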

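If the asterisk is preceded or followed by other characters, it acts as a wildcard for the rest of the name. The pattern should still be quoted so that find, not the shell, does the matching; unquoted, 0* only reaches find intact when nothing in the current directory happens to match it. To find all files starting with a “0”:

find Videos/ -name '0*'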

Videos/027 Richard Dolan Montreal - ModernKnowledge @ CapricornRadioTV.mp4

What if you want to find all files starting with any number? You can use a range defined as:

find Videos/ -name '[0-9]*'

Videos/9-11- Decade of Deception (Full Film NEW 2015).mp4

Videos/027 Richard Dolan Montreal - ModernKnowledge @ CapricornRadioTV.mp4

But a list of just the two digits 0 and 9 returns the same files:

find Videos/ -name '[09]*'

Videos/9-11- Decade of Deception (Full Film NEW 2015).mp4

Videos/027 Richard Dolan Montreal - ModernKnowledge @ CapricornRadioTV.mp4

The lesson here is: don't assume your command is correct in general on the basis of one result set! This directory happens to contain only files beginning with a 0 or a 9, so two very different search conditions give the same result.

So, in English, you understand [0-9] as “find any file starting with a 0 OR a 1 OR a 2...9.”
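The difference between the range and the two-digit list only shows up once a file starts with some other digit. A small sketch, assuming a scratch directory and invented file names:

```shell
# Three hypothetical files; only the middle one separates the two patterns.
demo=$(mktemp -d)
touch "$demo/027 talk.mp4" "$demo/5 minutes.mp4" "$demo/9-11 film.mp4"

# [0-9] is a range: any one digit from 0 through 9.
range_hits=$(find "$demo" -type f -name '[0-9]*' | wc -l)

# [09] is a list: a 0 or a 9, nothing in between.
list_hits=$(find "$demo" -type f -name '[09]*' | wc -l)

echo "range [0-9]: $range_hits   list [09]: $list_hits"
```

The range finds all three files, while the list misses the one starting with 5.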

To do the opposite – find all files NOT beginning with numbers:

find Videos/ -name '[!0-9]*'

Videos/Martial Law 9-11- Rise Of The Police State.mp4

Videos/What Happened on the Moon.mp4

Videos/The Untold Secrets of NASA - Unbelievable Mars - Space Documentary(2015).mp4

Videos/Peter Joseph's 'Where are we going'.mp4…..etc.

Note the exclamation mark (“bang”) ! has to be INSIDE the brackets; placed outside, find reads it as the literal first character of a file name, not as a logical NOT:

find Videos/ -name '![0-9]*'

(no files found as none begin with a “!”)

As a NOT operator for find itself, the ! is a separate, space-delimited argument placed before the -name test. To find all files NOT beginning with a number:

find Videos/ ! -name '[0-9]*'

Videos/Martial Law 9-11- Rise Of The Police State.mp4

Videos/What Happened on the Moon.mp4

Videos/The Untold Secrets of NASA - Unbelievable Mars - Space Documentary(2015).mp4

Videos/Peter Joseph's 'Where are we going'.mp4…..etc.

What about file name order relating to case and given by a range?

It's complex because it depends on the locale (collation order) your PC is set to, AND because find does not sort its results: it lists entries in the order the directory returns them. This is why you get some seemingly weird results for alphabetical listings.

First, so I know how many entries I have in total in Videos:

find Videos/ -name '*' | wc -l

137

William Shotts's TLCL.pdf (The Linux Command Line) gives examples for grep on p273 that use character lists such as:

find Videos/ -name '[ABCDEFGHIJKLMNOPQRSTUVWXZY]*' | wc -l

131

So I know I am not seeing ALL files, as 2 begin with numbers,

find Videos/ -name '[0-9]*' | wc -l

2

and 3 with lowercase e.g.

find Videos/ -name '[abcdefghijklmnopqrstuvwxyz]*' | wc -l

3

That adds up to 131 + 2 + 3 = 136, not 137, so check by searching with the lists combined, uppercase and lowercase first:

find Videos/ -name '[ABCDEFGHIJKLMNOPQRSTUVWXZYabcdefghijklmnopqrstuvwxyz]*' | wc -l

134

then:

find Videos/ -name '[ABCDEFGHIJKLMNOPQRSTUVWXZYabcdefghijklmnopqrstuvwxyz0123456789]*' | wc -l

136

BUT that's NOT exactly correct – I'm missing 1 file!?

How can I find it??

Use the NOT operator ! with the find option...? Aha! A file that begins with a quote “ ' ”

find Videos/ ! -name '[0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXZY]*'

Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4

I found that file by elimination logic in the find command, but how would you match the quote character to find that file, assuming you knew it existed? A backslash cannot escape it inside a single-quoted pattern: the shell takes the ' after the \ as the end of the quoted string, sees the final ' as the start of a new one, and waits for more input at the > prompt:

find Videos/ ! -name '[\'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXZY]*'

>

Going back to basics, the \ escape used outside quotes, as at the start above, does it:

find Videos/ -name \'*

Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4
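The quote can also be put inside a bracket list if the quoting style is changed. A minimal sketch, assuming a scratch directory with invented file names:

```shell
demo=$(mktemp -d)
touch "$demo/'Why in the World.mp4" "$demo/What Happened.mp4" "$demo/027 talk.mp4"

# Double quotes still protect the brackets and the * from the shell,
# and a single quote is an ordinary character inside double quotes.
with_double=$(find "$demo" -type f -name "[']*")

# Staying with single quotes: close the string, add an escaped quote
# outside it, then reopen the string.
with_single=$(find "$demo" -type f -name '['\'']*')

echo "$with_double"
echo "$with_single"
```

Both forms hand find the same pattern, [']*, so both list only the file whose name starts with a quote.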

The order of characters in a range depends on the locale, whereas the historical ASCII order is purely numerical, as Shotts states:

“Back when Unix was first developed, it only knew about ASCII characters, and this feature reflects that fact. In ASCII, the first 32 characters (numbers 0-31) are control codes (things like tabs, backspaces, and carriage returns). The next 32 (32-63) contain printable characters, including most punctuation characters and the numerals zero through nine. The next 32 (numbers 64-95) contain the uppercase letters and a few more punctuation symbols. The final 31 (numbers 96-127) contain the lowercase letters and yet more punctuation symbols. Based on this arrangement, systems using ASCII used a collation order that looked like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This differs from proper dictionary order, which is like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

...

To support this ability, the POSIX standards introduced a concept called a locale, which could be adjusted to select the character set needed for a particular location. We can see the language setting of our system using this command:

[me@linuxbox ~]$ echo $LANG

en_US.UTF-8

With this setting, POSIX compliant applications will use a dictionary collation order rather than ASCII order. This explains the behavior of the commands above. A character range of [A-Z] when interpreted in dictionary order includes all of the alphabetic characters except the lowercase “a”, hence our results…

To partially work around this problem, the POSIX standard includes a number of character classes which provide useful ranges of characters.”

I have only 1 file that starts with a lowercase “a”:

find Videos/ -name a*

Videos/antigravity hutchison effect.mp4

So, check which locale my system is set to:

echo $LANG

en_GB.UTF-8

If that locale collates in dictionary order, I should NOT find that one file with Shotts's [A-Z], and I don't: grep finds the string “anti” only inside another name (in “Atlantis”):

find Videos/ -name '[A-Z]*' | grep anti

Videos/ILLUMINATI SECRETS - The New Atlantis - FEATURE FILM.mp4

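The counts are worth comparing with the explicit lists used earlier:

find Videos/ -name '[A-Z]*' | wc -l

131

find Videos/ -name '[a-Z]*' | wc -l

134

Here [A-Z] gives the same 131 as the explicit uppercase list, so on this system it matched none of the three names that start with a lowercase letter, while [a-Z] covers all 134 names that begin with a letter.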
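If the locale's effect on ranges is unwanted, one option is to set LC_ALL=C for a single command, which selects plain ASCII ordering for that process only. A small sketch, assuming a scratch directory with invented file names:

```shell
demo=$(mktemp -d)
touch "$demo/antigravity.mp4" "$demo/hutchison.mp4" "$demo/Irrefutable.mp4"

# In the C locale, A-Z covers the 26 uppercase ASCII letters and nothing else.
upper_c=$(LC_ALL=C find "$demo" -type f -name '[A-Z]*' | wc -l)

# The character class means "uppercase letter" in whatever locale is active.
upper_class=$(find "$demo" -type f -name '[[:upper:]]*' | wc -l)

echo "C-locale [A-Z]: $upper_c   [[:upper:]]: $upper_class"
```

Either way only the one name starting with a capital letter is counted, without having to reason about how a given locale sorts its letters.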

Knowing this, AND that find does not sort its output but lists entries in directory order, it is easier to understand the apparently illogical alphabetical order (R, R, C) in the output, such as in this part of the full listing:

find Videos/ -name '*'

Videos/Richplanet 2016 UK Tour - PART 2 OF 3.mp4

Videos/RP EP26 PT1.mp4

Videos/Crop Circles- The Hidden Truth - Part 4.mp4
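When a predictable order matters, the output can be piped through sort. A minimal sketch with invented file names, using LC_ALL=C on sort so the result is plain ASCII order regardless of the system locale:

```shell
demo=$(mktemp -d)
touch "$demo/Richplanet.mp4" "$demo/RP EP26.mp4" "$demo/Crop Circles.mp4"

# find prints entries in whatever order the directory returns them;
# sort imposes a fixed order afterwards.
sorted=$(find "$demo" -type f -name '*' | LC_ALL=C sort)
first=$(printf '%s\n' "$sorted" | head -n 1)
last=$(printf '%s\n' "$sorted" | tail -n 1)

echo "$sorted"
```

In ASCII order the uppercase P sorts before the lowercase i, so “RP EP26” comes before “Richplanet”. A line-based sort like this assumes no file name contains a newline.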

Table 19-2: POSIX Character Classes

Character Class   Description
[:alnum:]         The alphanumeric characters. In ASCII, equivalent to: [A-Za-z0-9]
[:word:]          The same as [:alnum:], with the addition of the underscore (_) character.
[:alpha:]         The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]         Includes the space and tab characters.
[:cntrl:]         The ASCII control codes. Includes the ASCII characters 0 through 31 and 127.
[:digit:]         The numerals zero through nine.
[:graph:]         The visible characters. In ASCII, it includes characters 33 through 126.
[:lower:]         The lowercase letters.
[:punct:]         The punctuation characters. In ASCII, equivalent to: [-!"#$%&'()*+,./:;<=>?@[\\\]_`{|}~]
[:print:]         The printable characters. All the characters in [:graph:] plus the space character.
[:space:]         The whitespace characters including space, tab, carriage return, newline, vertical tab, and form feed. In ASCII, equivalent to: [ \t\r\n\v\f]
[:upper:]         The uppercase characters.
[:xdigit:]        Characters used to express hexadecimal numbers. In ASCII, equivalent to: [0-9A-Fa-f]

“Remember, however, that this is not an example of a regular expression, rather it is the shell performing pathname expansion….

POSIX Basic Vs. Extended Regular Expressions

Just when we thought this couldn’t get any more confusing, we discover that POSIX also splits regular expression implementations into two kinds: basic regular expressions (BRE) and extended regular expressions (ERE). The features we have covered so far are supported by any application that is POSIX compliant and implements BRE. Our grep program is one such program.

What’s the difference between BRE and ERE? It’s a matter of metacharacters. With BRE, the following metacharacters are recognized:

^ $ . [ ] *

All other characters are considered literals. With ERE, the following metacharacters (and their associated functions) are added:

( ) { } ? + |

However (and this is the fun part), the “(”, “)”, “{”, and “}” characters are treated as metacharacters in BRE if they are escaped with a backslash, whereas with ERE, preceding any metacharacter with a backslash causes it to be treated as a literal. Any weirdness that comes along will be covered in the discussions that follow.”
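The brace behaviour described in that quote can be tried directly with grep, which uses BRE by default and ERE with -E. A small sketch on made-up input lines:

```shell
# Interval "exactly two a's then b": BRE needs backslashes on the braces.
bre=$(printf 'ab\naab\naaab\n' | grep -c 'a\{2\}b')

# The same interval in ERE needs no backslashes.
ere=$(printf 'ab\naab\naaab\n' | grep -E -c 'a{2}b')

# Unescaped braces in BRE are just literal characters.
literal=$(printf 'aab\na{2}b\n' | grep -c 'a{2}b')

echo "BRE: $bre   ERE: $ere   literal braces: $literal"
```

The first two counts agree (the lines aab and aaab both contain “aab”), while the third pattern only matches the line that really contains the characters a{2}b.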

ls /usr/sbin/[[:upper:]]*

/usr/sbin/ModemManager /usr/sbin/VBoxControl

/usr/sbin/NetworkManager /usr/sbin/VBoxService

Find can be used similarly. Find all files starting with a lower case letter:

find Videos/ -name '[[:lower:]]*'

Videos/screencasts

Videos/hutchison effect wiki never before seen footage 3 25 2011.mp4

Videos/antigravity hutchison effect.mp4

Find all files NOT beginning with letters:

find Videos/ ! -name '[[:alpha:]]*'

Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4

Videos/9-11- Decade of Deception (Full Film NEW 2015).mp4

Videos/027 Richard Dolan Montreal - ModernKnowledge @ CapricornRadioTV.mp4

Same as [0-9] above:

find Videos/ -name '[[:digit:]]*'

Videos/9-11- Decade of Deception (Full Film NEW 2015).mp4

Videos/027 Richard Dolan Montreal - ModernKnowledge @ CapricornRadioTV.mp4

Note that two negated tests are needed to exclude names starting with either letters OR numbers, which leaves only my file name that starts with a quote:

find Videos/ ! -name '[[:alpha:]]*' ! -name '[[:digit:]]*'

Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4

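This can be simplified to a single NOT alphanumeric range (a-Z relies on the locale's dictionary collation; in plain ASCII order a comes after Z, so there the range is not valid):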

find Videos/ ! -name '[0-9a-Z]*'

find Videos/ -name '[!0-9a-Z]*'

Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4
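A character class can express the same "not a letter or digit" test without depending on how a range sorts. A minimal sketch, assuming a scratch directory with invented file names:

```shell
demo=$(mktemp -d)
touch "$demo/'Why in the World.mp4" "$demo/9-11 film.mp4" \
      "$demo/antigravity.mp4" "$demo/What Happened.mp4"

# [![:alnum:]] = the first character is neither a letter nor a digit.
odd=$(find "$demo" -type f -name '[![:alnum:]]*')

echo "$odd"
```

Only the name beginning with a quote is listed. The -type f test is there just to keep the scratch directory itself out of the output.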

Find all files without white space anywhere in the name:

find Videos/ ! -name '*[[:blank:]]*'

Videos/

Videos/Irrefutable.mp4

Videos/RichDLoydPye.mp4

Videos/screencasts

A really handy use of the above on Linux systems is to strip the white space from file names, say, for an MP3 collection whose Album/Track names often look horrible on the command line, using the -exec action seen already in the Cool Commands posts (the "s/ //g" syntax is that of the Perl rename shipped with Debian-based systems):

BEFORE:

ls /Storebird/MP3/Candy\ Dulfer\ -\ Sax-A-Go-Go\ \(74321_111812\)/Candy\ Dulfer\ \ \ -\ 2\ Funky.mp3

run it for directories first:

find /Storebird/MP3/ -type d -name '*[[:blank:]]*' -exec rename "s/ //g" {} \;

AFTER:

ls /Storebird/MP3/CandyDulfer-Sax-A-Go-Go\(74321_111812\)/

Re-run it for files:

find /Storebird/MP3/ -type f -name '*[[:blank:]]*' -exec rename "s/ //g" {} \;

ls /Storebird/MP3/CandyDulfer-Sax-A-Go-Go\(74321_111812\)/CandyDulfer-2Funky.mp3

You could then put these two lines in a simple shell script or alias to run in any directory you cd into, e.g.:

find -type d -name "*[[:blank:]]*" -exec rename "s/ //g" {} \;

find -type f -name "*[[:blank:]]*" -exec rename "s/ //g" {} \;
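A more cautious variant of the same idea is sketched below. It is not the method used above: it swaps rename for plain mv, and strip_spaces is a hypothetical helper name. It assumes a find that supports -depth and -exec ... {} +, previews by default, and only renames when the second argument is the word apply.

```shell
# strip_spaces DIR [apply]
# Hypothetical helper: prints the planned renames under DIR and performs
# them only when the second argument is "apply".
strip_spaces() {
    # -depth visits a directory's contents before the directory itself,
    # so a parent is never renamed while find still has to look inside it.
    find "$1" -depth -name '*[[:blank:]]*' -exec sh -c '
        mode=$1; shift
        for path do
            dir=$(dirname "$path")
            base=$(basename "$path")
            # Delete spaces and tabs from the last path component only.
            new=$(printf "%s" "$base" | tr -d "[:blank:]")
            if [ -e "$dir/$new" ]; then
                echo "skip (target exists): $path"
            elif [ "$mode" = apply ]; then
                mv -- "$path" "$dir/$new"
            else
                echo "would rename: $path -> $dir/$new"
            fi
        done
    ' sh "${2:-preview}" {} +
}
```

Running strip_spaces /some/dir shows what would change, and strip_spaces /some/dir apply does it. Because only the last path component is edited and children are handled before parents, files and directories can be done in a single pass.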

vi ~/rmspaces.sh

[screenshot: rmspaces.png]

[screenshot: candyd.png]

Terminal file searches are now tidy:

[screenshot: candynospace.png]

To find (and then rename or remove) files beginning with undesirable characters, like my "quote" file, i.e. names that start with any of:

[-!"#$%&'()*+,./:;<=>?@[\\\]_`{|}~]

find Videos/ -name '[[:punct:]]*'
Videos/'Why in the World are They Spraying' Documentary HD (multiple language subtitles).mp4

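That MP3 directory could have all those "Windows" legacy backslashes found too, eh? Bear in mind that this matches ANY punctuation anywhere in a directory name, including - and _ :

find -type d -name '*[[:punct:]]*'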

Be safe and do the search check only FIRST before any rename! Make sure it does what you want it to!

I'll quote Shotts completely for this last important example:

“Finding Ugly Filenames With find

The find command supports a test based on a regular expression. There is an important consideration to keep in mind when using regular expressions in find versus grep. Whereas grep will print a line when the line contains a string that matches an expression, find requires that the pathname exactly match the regular expression. In the following example, we will use find with a regular expression to find every pathname that contains any character that is not a member of the following set:

[-_./0-9a-zA-Z]

Such a scan would reveal path names that contain embedded spaces and other potentially offensive characters:

find . -regex '.*[^-_./0-9a-zA-Z].*'

Due to the requirement for an exact match of the entire pathname, we use .* at both ends of the expression to match zero or more instances of any character. In the middle of the expression, we use a negated bracket expression containing our set of acceptable pathname characters.”

It's easier for me to show the opposite, nicely named files in Videos, using that handy command, but negated:

find Videos/ ! -regex '.*[^-_./0-9a-zA-Z].*'

Videos/

Videos/Irrefutable.mp4

Videos/RichDLoydPye.mp4

Videos/screencasts

The point to take from that example is that find has dedicated -regex and -iregex tests; see the man page:

-regex pattern

File name matches regular expression pattern. This is a match on the whole path, not a search. For example, to match a file named `./fubar3', you can use the regular expression `.*bar.' or `.*b.*3', but not `f.*r3'. The regular expressions understood by find are by default Emacs Regular Expressions, but this can be changed with the -regextype option.
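The man page's own fubar3 example is easy to reproduce. A minimal sketch, assuming a find that supports -regex (GNU find does) and an empty scratch directory:

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch fubar3

whole1=$(find . -regex '.*bar.')   # matches: the pattern spans all of ./fubar3
whole2=$(find . -regex '.*b.*3')   # matches for the same reason
partial=$(find . -regex 'f.*r3')   # no match: the leading ./ is not covered

echo "[$whole1] [$whole2] [$partial]"
```

The third result is empty because the pattern has to account for the whole path, including the ./ that find puts in front of the name.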

That finds the weird file names in the mp3 folder for sure:

stevee@dellmint /Quadra/MP3 $ find . -regex '.*[^-_./0-9a-zA-Z].*'

[screenshot: regexmp3s.png]

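There are many album names with numbers enclosed in brackets (). A pattern ending in [()]* would match every path, since the * also allows zero brackets, so put .* after the bracket expression to find only the names that contain one:

find CandyD* -regex '.*[()].*'

[screenshot: candybraces.png]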

It's one thing to find them, but another to remove these brackets with their contents...
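As a starting point for that, the new name can be worked out with sed before anything is renamed. This is only a sketch: strip_parens is a hypothetical helper, it just prints the proposed name, and it assumes the brackets are not nested.

```shell
# Hypothetical helper: print a name with every "(...)" group removed
# and any trailing spaces or tabs trimmed. Nothing is renamed.
strip_parens() {
    printf '%s\n' "$1" | sed -e 's/([^)]*)//g' -e 's/[[:blank:]]*$//'
}

strip_parens 'Candy Dulfer - Sax-A-Go-Go (74321_111812)'
strip_parens 'CandyDulfer-Sax-A-Go-Go(74321_111812)'
```

In sed's basic regular expressions an unescaped ( or ) is a literal character, so no backslashes are needed here. Check the printed names first, as with the search-before-rename advice above.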

 
