...making Linux just a little more fun!

2-Cent Tips

2-cent Tips: understand file system hierarchy right from the man pages

Mulyadi Santosa [mulyadi.santosa at gmail.com]


Fri, 23 Jul 2010 14:28:18 +0700

Probably one of my shortest tips so far:

Confused about what /proc, /sys, /dev, /boot, etc. really mean and why on Earth they are there? Simply type "man hier" in your shell and hopefully you'll understand :)

-- regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com



2-cent tip: De-Microsofting text files

Ben Okopnik [ben at linuxgazette.net]


Fri, 23 Jul 2010 14:21:02 -0400

I was doing some PDF to HTML conversions today, and noticed some really ugly, borken content in the resulting files; the content had obviously been created via some Microsoft program (probably Word):

Just say ?<80><98>hello, world!?<80><99>?<80><9d>

I had a few dozen docs to fix, and didn't have a mapping of the characters with which I wanted to replace these ugly clumps of hex. That is, I could see what I wanted, but expressing it in code would take a bit more than that.

Then, I got hit by an idea. After I got up, rubbed the bruise, and took an aspirin, I wrote the following:

#!/usr/bin/perl -w
# Created by Ben Okopnik on Fri Jul 23 12:05:05 EDT 2010
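# Decode the input files and encode the output as UTF-8
# (the 'encoding' pragma is deprecated in current Perls)
use open qw/:std :utf8/;
 
# Slurp all the files into one string; the join matters, since in list
# context <> hands back one chunk per file and $s would get only the first
my %seen;
my $s = do { local $/; join '', <> };
# Delete all "normal" characters
$s =~ s/[\011\012\015\040-\176]//g;
print "#!/usr/bin/perl -i~ -wp\n\n";
for (split //, $s){ next if $seen{$_}++; print "s/$_//g;\n"; }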

When this script is given a list of all the text files as arguments, it collects a unique list of the UTF-8 versions of all the "weird" characters and outputs a second Perl script which you can now edit to define the replacements:

#!/usr/bin/perl -i~ -wp
 
s/\xFE\xFF//g;
s/“//g;
s/–//g;
s/‘//g;
s/”//g;
s/—//g;
s/…//g;
s/’//g;
s/©//g;
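
To try the collection step without touching any files, here is a minimal sketch that applies the same idea to an in-memory string. The helper name unique_oddballs and the sample text are made up for illustration, and the code point annotations (via charnames::viacode) are an addition that the script above does not produce.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();    # only needed for charnames::viacode()

# Hypothetical helper: return the unique non-"normal" characters of a
# decoded string, in order of first appearance.
sub unique_oddballs {
    my ($text) = @_;
    # Same character class as the script above: drop tab, LF, CR and
    # printable ASCII, leaving only the "weird" characters.
    $text =~ s/[\011\012\015\040-\176]//g;
    my %seen;
    return grep { !$seen{$_}++ } split //, $text;
}

binmode STDOUT, ':encoding(UTF-8)';
my $sample = "Just say \x{2018}hello, world!\x{2019}\x{201D}";
for my $ch (unique_oddballs($sample)) {
    my $code = ord $ch;
    my $name = charnames::viacode($code) // 'unnamed';
    # Emit an empty substitution, annotated so it is easier to fill in
    printf "s/%s//g;\t# U+%04X %s\n", $ch, $code, $name;
}
```

Each output line is an empty substitution followed by a comment naming the character, which may help when the glyphs themselves are hard to tell apart in a terminal.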

Note that the second half of each substitution is empty; that's where you put in your replacements, like so:

#!/usr/bin/perl -i~ -wp
 
s/\xFE\xFF//g;	# We'll get rid of the 'BOM' marker
s/“/"/g;
s/–/-/g;
s/‘/'/g;
s/”/"/g;
s/—/-/g;
s/…/.../g;
s/’/'/g;
s/©/&copy;/g;	# We'll make an HTML entity out of this one

Now, just make this script executable, feed it a list of all your text files, and live happily ever after. Note that the original versions will be preserved with a '~' appended to their filenames, just in case.
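
If you would rather keep the mapping in one place, the same edits can be sketched as a lookup table applied with a single substitution. This is an alternative formulation, not the generated script above: the contents of %map and the helper name demicrosoft are illustrative assumptions, and the table would need an entry for every character the first script reported.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative mapping only; fill in whatever your own files need.
my %map = (
    "\x{2018}" => "'",         # left single quote
    "\x{2019}" => "'",         # right single quote
    "\x{201C}" => '"',         # left double quote
    "\x{201D}" => '"',         # right double quote
    "\x{2013}" => '-',         # en dash
    "\x{2014}" => '-',         # em dash
    "\x{2026}" => '...',       # ellipsis
    "\x{00A9}" => '&copy;',    # copyright sign, as an HTML entity
    "\x{FEFF}" => '',          # byte order mark
);

# Build one alternation from the keys, escaping anything regex-special
my $pattern = join '|', map { quotemeta } keys %map;

# Hypothetical helper: expects an already-decoded character string.
sub demicrosoft {
    my ($text) = @_;
    $text =~ s/($pattern)/$map{$1}/g;
    return $text;
}

print demicrosoft("Just say \x{2018}hello, world!\x{2019}\x{201D}"), "\n";
# prints: Just say 'hello, world!'"
```

Note that this sketch works on decoded characters, so real files would have to be read through a UTF-8 layer first; the generated script, by contrast, matches the raw bytes as they sit in the file.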

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Copyright © 2010, . Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 177 of Linux Gazette, August 2010
