Ben Okopnik [ben at linuxgazette.net]
Just ran across Yet Another Proprietary Format from Micr0s0ft: .mht files. Seems that Internet Explorer saves emails and HTML as an ugly mess that somewhat resembles an email; according to Wikipedia, there's no single standard, and the state of the state is best described as 'sauve qui peut' (which translates, at least in Redmond, as "all your ass are belong to me!") Bleh.
Searching the Web shows that there are a lot of people - the just-converted-to-Linux newbies, particularly - who have loads of these things and don't know what to do with them. Some people recommend Opera (I suppose a couple of hours of Kiri Te Kanawa is good for relieving all kinds of stress...); some have had luck with various conversion utilities. I looked at the file, and it looked something like a mangled email header, soooo...
I didn't go searching for more than just the one file that I had, but here's what worked fine for opening it:
    # Convert line-ends to Unix format
    flip -ub file.mht
    # Prepend a standard 'From ' mail header to the file
    sed -i '1i\'"$(echo From $USER $(date))" file.mht
    # You should now be able to open it with your favorite MUA
    mutt -f file.mht
It worked fine for me.
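For anyone without 'flip' installed, the same two steps can be sketched as a small shell function that writes a new mbox file instead of editing the original in place. This is only a sketch under assumptions: the function name 'mht_to_mbox' is made up for illustration, and the date layout in the 'From ' line is a guess at what a typical MUA will accept, not something tested against a range of .mht files.

```shell
#!/bin/sh
# mht_to_mbox: hypothetical helper name, not part of the original recipe.
# Prints an mbox-style copy of an .mht file on stdout; the input is not modified.
mht_to_mbox() {
    # mbox 'From ' separator line: a sender placeholder plus an asctime-style date
    printf 'From %s %s\n' "${USER:-nobody}" \
        "$(LC_ALL=C date '+%a %b %e %H:%M:%S %Y')"
    # Strip carriage returns so the headers end in plain Unix line-ends
    tr -d '\r' < "$1"
}

# Usage:
#   mht_to_mbox file.mht > file.mbox && mutt -f file.mbox
```

Writing to a separate file keeps the original around in case the MUA chokes on the result.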
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Mulyadi Santosa [mulyadi.santosa at gmail.com]
On Sat, Feb 14, 2009 at 8:08 AM, Ben Okopnik wrote:
> # Convert line-ends to Unix format
> flip -ub file.mht
perhaps as the alternative of "flip", dos2unix could be used too here?
regards,
Mulyadi.
Thomas Adam [thomas.adam22 at gmail.com]
2009/2/14 Mulyadi Santosa:
> perhaps as the alternative of "flip", dos2unix could be used too here?
Yes they could, but as I have said before in similar posts, "col" is perhaps the most portable way, across almost all UNIX variants.
-- Thomas Adam
Ben Okopnik [ben at linuxgazette.net]
On Sat, Feb 14, 2009 at 03:42:27PM +0000, Thomas Adam wrote:
> Yes they could, but as I have said before in similar posts, "col" is
> perhaps the most portable way, across almost all UNIX variants.
I've got to say, I've never been a fan of 'col' - the man page has always confused the living hell out of me. What is a "half-reverse line feed", anyway?
If you have any favorite recipes you like to use with it, please share.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Jimmy O'Regan [joregan at gmail.com]
2009/2/14 Ben Okopnik:
> I've got to say, I've never been a fan of 'col' - the man page has
> always confused the living hell out of me. What is a "half-reverse line
> feed", anyway?
>
> If you have any favorite recipes you like to use with it, please share.
I just got this (the culprit shall remain nameless):
for i in `wc -l * | grep ' * 0 ' | sed 's/ 0 /@/g' | cut -f2 -d'@'`; do rm $i; done
(but I think 'find . -size 0 -exec rm {} \;' is much easier to remember)
Ben Okopnik [ben at linuxgazette.net]
On Sun, Feb 15, 2009 at 03:10:55PM +0000, Jimmy O'Regan wrote:
> I just got this (the culprit shall remain nameless):
>
> for i in `wc -l * | grep ' * 0 ' | sed 's/ 0 /@/g' | cut -f2 -d'@'`;
> do rm $i; done
(Whoops. I meant "If you have any recipes using 'col'" - I guess I didn't make it clear enough.)
Ye ghods, what a convoluted mess. 'find', as you point out, is the tool that would come to my mind first; it would do recursive removals, which the other one won't. If you just want to remove all the zero-length files in the current dir, this is simple enough:
ls -l|awk '$5~/^0$/{system("rm "$8)}'
If you wanted to do it with shell tools, there's always
for n in *; do [ -f "$n" -a ! -s "$n" ] && rm "$n"; done
> (but I think 'find . -size 0 -exec rm {} \;' is much easier to remember)
Yep.
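A middle ground between the two approaches above, sketched here with some assumptions: 'find' can be told to stay in one directory, and it copes with spaces and leading dots in file names, which the 'ls | awk' version does not. The function name 'remove_empty_files' is made up for illustration, and '-maxdepth' is an extension found in GNU and BSD find rather than part of POSIX.

```shell
#!/bin/sh
# remove_empty_files: hypothetical helper name, not from the thread.
# Deletes zero-length regular files directly inside the given directory,
# without descending into subdirectories.
remove_empty_files() {
    dir=${1:-.}
    # -maxdepth 1 : do not recurse (GNU/BSD extension)
    # -type f     : regular files only, so directories are never touched
    # -size 0     : zero-length files
    # '--' stops rm from treating an odd file name as an option
    find "$dir" -maxdepth 1 -type f -size 0 -exec rm -- {} +
}

# Usage:
#   remove_empty_files            # current directory
#   remove_empty_files /some/dir  # another directory
```

Unlike the 'for n in *' loop, this also removes empty dotfiles, since 'find' does not skip hidden names the way the shell glob does.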
I've been doing the LPI certification stuff lately, and one of the on-line study guides had some horribly convoluted tool kit stuff; a really ugly "Identify which line will not sort a file by the third word on a line" question. One of the examples went something like this:
cut -d ' ' -f 2 file.txt|paste - file.txt|sort|cut -f 2-
I guess this is part of the price I pay for knowing how to use "sort" properly ("sort -k2", anyone?) - and even if I didn't, I would have just used Perl or something. When stuff like this shows up... oh, the suffering. Took me a while to get it.
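To see what that pipeline is doing, here is a small sketch that runs it next to the direct 'sort' form on a three-line sample. The sample data is made up, and the comparison uses 'sort -k2,2' (key limited to the second field) with LC_ALL=C so that the ordering does not depend on the locale; on messier input the two forms can break ties differently.

```shell
#!/bin/sh
# Made-up sample: three lines, to be ordered by their second word
tmp=$(mktemp)
printf '%s\n' 'alpha zebra one' 'beta apple two' 'gamma mango three' > "$tmp"

# The quiz pipeline: copy word 2 to the front of each line (tab-separated),
# sort on that prefix, then cut the prefix off again
long_way=$(cut -d ' ' -f 2 "$tmp" | paste - "$tmp" | LC_ALL=C sort | cut -f 2-)

# The direct form: sort on the second whitespace-separated field
short_way=$(LC_ALL=C sort -k2,2 "$tmp")

rm -f "$tmp"

if [ "$long_way" = "$short_way" ]; then
    echo "same order"
fi
```

The long version is the classic decorate-sort-undecorate trick, which is only worth the trouble when the sort key cannot be expressed with '-k'.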
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Jimmy O'Regan [joregan at gmail.com]
2009/2/15 Ben Okopnik:
> (Whoops. I meant "If you have any recipes using 'col'" - I guess I
> didn't make it clear enough.)
Misread that; I was just struck by the awfulness of that line -- he also uses col in similarly evil ways.
That was to work around the output of a script to grab Bible text for a parallel corpus; it turned out to be unnecessary: ftp://ftp.funet.fi/pub/doc/bible/texts/danish/dkbibel.txt.gz
and a little perl to split it by book and chapter:
    #!/usr/bin/perl

    use warnings;
    use strict;
    use open IN => ':encoding(iso-8859-1)';
    use open OUT => ':encoding(utf8)';

    my $reading=0;
    my $sent;
    my $file;
    my $part;
    my ($a, $b, $htoi);

    while (<>) {
        if (/^horn\@proinf.dk$/) {$reading=1; next;}
        next if $reading==0;
        next if (/^$/);
        if(/^\*([\d]+)\/[\W]/) {
            if ($sent) {
                print OUTF "$sent\n";
                $sent=""; # reset state on changing books
            }
            next;
        }
        if(/^\*([\d]*)\/([0-9]+).*$/) {
            $file=sprintf("book%02d.chapter%03d.txt",int($1),int($2));
            open (OUTF, ">$file");
            next;
        }
        if(/^\*([\d]*)\/([a-f])([0-9]+).*$/) {
            $htoi=(hex($2)*10)+int($3);
            $file=sprintf("book%02d.chapter%03d.txt",int($1),$htoi);
            open (OUTF, ">$file");
            next;
        }
        if (/[\W]?([\d]+)[\W]?(.*)$/) {
            print OUTF "$sent\n" if ($sent); #Last sentence
            $part=$2;
            chomp $part;
            $sent="$1" . $part;
            next;
        }
        if(/^[\W]+(.*)$/) {
            $part=$1;
            chomp $part;
            $sent.=" " . $part;
            next;
        }
    }
and we don't have to go fighting with someone's website.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Feb 15, 2009 at 05:50:38PM +0000, Jimmy O'Regan wrote:
> That was to work around the output of a script to grab Bible text for
> a parallel corpus; it turned out to be unnecessary:
> ftp://ftp.funet.fi/pub/doc/bible/texts/danish/dkbibel.txt.gz
>
> and a little perl to split it by book and chapter:
[snip]
I don't think that does what it's supposed to. I just tried it out, and 'book01.chapter001.txt' ends with a line numbered '30', while 'book01.chapter002.txt' starts with '31' followed by '1'. Now, I am not a Bible scholar by any means, but that numbering scheme just doesn't seem right.
How about this instead:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use open IN => ':encoding(iso-8859-1)';
    use open OUT => ':encoding(utf8)';

    my ($book, $chapter, $fn);
    while (<>){
        if (m{^\*(\d\d)[/ ]([0-9a-f]\d)?}){
            close Fn if defined \*Fn;
            $book = $1;
            $chapter = defined $2 ? $2 : 1;
            $chapter =~ s/([a-f])(\d)/100 + (ord($1) - 97) * 10 + $2/e;
            $fn = sprintf "book%02d.chapter%03d.txt", $book, $chapter;
            open Fn, ">$fn" or die "$fn: $!\n";
            next;
        }
        next unless defined $fn;
        print Fn;
    }
    close Fn;
You might want to verify with a bit of spot-testing - that's a very strange numbering scheme they used - but I think I've got it right.
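For spot-testing, the chapter arithmetic can be pulled out on its own. The sketch below assumes the scheme is what the regex in the script implies: two decimal digits mean chapters up to 99, and a letter 'a' to 'f' followed by a digit means 100 and up, ten chapters per letter. The function name 'decode_chapter' is made up for illustration.

```shell
#!/bin/sh
# decode_chapter: hypothetical helper name; the two-character codes are
# assumed to follow the scheme implied by the regex in the script above.
decode_chapter() {
    code=$1
    case $code in
        [a-f][0-9])
            letter=${code%?}                 # first character, e.g. "a"
            digit=${code#?}                  # second character, e.g. "3"
            ord=$(printf '%d' "'$letter")    # character code; "a" is 97
            echo $((100 + ($ord - 97) * 10 + $digit))
            ;;
        [0-9][0-9])
            echo "${code#0}"                 # plain decimal; drop a leading zero
            ;;
        *)
            echo "unrecognised chapter code: $code" >&2
            return 1
            ;;
    esac
}

# Usage:
#   decode_chapter a3
#   decode_chapter 07
```

With this scheme 'a3' comes out as 103, which matches what both Perl scripts compute for the letter case.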
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Lew Pitcher [lew.pitcher at digitalfreehold.ca]
Well, I've just received four copies (one after another) of this email. All four have

Date: Mon, 16 Feb 2009 23:03:46 -0500

as their datestamp.
It looks like either
a) Ben is trying to be /very/ emphatic about his perl script ( ;-) ), or
b) the LJ email server has hiccuped
So, can we expect more copies? <grin>
-- Lew Pitcher Master Codewright & JOAT-in-training | Registered Linux User #112576 http://pitcher.digitalfreehold.ca/ | GPG public key available by request ---------- Slackware - Because I know what I'm doing. ------
Rick Moen [rick at linuxmafia.com]
Quoting Lew Pitcher:
> It looks like either
> a) Ben is trying to be /very/ emphatic about his perl script ( ;-) ), or
> b) the LJ email server has hiccuped
Temporary mail glitch. I know how to break the cycle, and have done so.
Jimmy O'Regan [joregan at gmail.com]
2009/2/17 Ben Okopnik:
> I don't think that does what it's supposed to. I just tried it out, and
> 'book01.chapter001.txt' ends with a line numbered '30', while
> 'book01.chapter002.txt' starts with '31' followed by '1'.
> [snip]
> You might want to verify with a bit of spot-testing - that's a very
> strange numbering scheme they used - but I think I've got it right.
Weeell... as it happens, the source text has quite a number of other glitches: the book of Daniel starts in the middle of chapter 1, verse 2, for example. The combination of the errors in my script and the freeness of some Bible translations has led to a new research topic for the guy I wrote it for: he's now studying ways to measure when to drop candidate sentences from a bilingual corpus. 'The Spirit of God moved over *the face of* the waters' is an example -- extra junk thrown in to turn a better phrase is not productive in any kind of machine translation, and yet there's been surprisingly little work done on automating the process of removing it.
Thomas Adam [thomas.adam22 at gmail.com]
2009/2/17 Ben Okopnik:
> How about this instead:
Works better here.
> $chapter = defined $2 ? $2 : 1;
I'm fortunate to not be affected by portability in perl, but I do much prefer that check above to Perl 5.10's "//" operator. I can't not parse that as some weird regexp operator. ;)
-- Thomas Adam
Ben Okopnik [ben at linuxgazette.net]
On Sat, Feb 14, 2009 at 10:02:20PM +0700, Mulyadi Santosa wrote:
> perhaps as the alternative of "flip", dos2unix could be used too here?
Sure; so could "sed -i 's/\r//' file" and "tr -d '\015' < file", etc.
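One difference worth knowing between those alternatives: 'tr -d' removes every carriage return in the file, while a pattern anchored at the end of the line removes only the one before the newline. A minimal sketch, using 'awk' for the anchored version since '\r' inside a 'sed' expression is a GNU extension; the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper names, for illustration only.

# Remove every carriage return, wherever it appears
strip_all_cr() { tr -d '\r'; }

# Remove only a carriage return at the very end of each line
strip_eol_cr() { awk '{ sub(/\r$/, ""); print }'; }

# Usage: a line with a stray CR in the middle plus a DOS line ending;
# 'od -c' makes any carriage returns left in the output visible.
printf 'one\rtwo\r\n' | strip_all_cr | od -c
printf 'one\rtwo\r\n' | strip_eol_cr | od -c
```

For ordinary DOS text files the two give the same result; they differ only when a carriage return appears in the middle of a line.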
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *