...making Linux just a little more fun!
Jimmy O'Regan [joregan at gmail.com]
Freelang has a lot of (usually small) dictionaries, for Windows. They have quite a few languages that aren't easy to find dictionaries for, so though the coverage and quality are usually quite low, they're sometimes all that's there.
So, an example: http://www.freelang.net/dictionary/albanian.php
Leads to a file, dic_albanian.exe
This runs quite well in Wine (I haven't found any other way of extracting the contents). On my system, the 'C:\users\jim\Local Settings\Application Data\Freelang Dictionary' translates to '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\ Dictionary/'. The dictionary files are inside the 'language' directory.
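The mapping is mechanical: 'C:\' becomes the 'drive_c' directory under the Wine prefix, and backslashes become slashes. A minimal sketch of that translation, assuming a default-style prefix such as ~/.wine and a path on drive C only. The function name wine_to_unix is made up for illustration, and spaces are left unescaped:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: translate a Windows path on drive C into the
 * matching path under a Wine prefix (e.g. "/home/jim/.wine").
 * Returns 0 on success, -1 if the path is not on C: or 'out' is too small. */
int wine_to_unix(const char *prefix, const char *win_path,
                 char *out, size_t out_size)
{
    int n;
    char *p;

    /* Only "C:\..." (or "c:\...") is handled in this sketch. */
    if ((win_path[0] != 'C' && win_path[0] != 'c') ||
        win_path[1] != ':' || win_path[2] != '\\')
        return -1;

    /* Skip the "C:\" and glue the rest onto <prefix>/drive_c/ */
    n = snprintf(out, out_size, "%s/drive_c/%s", prefix, win_path + 3);
    if (n < 0 || (size_t) n >= out_size)
        return -1;

    /* Turn the backslashes of the Windows part into forward slashes;
     * the prefix itself is left untouched. */
    for (p = out + strlen(prefix); *p != '\0'; p++)
        if (*p == '\\')
            *p = '/';
    return 0;
}
```

Called with the prefix "/home/jim/.wine" and the Windows path above, this yields the directory where the 'language' folder lives.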
Saving this as wb2dict.c:
    #include <stdlib.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        char src[31];
        char trg[53];
        FILE* f=fopen(argv[1], "r");
        if (f==NULL)
        {
            fprintf (stderr, "Error reading file: %s\n", argv[1]);
            exit(1);
        }
        while (!feof(f))
        {
            fread(&src, sizeof(char), 31, f);
            fread(&trg, sizeof(char), 53, f);
            printf ("%s\n %s\n\n", src, trg);
        }
        fclose(f);
        exit(0);
    }
The next step depends on the contents... Albanian on Windows uses Codepage 1250, so in this case:
    ./wb2dict Albanian_English.wb | recode 'windows1250..utf8' | dictfmt -f --utf8 albanian-english
    dictzip albanian-english.dict

(as root)

    cp albanian-english.* /usr/share/dictd/

add these lines to /var/lib/dictd/db.list:

    database albanian-english
    {
      data  /usr/share/dictd/albanian-english.dict.dz
      index /usr/share/dictd/albanian-english.index
    }

then restart the server:

    /etc/init.d/dictd restart

and now it's available:

    dict agim
    1 definition found

    From unknown [albanian-english]:

      agim
         dawn
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:
> Freelang has a lot of (usually small) dictionaries, for Windows. They
> have quite a few languages that aren't easy to find dictionaries for,
> so though the coverage and quality are usually quite low, they're
> sometimes all that's there.
>
> So, an example: http://www.freelang.net/dictionary/albanian.php
>
> Leads to a file, dic_albanian.exe
Sweet. Thanks, Jimmy - I can use that!
> This runs quite well in Wine (I haven't found any other way of
> extracting the contents). On my system, the 'C:\users\jim\Local
> Settings\Application Data\Freelang Dictionary' translates to
> '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\
> Dictionary/'. The dictionary files are inside the 'language'
> directory.
Oh, right - reminds me: for stuff like this, I've got a special directory I use so I don't have to hunt through the WINE structure. I created a symlink at ".wine/drive_c/temp/to_unix" that points to my /tmp directory, so if I just install the program to that directory, it shows up in my /tmp, all ready to be played with.
> Saving this as wb2dict.c:
[snip]
Whoops - that double-prints the last entry in the dictionary. Not a big deal, though.
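The culprit is the 'while (!feof(f))' test: feof() only turns true after a read has already hit end-of-file, so the loop runs one extra time with the buffers still holding the last entry. One way around it is to let fread() decide when to stop. The sketch below assumes the same 31+53-byte layout as wb2dict.c; the names wb_entry, read_entry and count_entries are invented for illustration:

```c
#include <stdio.h>

#define SRC_LEN 31   /* headword field size, as in wb2dict.c */
#define TRG_LEN 53   /* translation field size, as in wb2dict.c */

struct wb_entry {
    char src[SRC_LEN];
    char trg[TRG_LEN];
};

/* Read one fixed-size entry. Returns 1 on success, 0 at end-of-file
 * or on a short (truncated) record. */
int read_entry(FILE *f, struct wb_entry *e)
{
    if (fread(e->src, 1, SRC_LEN, f) != SRC_LEN)
        return 0;
    if (fread(e->trg, 1, TRG_LEN, f) != TRG_LEN)
        return 0;
    return 1;
}

/* Count the complete entries in a stream: the loop ends as soon as a
 * read comes up short, so no entry is processed twice. */
long count_entries(FILE *f)
{
    struct wb_entry e;
    long n = 0;

    while (read_entry(f, &e))
        n++;
    return n;
}
```

Replacing the body of the counting loop with the printf() call from wb2dict.c would give the same output minus the duplicate.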
> The next step depends on the contents... Albanian on Windows uses
> Codepage 1250, so in this case:
>
> ./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f
> --utf8 albanian-english
> dictzip albanian-english.dict
Or, all of the above in one step:
    #!/usr/bin/perl -w
    # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
    use strict;

    die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
        unless @ARGV;

    use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
        OUT => ":utf8";

    (my $dct = $ARGV[0]) =~ s/\.wb$//;
    $dct =~ tr/_ A-Z/-_a-z/;
    open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
        or die "Pipe failure: $!\n";

    {
        my $ret1 = read $in, my $src, 31;
        my $ret2 = read $in, my $tgt, 53;
        last unless $ret1 & $ret2;
        s/\0.*// for $src, $tgt;
        printf $out "%s\n   %s\n\n", $src, $tgt;
        redo;
    }
    close $in;
    system ('dictzip', "$dct.dict");

    print <<"+EOT+"
    database $dct.dict.dz
    {
           data  /usr/share/dictd/$dct.dict.dz
           index /usr/share/dictd/$dct.index
    }
    +EOT+
Just specify the '.wb' file as the first argument and its encoding as the second.
> (as root)
> cp albanian-english.* /usr/share/dictd/
>
> add these lines to /var/lib/dictd/db.list :
> database albanian-english
> {
>   data  /usr/share/dictd/albanian-english.dict.dz
>   index /usr/share/dictd/albanian-english.index
> }
For convenience, the script actually spits that out so it can be copied and pasted.
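For anyone sticking with the C converter, the same stanza can be generated there. A small sketch, assuming the /usr/share/dictd/ location used above; the function name format_db_stanza is made up, and unlike the Perl script it uses the bare dictionary name as the database name, as in the hand-written stanza:

```c
#include <stdio.h>

/* Hypothetical helper: write the db.list stanza for dictionary 'name'
 * into 'buf'. Returns 0 on success, -1 if 'buf' is too small. */
int format_db_stanza(const char *name, char *buf, size_t size)
{
    int n = snprintf(buf, size,
                     "database %s\n"
                     "{\n"
                     "  data  /usr/share/dictd/%s.dict.dz\n"
                     "  index /usr/share/dictd/%s.index\n"
                     "}\n",
                     name, name, name);

    /* snprintf reports the length it wanted; anything >= size was cut. */
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```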
--
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
> On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:
>> Freelang has a lot of (usually small) dictionaries, for Windows. They
>> have quite a few languages that aren't easy to find dictionaries for,
>> so though the coverage and quality are usually quite low, they're
>> sometimes all that's there.
>>
>> So, an example: http://www.freelang.net/dictionary/albanian.php
>>
>> Leads to a file, dic_albanian.exe
>
> Sweet. Thanks, Jimmy - I can use that!
>
>> This runs quite well in Wine (I haven't found any other way of
>> extracting the contents). On my system, the 'C:\users\jim\Local
>> Settings\Application Data\Freelang Dictionary' translates to
>> '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\
>> Dictionary/'. The dictionary files are inside the 'language'
>> directory.
>
> Oh, right - reminds me: for stuff like this, I've got a special
> directory I use so I don't have to hunt through the WINE structure. I
> created a symlink at ".wine/drive_c/temp/to_unix" that points to my /tmp
> directory, so if I just install the program to that directory, it shows
> up in my /tmp, all ready to be played with.
>
>> Saving this as wb2dict.c:
>
> [snip]
>
> Whoops - that double-prints the last entry in the dictionary. Not a
> big deal, though.
>

Ah well... I spent more time on the dict stuff than on looking at the raw files or writing the C.
It also loses the first entry (I think) because of the way dictfmt adds its initial entries.
>> The next step depends on the contents... Albanian on Windows uses
>> Codepage 1250, so in this case:
>>
>> ./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f
>> --utf8 albanian-english
>> dictzip albanian-english.dict
>
> Or, all of the above in one step:
>
> ```
> #!/usr/bin/perl -w
> # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
> use strict;
>
> die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
>     unless @ARGV;
>
> use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
>     OUT => ":utf8";
>
> (my $dct = $ARGV[0]) =~ s/\.wb$//;
> $dct =~ tr/_ A-Z/-_a-z/;
> open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
> open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
>     or die "Pipe failure: $!\n";
>
> {
>     my $ret1 = read $in, my $src, 31;
>     my $ret2 = read $in, my $tgt, 53;
>     last unless $ret1 & $ret2;
>     s/\0.*// for $src, $tgt;
Not quite. The reason I used C was because the data showed some evidence of C string reuse:

    schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
    "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
    factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0

... so you'd at least need to split both strings on \0
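In other words, each field is a NUL-terminated string followed by whatever was left in the buffer, and only the part before the first NUL matters. A sketch of pulling that part out with an explicit bound, so that a field with no NUL at all cannot run past its 31 or 53 bytes (the helper name field_to_string is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the text before the first NUL of a
 * fixed-size field into 'out', always NUL-terminating the result.
 * Anything after that first NUL (leftovers such as "ch") is ignored.
 * Returns the number of characters copied. */
size_t field_to_string(const char *field, size_t field_len,
                       char *out, size_t out_size)
{
    const char *nul;
    size_t n;

    if (out_size == 0)
        return 0;

    /* Look for the terminator, but never beyond the field itself. */
    nul = memchr(field, '\0', field_len);
    n = (nul != NULL) ? (size_t) (nul - field) : field_len;

    /* Leave room for our own terminator. */
    if (n > out_size - 1)
        n = out_size - 1;

    memcpy(out, field, n);
    out[n] = '\0';
    return n;
}
```

With output buffers of 32 and 54 bytes for the two fields, passing the results to printf's "%s" is safe even for an entry that fills its field completely.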
>     printf $out "%s\n   %s\n\n", $src, $tgt;
>     redo;
> }
> close $in;
> system ('dictzip', "$dct.dict");
>
> print <<"+EOT+"
> database $dct.dict.dz
> {
>        data  /usr/share/dictd/$dct.dict.dz
>        index /usr/share/dictd/$dct.index
> }
> +EOT+
> '''
>
> Just specify the '.wb' file as the first argument and its encoding as
> the second.
>
>> (as root)
>> cp albanian-english.* /usr/share/dictd/
>>
>> add these lines to /var/lib/dictd/db.list :
>> database albanian-english
>> {
>>   data  /usr/share/dictd/albanian-english.dict.dz
>>   index /usr/share/dictd/albanian-english.index
>> }
>
> For convenience, the script actually spits that out so it can be copied
> and pasted.
>
>
> --
> * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
>
> TAG mailing list
> TAG at lists.linuxgazette.net
> http://lists.linuxgazette.net/listinfo.cgi/tag-linuxgazette.net
>
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:36, Jimmy O'Regan <joregan at gmail.com> wrote:
>>> Saving this as wb2dict.c:
>>
>> [snip]
>>
>> Whoops - that double-prints the last entry in the dictionary. Not a
>> big deal, though.
>>
>
> Ah well... I spent more time on the dict stuff than looking at the raw
> files/writing the C
>
> It also loses the first entry (I think) because of the way dictfmt
> adds its initial entries.
>
This version fixes both problems:
    #include <stdlib.h>
    #include <stdio.h>

    int main (int argc, char** argv)
    {
        char src[31];
        char trg[53];
        int c;
        FILE* f=fopen(argv[1], "r");
        if (f==NULL)
        {
            fprintf (stderr, "Error reading file: %s\n", argv[1]);
            exit(1);
        }

        printf ("00-database-info\n Converted from %s\n\n", argv[1]);
        printf ("00-dummy-entry\n For dictfmt\n\n");

        while ((c = (int) fgetc(f)) != EOF)
        {
            ungetc(c, f);
            fread(&src, sizeof(char), 31, f);
            fread(&trg, sizeof(char), 53, f);
            printf ("%s\n %s\n\n", src, trg);
        }
        fclose(f);
        exit(0);
    }
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
>
> > {
> >     my $ret1 = read $in, my $src, 31;
> >     my $ret2 = read $in, my $tgt, 53;
> >     last unless $ret1 & $ret2;
> >     s/\0.*// for $src, $tgt;
>
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>
> ... so you'd at least need to split both strings on \0
Actually, except for the double-printed entry, it produces precisely the same output as your program - so that seems to work just fine.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
>
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>
> ... so you'd at least need to split both strings on \0

Just recalled: C strings are null-terminated, right? That means anything that reads the buffer as a string (printf's "%s", in your program) stops at that first null, regardless of the content after it. I'm just doing that manually.
    #include <stdlib.h>
    #include <stdio.h>

    int main()
    {
        char *str = "abc\0def";
        printf("%s\n", str);
        exit(0);
    }
This will only print the first three characters of the string.
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 18:13, Ben Okopnik <ben at linuxgazette.net> wrote:
> On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
>> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
>>
>> > {
>> >     my $ret1 = read $in, my $src, 31;
>> >     my $ret2 = read $in, my $tgt, 53;
>> >     last unless $ret1 & $ret2;
>> >     s/\0.*// for $src, $tgt;
>>
>> Not quite. The reason I used C was because the data showed some
>> evidence of C string reuse:
>> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>>
>> ... so you'd at least need to split both strings on \0
>
> Actually, except for the double-printed entry, it produces precisely the
> same output as your program - so that seems to work just fine.
>

Sorry, misread "s/\0.*//". I need 1) new glasses, and 2) to clean my monitor.
Jimmy O'Regan [joregan at gmail.com]
On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
> ```
> #!/usr/bin/perl -w
> # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
> use strict;
>
> die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
>     unless @ARGV;
>
> use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
>     OUT => ":utf8";
>
> (my $dct = $ARGV[0]) =~ s/\.wb$//;
> $dct =~ tr/_ A-Z/-_a-z/;
> open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
> open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
>     or die "Pipe failure: $!\n";
>
print $out "00-dummy-entry\n For dictfmt\n\n";
here will get rid of the second bug I had
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 06:55:24PM +0100, Jimmy O'Regan wrote:
>
> print $out "00-dummy-entry\n For dictfmt\n\n";
>
> here will get rid of the second bug I had
OK, so the "improved" version looks like this (I was trying to remember what in Perl handles C strings... 'pack/unpack', of course):
    #!/usr/bin/perl -w
    # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
    use strict;

    die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
        unless @ARGV;

    use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
        OUT => ":utf8";

    (my $dct = $ARGV[0]) =~ s/\.wb$//;
    $dct =~ tr/_ A-Z/-_a-z/;
    open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
    open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
        or die "Pipe failure: $!\n";

    my $src;
    print $out "00-dummy-entry\n\tFor dictfmt\n\n";
    printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;
    close $in;
    system ('dictzip', "$dct.dict");

    print <<"+EOT+"
    database $dct.dict.dz
    {
           data  /usr/share/dictd/$dct.dict.dz
           index /usr/share/dictd/$dct.index
    }
    +EOT+
The amusing part is the amount of work done by that "printf" line. Real workhorse, that thing.
Ben Okopnik [ben at linuxgazette.net]
On Sun, Sep 05, 2010 at 02:28:53PM -0400, Benjamin Okopnik wrote:
Whoops, one mistake there:
> printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;
Should be
printf $out "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;