This situation comes from soralind, and it is an extension of this post.
(Minor update 2014-08-16: fix the bug of subroutine weight_sum )
The aim is to sort strings lexicographically NOT alphabetically. For example, an alphabet with predetermined order: T A G C
. For a set of strings: AT AC TG TC
, to list them alphabetically:
AC
AT
TC
TG
lexicographical order:
TG
TC
AT
AC
Similarly, Schwartzian transform by weight is used.
#!/usr/bin/env perl
use strict;
my @alphabet = qw/T A G C/;
my @list = qw/AT AC TG TC/;
print "$_\n" for sort_lexicographically (\@alphabet, \@list);
sub sort_lexicographically {
my ( $alphabet, $list ) = @_;
my $n = scalar @$alphabet; # size of alphabet
# set weight for every letter
my %weight = ();
my $w = 0;
for my $key (@alphabet) {
$w++;
$weight{$key} = $w;
}
# SUM weight of a string
sub weight_sum {
my ( $weight, $s ) = @_;
my $sum = 0;
my @letters = reverse split //, $s;
for my $i ( 0 .. $#letters ) {
$sum += $$weight{ $letters[$i] } * ( $n**$i );
}
return $sum;
}
# sort by Schwartzian transform
@list = map { $_->[1] }
sort { $a->[0] <=> $b->[0] } # sort by weight
map { [ weight_sum( \%weight, $_ ), $_ ] } @list;
return @list;
}
Here are the weights of some strings. It’s kind of 4-bit digital, isn’t it? 4 is the size of alphabet.
string weight
T 1
A 2
G 3
C 4
TT 5
TA 6
TG 7
TC 8
AT 9
AA 10
AG 11
AC 12
GT 13
GA 14
GG 15
GC 16
CT 17
CA 18
CG 19
CC 20
This sorting strategy could also be easily implemented in other programming languages.