Sorting string lexicographically in Perl

· Read in about 2 min · (300 Words)

This situation comes from soralind, and it is an extension of this post.

(Minor update 2014-08-16: fix the bug of subroutine weight_sum  )

The aim is to sort strings lexicographically NOT alphabetically.  For example, an alphabet with predetermined order:  T A G C . For a set of strings: AT AC TG TC , to list them alphabetically:

AC
AT
TC
TG

lexicographical order:

TG
TC
AT
AC

Similarly, Schwartzian transform by weight is used.

#!/usr/bin/env perl
use strict;

my @alphabet = qw/T A G C/;
my @list = qw/AT AC TG TC/;

print "$_\n" for sort_lexicographically (\@alphabet, \@list);

sub sort_lexicographically {
    my ( $alphabet, $list ) = @_;
    my $n = scalar @$alphabet;  # size of alphabet

    # set weight for every letter
    my %weight = ();
    my $w      = 0;
    for my $key (@alphabet) {
        $w++;
        $weight{$key} = $w;
    }

    # SUM weight of a string
    sub weight_sum {
        my ( $weight, $s ) = @_;
        my $sum = 0;
        my @letters = reverse split //, $s;
        for my $i ( 0 .. $#letters ) {
            $sum += $$weight{ $letters[$i] } * ( $n**$i );
        }
        return $sum;
    }

    # sort by Schwartzian transform
    @list = map { $_->[1] }
        sort { $a->[0] <=> $b->[0] }    # sort by weight
        map { [ weight_sum( \%weight, $_ ), $_ ] } @list;

    return @list;
}

Here are the weights of some strings. It’s kind of 4-bit digital, isn’t it? 4 is the size of alphabet.

string  weight
T       1
A       2
G       3
C       4
TT      5
TA      6
TG      7
TC      8
AT      9
AA      10
AG      11
AC      12
GT      13
GA      14
GG      15
GC      16
CT      17
CA      18
CG      19
CC      20

This sorting strategy could also be easily implemented in other programming languages.