Where can I find an array of the (un)assigned Unicode code points for a particular block?



Question:

At the moment, I'm writing these arrays by hand.

For example, the Miscellaneous Mathematical Symbols-A block has an entry in my hash like this:

    my %symbols = (
        ...
        miscellaneous_mathematical_symbols_a => [(0x27C0..0x27CA), 0x27CC,
            (0x27D0..0x27EF)],
        ...
    )

The simpler, 'continuous' array

    miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF]

doesn't work because Unicode blocks have holes in them. For example, there's nothing at 0x27CB. Take a look at the code chart [PDF].

Writing these arrays by hand is tedious, error-prone and a bit fun. And I get the feeling that someone has already tackled this in Perl!


Solution:1

Perhaps you want Unicode::UCD? Use its charblock routine to get the range of any named block. If you want to get those names, you can use charblocks.
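As a minimal sketch of those two routines: charblock, given a block name, returns a reference to a list of ranges, and charblocks returns a hash reference keyed by block name. The helper names block_range and block_names are made up for this illustration and are not part of Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charblocks);

# Hypothetical helper: first and last code point of a named block.
# charblock() returns [[ start, stop, name ]] for a known block name.
sub block_range {
    my ($name) = @_;
    my $ranges = charblock($name) or return;
    return @{ $ranges->[0] }[0, 1];
}

# Hypothetical helper: sorted block names that match a pattern.
sub block_names {
    my ($pattern) = @_;
    my @names = grep { /$pattern/ } keys %{ charblocks() };
    return sort @names;
}

my ($start, $stop) = block_range('Miscellaneous Mathematical Symbols-A');
printf "%X..%X\n", $start, $stop;    # prints 27C0..27EF

print "$_\n" for block_names(qr/Math/);
```

Note that the range only gives the block boundaries. It says nothing about which code points inside the block are assigned.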

This module is really just an interface to the Unicode databases that come with Perl already, so if you have to do something fancier, you can look at the lib/5.x.y/unicore/UnicodeData.txt or the various other files in that same directory to get what you need.

Here's what I came up with to create your %symbols. I go through all the blocks (although in this sample I skip the ones without "Math" in their name). I get the starting and ending code points and check which ones are assigned. From that, I create a custom property that I can use to check whether a character is in the block and assigned.

    use strict;
    use warnings;

    digest_blocks();

    my $property = 'My::InMiscellaneousMathematicalSymbolsA';

    foreach ( 0x27BA..0x27F3 )
        {
        my $in = chr =~ m/\p{$property}/;

        printf "%X is %sin $property\n",
            $_, $in ? '' : 'not ';
        }

    sub digest_blocks {
        use Unicode::UCD qw(charblocks);

        my $blocks = charblocks();

        foreach my $block ( keys %$blocks )
            {
            next unless $block =~ /Math/; # just to make the output small

            my( $start, $stop ) = @{ $blocks->{$block}[0] };

            $blocks->{$block} = {
                assigned   => [ grep { chr =~ /\A\p{Assigned}\z/ } $start .. $stop ],
                unassigned => [ grep { chr !~ /\A\p{Assigned}\z/ } $start .. $stop ],
                start      => $start,
                stop       => $stop,
                name       => $block,
                };

            define_my_property( $blocks->{$block} );
            }
        }

    sub define_my_property {
        my $block = shift;

        (my $subname = $block->{name}) =~ s/\W//g;
        $block->{my_property} = "My::In$subname"; # needs In or Is

        no strict 'refs';
        my $string = join "\n", # can do ranges here too
            map { sprintf "%X", $_ }
            @{ $block->{assigned} };

        *{"My::In$subname"} = sub { $string };
        }

If I were going to do this a lot, I'd use the same thing to create a Perl source file that has the custom properties already defined so I can just use them right away in any of my work. None of the data should change until you update your Unicode data.

    sub define_my_property {
        my $block = shift;

        (my $subname = $block->{name}) =~ s/\W//g;
        $block->{my_property} = "My::In$subname"; # needs In or Is

        no strict 'refs';
        my $string = num2range( @{ $block->{assigned} } );

        print <<"HERE";
    sub My::In$subname {
        return <<'CODEPOINTS';
    $string
    CODEPOINTS
        }

    HERE
        }

    # http://www.perlmonks.org/?node_id=87538
    sub num2range {
      local $_ = join ',' => sort { $a <=> $b } @_;
      s/(?<!\d)(\d+)(?:,((??{$++1})))+(?!\d)/$1\t$+/g;
      s/(\d+)/ sprintf "%X", $1/eg;
      s/,/\n/g;
      return $_;
    }

That gives me output suitable for a Perl library:

    sub My::InMiscellaneousMathematicalSymbolsA {
        return <<'CODEPOINTS';
    27C0    27CA
    27CC
    27D0    27EF
    CODEPOINTS
        }

    sub My::InSupplementalMathematicalOperators {
        return <<'CODEPOINTS';
    2A00    2AFF
    CODEPOINTS
        }

    sub My::InMathematicalAlphanumericSymbols {
        return <<'CODEPOINTS';
    1D400   1D454
    1D456   1D49C
    1D49E   1D49F
    1D4A2
    1D4A5   1D4A6
    1D4A9   1D4AC
    1D4AE   1D4B9
    1D4BB
    1D4BD   1D4C3
    1D4C5   1D505
    1D507   1D50A
    1D50D   1D514
    1D516   1D51C
    1D51E   1D539
    1D53B   1D53E
    1D540   1D544
    1D546
    1D54A   1D550
    1D552   1D6A5
    1D6A8   1D7CB
    1D7CE   1D7FF
    CODEPOINTS
        }

    sub My::InMiscellaneousMathematicalSymbolsB {
        return <<'CODEPOINTS';
    2980    29FF
    CODEPOINTS
        }

    sub My::InMathematicalOperators {
        return <<'CODEPOINTS';
    2200    22FF
    CODEPOINTS
        }
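For illustration, here is a rough sketch of how one of those generated subs might be used once it lives in a library file. The property sub is copied from the output above. The wrapper is_assigned_math_a is a hypothetical name that I made up for this example. Perl's user-defined properties accept one hex code point per line, or two hex numbers separated by horizontal whitespace for a range.

```perl
use strict;
use warnings;

# Generated property definition, as it would appear in the library file.
sub My::InMiscellaneousMathematicalSymbolsA {
    return <<'CODEPOINTS';
27C0	27CA
27CC
27D0	27EF
CODEPOINTS
}

# Hypothetical wrapper: true if the code point is listed in the property.
sub is_assigned_math_a {
    my ($code_point) = @_;
    return chr($code_point) =~ /\A\p{My::InMiscellaneousMathematicalSymbolsA}\z/
        ? 1 : 0;
}

# Check the code points around the hole at 0x27CB.
printf "%X is %sin the property\n", $_, is_assigned_math_a($_) ? '' : 'not '
    for 0x27CA .. 0x27CC;
```

Because the list is baked into the sub, the answer for 0x27CB stays "not in" even if a newer perl ships Unicode data that assigns it. That is the trade-off of generating the file ahead of time.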


Solution:2

Maybe this?

    my @list =
        grep {chr ($_) =~ /^\p{Assigned}$/}
        0x27C0..0x27EF;
    @list = map { $_ = sprintf ("%X", $_ )} @list;
    print "@list\n";

Gives me

    27C0 27C1 27C2 27C3 27C4 27C5 27C6 27C7 27C8 27C9 27CA 27D0 27D1 27D2 27D3
    27D4 27D5 27D6 27D7 27D8 27D9 27DA 27DB 27DC 27DD 27DE 27DF 27E0 27E1 27E2
    27E3 27E4 27E5 27E6 27E7 27E8 27E9 27EA 27EB
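The same \p{Assigned} test can be combined with Unicode::UCD so the block boundaries do not have to be typed in by hand. This is only a sketch: split_block is a hypothetical helper name, and the result depends on the Unicode version bundled with your perl. That is presumably why the list above has no 27CC even though the question's hand-written array does.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock);

# Hypothetical helper: split a named block into assigned and
# unassigned code points, according to this perl's Unicode data.
sub split_block {
    my ($name) = @_;
    my $ranges = charblock($name) or return;
    my ($start, $stop) = @{ $ranges->[0] };

    my (@assigned, @unassigned);
    for my $code_point ($start .. $stop) {
        if (chr($code_point) =~ /\A\p{Assigned}\z/) {
            push @assigned, $code_point;
        }
        else {
            push @unassigned, $code_point;
        }
    }
    return { assigned => \@assigned, unassigned => \@unassigned };
}

my $split = split_block('Miscellaneous Mathematical Symbols-A');
printf "Unicode %s: %d assigned, %d unassigned\n",
    Unicode::UCD::UnicodeVersion(),
    scalar @{ $split->{assigned} },
    scalar @{ $split->{unassigned} };
```

The assigned list can go straight into a hash entry such as miscellaneous_mathematical_symbols_a => $split->{assigned}.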


Solution:3

I don't know why you wouldn't say miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF], because that's how the Unicode standard is defined according to the PDF.

What do you mean when you say it doesn't "work"? If it's giving you some sort of error when you check the existence of the character in the block, then why not just weed them out of the block when your checker comes across an error?

