Category Archives: Linux

Protection from unintended Reboots in Linux

Handling several servers in different concurrent SSH sessions can lead to confusion. This is especially dangerous when it comes to unintended reboots. Here, molly-guard steps in by adding a confirmation dialog to each reboot command that is executed from a remote shell.

You can install molly-guard in Ubuntu with the following command:

sudo aptitude install molly-guard
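
Once installed, a reboot or shutdown issued from within an SSH session is intercepted and molly-guard asks you to type the hostname of the machine before it proceeds. The session below is only a sketch; the exact wording differs between versions:

sudo reboot
W: molly-guard: SSH session detected!
Please type in hostname of the machine to reboot: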

Quick & Dirty VoIP Conference Room

This note describes the quick & dirty setup of a VoIP conference room using Asterisk PBX and Sipgate. Three files have to be adjusted accordingly:

sip.conf

[general]
canreinvite=nonat
nat=no
progressinband=yes
limitonpeers=yes
bindport = 5060
bindaddr = 0.0.0.0
context=incoming
qualify=no
callcounter=yes
allow=ulaw
allow=alaw
allow=g722
allow=g723
allow=g726
allow=gsm
srvlookup=yes
language=en
dtmfmode = rfc2833
register => <sipgate_username>:<sipgate_secret>@sipgate.de/<sipgate_phonenumber>
allowsubscribe=yes
notifyringing=yes
notifybusy=yes
busy-level=1
fromdomain=<your_hostname>
;=========================
[sipgate]
context=conferencecontext
type=friend
insecure=invite,port
; nat=yes
username=<sipgate_username>
fromuser=<sipgate_phonenumber>
fromdomain=sipgate.de
secret=<sipgate_secret>
host=sipgate.de
;qualify=yes

extensions.conf

[conferencecontext]
exten => <sipgate_phonenumber>,1,Goto(conf,1)
; uncomment in case recording is needed
; exten => conf,1,Set(MEETME_RECORDINGFILE=/tmp/conference-recording)
exten => conf,1,Meetme(1234,sr)
exten => conf,2,Hangup()

meetme.conf

[general]
[rooms]
conf => 1234,<conf_room_pin>
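
Once the three files are in place, reload the configuration and verify the Sipgate registration from the Asterisk CLI. The commands below are a sketch using standard Asterisk CLI commands; adjust them to your installation:

asterisk -rx "sip reload"
asterisk -rx "dialplan reload"
asterisk -rx "sip show registry"

With the registration up, a call to <sipgate_phonenumber> should land in MeetMe room 1234 and prompt for <conf_room_pin>.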

Quickly set up a new Root Server at Hetzner’s

Preparations in hetzner robot:

  • Activate rescue-system
  • Order automatic hardware reset
  • Log into the rescue system and change the password

Now, automatic setup can be done via installimage -c {configfile} where {configfile} contains the following:

DRIVE1 /dev/sda
DRIVE2 /dev/sdb
SWRAID 1
SWRAIDLEVEL 1
BOOTLOADER grub
HOSTNAME d0
PART swap   swap      8GB 
PART /boot  ext3       256M 
PART /      ext3      4GB 
PART /var   ext3      2GB 
PART lvm    vg0       all 
LV   vg0    kvm   /kvm    xfs   20G
IMAGE /root/.oldroot/nfs/install/../images/Ubuntu-1010-maverick-64-minimal.tar.gz

Output:

                Hetzner Online AG - installimage

  Your server will be installed now, this will take some minutes
             You can abort at any time with CTRL+C ...

         :  Reading configuration                           done 
   1/14  :  Deleting partitions                             done 
   2/14  :  Creating partitions and /etc/fstab              done 
   3/14  :  Creating software RAID level 1                  done 
   4/14  :  Creating LVM volumes                            done 
   5/14  :  Formatting partitions
         :    formatting /dev/md0 with swap                 done 
         :    formatting /dev/md1 with ext3                 done 
         :    formatting /dev/md2 with ext3                 done 
         :    formatting /dev/md3 with ext3                 done 
         :    formatting /dev/vg0/kvm with xfs              done 
   6/14  :  Mounting partitions                             done 
   7/14  :  Extracting image (local)                        done 
   8/14  :  Setting up network for eth0                     done 
   9/14  :  Executing additional commands
         :    Generating new SSH keys                       done 
         :    Generating mdadm config                       done 
         :    Generating ramdisk                            done 
         :    Generating ntp config                         done 
         :    Setting hostname                              done 
  10/14  :  Setting up miscellaneous files                  done 
  11/14  :  Setting root password                           done 
  12/14  :  Installing bootloader grub                      done 
  13/14  :  Running some ubuntu specific functions          done 
  14/14  :  Clearing log files                              done 

                  INSTALLATION COMPLETE
   You can now reboot and log in to your new system with
  the same password as you logged in to the rescue system.

Now you can reboot into your new system.
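
After the first boot it is worth checking that the software RAID, the LVM volume and the mount points match the install config. A quick sanity check could look like this:

cat /proc/mdstat            # md0-md3 should be active raid1 arrays showing [UU]
vgs && lvs                  # vg0 with the 20G kvm logical volume
df -h /boot / /var /kvm     # mount points from the PART/LV lines above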

OpenWrt System Upgrade

There are two ways of easily upgrading an OpenWrt system. The old and nowadays deprecated way is:

mtd -r write [image_name] linux

Newer versions (I assume version > backfire) come with the sysupgrade tool:

sysupgrade [image_name] 
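
In both cases the image should be placed in RAM first. A typical sysupgrade run looks roughly like this (the download URL and image name are placeholders for your target):

cd /tmp
wget http://downloads.openwrt.org/<release>/<target>/openwrt-<target>-squashfs-sysupgrade.bin
sysupgrade -v /tmp/openwrt-<target>-squashfs-sysupgrade.bin

Add the -n flag if the existing configuration should not be preserved across the flash.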

Using Ipredator independently of the default Route

This article is a copy of the howto from the Ubuntu forum by simonn (http://ubuntuforums.org/showthread.php?t=1472045). All credit goes to him. I'm just copying this for preservation purposes.

I have a home server running Lucid which basically runs our home lan, but I also wanted to be able to run transmission-daemon over an ipredator VPN connection completely independently of the ethernet port (as far as the application layer is concerned anyway).

Most of the howtos for setting up VPN use the VPN as the default route, however I still wanted to run a webserver, dnsmasq etc, not to mention free bandwidth access to my ISP etc.

Thus, this howto.

The ppp connection still tunnels through eth0, but as far as everything else is concerned my server has two independent network ports, eth0 and ppp0, and applications use the default route via eth0 to our router unless explicitly directed down pppX.

I assume that you already have transmission-daemon installed.

First, install pptp-linux (the PPTP client package):

$ sudo apt-get install pptp-linux

Create /etc/ppp/peers/ipredator, replace <username> with your user name.

pty "pptp vpn.ipredator.se --nolaunchpppd --loglevel 0"
lock
noauth
nobsdcomp
nodeflate
name <username>
remotename ipredator
ipparam ipredator
require-mppe-128
refuse-eap
maxfail 0
persist
mru 1435
mtu 1435
nolog

Edit /etc/ppp/chap-secrets and add a line like so, replacing <username> and <password> with your username and password:

# Secrets for authentication using CHAP
# client        server      secret        IP addresses
<username>      ipredator   <password>    *

For the ppp interface to work independently, we need to create a routing table for it. Edit /etc/iproute2/rt_tables and add the "100 ipredator" line so it looks like below:

#
# reserved values
#
255 local
254 main
253 default
0   unspec
#
# local
#
#1  inr.ruhep
100 ipredator

Edit /etc/default/transmission-daemon and add the BIND_ADDRESS parameter. Just set BIND_ADDRESS to the placeholder 1.2.3.4 as shown below; the IP address will be changed to the IP address of the ipredator ppp connection by /etc/ppp/ip-up.d/010ipredator when the connection is started/restarted.

# defaults for transmission-daemon
# sourced by /etc/init.d/transmission-daemon

# change to 0 to disable daemon
ENABLE_DAEMON=1

# this directory stores some runtime information, like torrent files and config
CONFIG_DIR="/var/lib/transmission-daemon/info" 

BIND_ADDRESS=1.2.3.4

# default options for daemon, see transmission-daemon(1) for more options
OPTIONS="-g $CONFIG_DIR -i $BIND_ADDRESS"

Create /etc/ppp/ip-up.d/010ipredator. This script is run whenever a connection is started. We use this script to set up the routing rules, firewall rules and to restart transmission-daemon binding it to the ip address of the ppp connection.

Note that you have to script this as a restart as /etc/ppp/ip-down.d/010ipredator is not called if the connection drops.

#!/bin/sh
#PPP_IPPARAM    : ipparam set in /etc/ppp/peers/ipredator
#IFNAME     : interface name. Usually ppp0.
#PPP_REMOTE : remote ip address
#PPP_LOCAL  : local ip address, i.e. the ip address of pppX

if [ "$PPP_IPPARAM" = "ipredator" ]; then
    # Delete any dangling ipredator rules
        ip rule | sed -n 's/.*\(from[ \t]*[0-9\.]*\).*ipredator/\1/p' | while read RULE
    do
        ip rule del $RULE
    done

    # Delete any unnecessary and dangling ipredator routes
    ip route | sed -n 's/^\(93.182.[0-9]*.2\).*/\1/p' | while read ROUTE
    do
        ip route del $ROUTE
    done

    # Add the rule to direct all traffic from pppX ip address to
    # the ipredator routing table
    ip rule add from $PPP_LOCAL lookup ipredator

    # Add the route to direct all traffic using the ipredator
    # routing table to the pppX interface
    ip route add default dev $IFNAME table ipredator 

    # ntpd will use the pppX interface, so block it
    iptables -A OUTPUT -o $IFNAME -p udp --dport 123 -j DROP

    # Open DHT port on pppX
    iptables -A INPUT -i $IFNAME -p tcp --dport 51413 -j ACCEPT

    # Bind transmission-daemon to the address of pppX
    sed -i "s/BIND_ADDRESS=[0-9\.]*/BIND_ADDRESS=$PPP_LOCAL/g" /etc/default/transmission-daemon

    # Restart transmission-daemon. Uncomment after testing.
    #/etc/init.d/transmission-daemon restart

fi

Create /etc/ppp/ip-down.d/010ipredator. No comments as it should be clear what is going on here. This is run whenever the ipredator connection is stopped. It is not run if the connection drops.

#!/bin/sh

if [ "$PPP_IPPARAM" = "ipredator" ]; then
    ip rule | sed -n 's/.*\(from[ \t]*[0-9\.]*\).*ipredator/\1/p' | while read RULE
    do
        ip rule del $RULE
    done

    ip route | sed -n 's/^\(93.182.[0-9]*.2\).*/\1/p' | while read ROUTE
    do
        ip route del $ROUTE
    done

    /etc/init.d/transmission-daemon stop

    iptables -D OUTPUT -o $IFNAME -p udp --dport 123 -j DROP
    iptables -D INPUT -i $IFNAME -p tcp --dport 51413 -j ACCEPT
fi

To start ipredator:

$ sudo pon ipredator

After a few seconds, if all goes well, running ifconfig should show a pppX entry, e.g.

$ ifconfig
....
ppp0      Link encap:Point-to-Point Protocol  
          inet addr:93.182.x.x  P-t-P:93.182.x.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1431  Metric:1
          RX packets:28291 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34498 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:3 
          RX bytes:9986616 (9.9 MB)  TX bytes:25842958 (25.8 MB)
....

inet addr:93.182.x.x is the ip address of the vpn connection.

If this interface does not appear, look in /var/log/syslog for pppd messages.
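
You can also check that the rule and route installed by /etc/ppp/ip-up.d/010ipredator are in place:

$ ip rule show | grep ipredator
$ ip route show table ipredator

The first command should show a rule like "from <pppX ip address> lookup ipredator", the second a default route via the ppp interface.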

Test the connection:

The following should return the ip address supplied by your ISP:

$ wget -qO - ip1.dynupdate.no-ip.com

The following, replacing with the obvious, should return your ipredator ip address (the pppX ip address):

$ wget --bind-address <pppX ip address> -qO - ip1.dynupdate.no-ip.com

If both the wget tests above work, all is well in the world. Uncomment the "/etc/init.d/transmission-daemon restart" line in /etc/ppp/ip-up.d/010ipredator and...

$ sudo poff ipredator
$ sudo pon ipredator

This will start transmission-daemon automatically.

Using netstat -a you should see loads of connections to ipredator made by transmission-daemon when torrents are started.

You can use many other commands via ipredator, but you have to explicitly use the pppX interface or IP address, e.g. wget as above, traceroute, etc. If you want to use a browser via pppX you will need to set up a proxy server and bind/(re)start it like transmission-daemon in /etc/ppp/ip-up.d/010ipredator; tinyproxy is probably your best bet for this.

OpenSSL Telnet Check

As an example:

openssl s_client -connect imap.gmx.de:993 -quiet
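
The same trick works for services that expect STARTTLS on a plaintext port; openssl can perform the STARTTLS handshake itself (the hostname below is just a placeholder):

openssl s_client -connect mail.example.com:587 -starttls smtp -quiet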

Fix a degraded MD Array

Fix a degraded array (example):

mdadm --re-add /dev/md0 /dev/sdb2
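
Before re-adding a device it helps to check which array is degraded and which member dropped out:

cat /proc/mdstat
mdadm --detail /dev/md0

If the disk was physically replaced rather than just kicked out of the array, use mdadm --add instead of --re-add.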

AWK: The Linux Administrators’ Wisdom Kit

Disclaimer: This article was written by Emmett Dulaney and published at http://crashrecovery.org/dulaney_awk.html. All credit for this article goes to Emmett Dulaney. It's posted here only for personal reference and preservation purposes.

 

Learning Linux? This introduction to the invaluable AWK text-manipulation tool will serve you well.

The AWK utility, with its own self-contained language, is one of the most powerful data processing engines in existence — not only in Linux, but anywhere. The limits to what can be done with this programming and data-manipulation language (named for the last initials of its creators, Alfred Aho, Peter Weinberger, and Brian Kernighan) are the boundaries of one's own knowledge. It allows you to create short programs that read input files, sort data, process it, perform arithmetic on the input, and generate reports, among myriad other functions.

What Is AWK?

To put it the simplest way possible, AWK is a programming-language tool used to manipulate text. The language of the AWK utility resembles the shell-programming language in many areas, although AWK's syntax is very much its own. When first created, AWK was designed to work in the text-processing arena, and the language is based on executing a series of instructions whenever a pattern is matched in the input data. The utility scans each line of a file, looking for patterns that match those given on the command line. If a match is found, it takes the next programming step. If no match is found, it then proceeds to the next line.

While the operations can get complex, the syntax for the command is always:

awk '{pattern + action}' {filenames} 

where pattern represents what AWK is looking for in the data, and action is a series of commands executed when a match is found. Curly brackets ({}) are not always required around your program, but they are used to group a series of instructions based on a specific pattern.

Understanding Fields

The utility separates input into records and fields. A record is a single line of input, and each record consists of several fields. The default field separator is a space or a tab, and the record separator is a newline. Although both tabs and spaces are perceived as field separators by default (multiple blank spaces still count as one delimiter), the delimiter can be changed from white space to any other character.

To illustrate, look at the following employee-list file saved as emp_names:

46012   DULANEY     EVAN        MOBILE   AL
46013   DURHAM      JEFF        MOBILE   AL
46015   STEEN       BILL        MOBILE   AL
46017   FELDMAN     EVAN        MOBILE   AL
46018   SWIM        STEVE       UNKNOWN  AL
46019   BOGUE       ROBERT      PHOENIX  AR
46021   JUNE        MICAH       PHOENIX  AR
46022   KANE        SHERYL      UNKNOWN  AR
46024   WOOD        WILLIAM     MUNCIE   IN
46026   FERGUS      SARAH       MUNCIE   IN
46027   BUCK        SARAH       MUNCIE   IN
46029   TUTTLE      BOB         MUNCIE   IN

As AWK reads the input, the entire record is assigned to the variable $0. Each field, as split with the field separator, is assigned to the variables $1, $2, $3, and so on. A line contains essentially an unlimited number of fields, with each field being accessed by the field number. Thus, the command

awk '{print $1,$2,$3,$4,$5}' emp_names

would result in a printout of

46012 DULANEY EVAN MOBILE AL
46013 DURHAM JEFF MOBILE AL
46015 STEEN BILL MOBILE AL
46017 FELDMAN EVAN MOBILE AL
46018 SWIM STEVE UNKNOWN AL
46019 BOGUE ROBERT PHOENIX AR
46021 JUNE MICAH PHOENIX AR
46022 KANE SHERYL UNKNOWN AR
46024 WOOD WILLIAM MUNCIE IN
46026 FERGUS SARAH MUNCIE IN
46027 BUCK SARAH MUNCIE IN
46029 TUTTLE BOB MUNCIE IN

An important point to note is that AWK reads the five fields as separated by white space, but when it prints them, only one space appears between each field. By virtue of the ability to address each field with a unique number, you can choose to print only certain fields. For example, to print only the names from each record, select only the second and third fields to print:

$ awk '{print $2,$3}' emp_names
DULANEY EVAN
DURHAM JEFF
STEEN BILL
FELDMAN EVAN
SWIM STEVE
BOGUE ROBERT
JUNE MICAH
KANE SHERYL
WOOD WILLIAM
FERGUS SARAH
BUCK SARAH
TUTTLE BOB
$

You can also specify that the fields print in any order, regardless of how they exist in the record. Thus, to show only the name fields, and reverse them so the first name is shown, then the last:

$ awk '{print $3,$2}' emp_names
EVAN DULANEY
JEFF DURHAM
BILL STEEN
EVAN FELDMAN
STEVE SWIM
ROBERT BOGUE
MICAH JUNE
SHERYL KANE
WILLIAM WOOD
SARAH FERGUS
SARAH BUCK
BOB TUTTLE
$

Working with Patterns

You can select the action to take place only on certain records, and not on all records, by including a pattern that must be matched. The simplest form of pattern matching is that of a search, wherein the item to be matched is included in slashes (/pattern/). For example, to perform the earlier action only on those employees who live in Alabama:

$ awk '/AL/ {print $3,$2}' emp_names
EVAN DULANEY
JEFF DURHAM
BILL STEEN
EVAN FELDMAN
STEVE SWIM
$

If you do not specify what fields to print, the entire matching entry will print:

$ awk '/AL/' emp_names
46012   DULANEY     EVAN     MOBILE     AL
46013   DURHAM      JEFF     MOBILE     AL
46015   STEEN       BILL     MOBILE     AL
46017   FELDMAN     EVAN     MOBILE     AL
46018   SWIM        STEVE    UNKNOWN    AL
$

Multiple commands for the same set of data can be separated with a semicolon (;). For example, to print names on one line and city and state on another:

$ awk '/AL/ {print $3,$2 ; print $4,$5}' emp_names
EVAN DULANEY
MOBILE AL
JEFF DURHAM
MOBILE AL
BILL STEEN
MOBILE AL
EVAN FELDMAN
MOBILE AL
STEVE SWIM
UNKNOWN AL
$

If the semicolon were not used (print $3,$2,$4,$5), all would appear on the same line. On the other hand, if the two print statements were given separately, an altogether different result would occur:

$ awk '/AL/ {print $3,$2} {print $4,$5}' emp_names
EVAN DULANEY
MOBILE AL
JEFF DURHAM
MOBILE AL
BILL STEEN
MOBILE AL
EVAN FELDMAN
MOBILE AL
STEVE SWIM
UNKNOWN AL
PHOENIX AR
PHOENIX AR
UNKNOWN AR
MUNCIE IN
MUNCIE IN
MUNCIE IN
MUNCIE IN
$

Fields three and two are printed only when AL is found in the listing. Fields four and five, however, are unconditional and always print; only the commands within the first set of curly braces apply to the pattern (/AL/) immediately preceding them.

The result is altogether cumbersome to read and could use a bit of cleaning up. First, insert a comma and a space between city and state. Next, leave a blank line after each two-line display:

$ awk '/AL/ {print $3,$2 ; print $4", "$5"\n"}' emp_names
EVAN DULANEY
MOBILE, AL

JEFF DURHAM
MOBILE, AL

BILL STEEN
MOBILE, AL

EVAN FELDMAN
MOBILE, AL

STEVE SWIM
UNKNOWN, AL
$

Between the fourth and fifth fields, a comma and a space are added (between the quotation marks), and after the fifth field, a newline character (\n) is printed. All the special characters that can be used with the echo command can also be used with AWK print statements, including:

    \n (new line)
    \t (tab)
    \b (backspace)
    \f (formfeed)
    \r (carriage return)

Thus, to read all five fields, which were originally separated by tabs, and print them with tabs as well, you could program

$ awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5}' emp_names
46012   DULANEY     EVAN     MOBILE    AL
46013   DURHAM      JEFF     MOBILE    AL
46015   STEEN       BILL     MOBILE    AL
46017   FELDMAN     EVAN     MOBILE    AL
46018   SWIM        STEVE    UNKNOWN   AL
46019   BOGUE       ROBERT   PHOENIX   AR
46021   JUNE        MICAH    PHOENIX   AR
46022   KANE        SHERYL   UNKNOWN   AR
46024   WOOD        WILLIAM  MUNCIE    IN
46026   FERGUS      SARAH    MUNCIE    IN
46027   BUCK        SARAH    MUNCIE    IN
46029   TUTTLE      BOB      MUNCIE    IN
$

You can search for more than one pattern match at a time by placing the multiple criteria in consecutive order and separating them with a pipe (|) symbol:

$ awk '/AL|IN/' emp_names
46012   DULANEY     EVAN     MOBILE    AL
46013   DURHAM      JEFF     MOBILE    AL
46015   STEEN       BILL     MOBILE    AL
46017   FELDMAN     EVAN     MOBILE    AL
46018   SWIM        STEVE    UNKNOWN   AL
46024   WOOD        WILLIAM  MUNCIE    IN
46026   FERGUS      SARAH    MUNCIE    IN
46027   BUCK        SARAH    MUNCIE    IN
46029   TUTTLE      BOB      MUNCIE    IN
$

This finds every match for Alabama and Indiana residents. A problem occurs, however, when you try to find the people who live in Arizona:

$ awk '/AR/' emp_names
46019   BOGUE       ROBERT   PHOENIX   AR
46021   JUNE        MICAH    PHOENIX   AR
46022   KANE        SHERYL   UNKNOWN   AR
46026   FERGUS      SARAH    MUNCIE    IN
46027   BUCK        SARAH    MUNCIE    IN
$

Employees 46026 and 46027 do not live in Arizona; however, their first names contain the character sequence being searched for. The important thing to remember is that pattern matching in AWK, as in grep, sed, and most other Linux/Unix commands, looks for a match anywhere in the record (line) unless told to do otherwise. To solve this problem, it is necessary to tie the search to a particular field. This is accomplished by means of a tilde (~) and a reference to a specific field, as the following example illustrates:

$ awk '$5 ~ /AR/' emp_names
46019   BOGUE       ROBERT   PHOENIX   AR
46021   JUNE        MICAH    PHOENIX   AR
46022   KANE        SHERYL   UNKNOWN   AR
$

The opposite of the tilde (signifying a match) is a tilde preceded by an exclamation mark (!~). This combination tells the program to find all lines in which the search sequence does not appear in the specified field:

$ awk '$5 !~ /AR/' emp_names
46012   DULANEY     EVAN     MOBILE    AL
46013   DURHAM      JEFF     MOBILE    AL
46015   STEEN       BILL     MOBILE    AL
46017   FELDMAN     EVAN     MOBILE    AL
46018   SWIM        STEVE    UNKNOWN   AL
46024   WOOD        WILLIAM  MUNCIE    IN
46026   FERGUS      SARAH    MUNCIE    IN
46027   BUCK        SARAH    MUNCIE    IN
46029   TUTTLE      BOB      MUNCIE    IN
$

In this case, it displayed all lines that do not have AR in the fifth field, including the two SARAH entries, which do contain AR, but in the third field instead of the fifth.

Braces and Field Separators

The curly braces play an important part in AWK commands. The actions that appear between them spell out what will take place and when. When only one set of braces is used:

{print $3,$2}

all the actions between them occur at the same time. When more than one set of braces is used:

{print $3}{print $2}

the first sequence of commands is carried out until completion, then the second sequence kicks in. Notice the difference between the two listings that follow:

$ awk '{print $3,$2}' emp_names
EVAN DULANEY
JEFF DURHAM
BILL STEEN
EVAN FELDMAN
STEVE SWIM
ROBERT BOGUE
MICAH JUNE
SHERYL KANE
WILLIAM WOOD
SARAH FERGUS
SARAH BUCK
BOB TUTTLE
$

$ awk '{print $3}{print $2}' emp_names
EVAN
DULANEY
JEFF
DURHAM
BILL
STEEN
EVAN
FELDMAN
STEVE
SWIM
ROBERT
BOGUE
MICAH
JUNE
SHERYL
KANE
WILLIAM
WOOD
SARAH
FERGUS
SARAH
BUCK
BOB
TUTTLE
$

To reiterate the findings with multiple sets of brackets, the commands within the first set are carried out until completion; processing then moves to the second set. If there were a third set, it would go to that set on completion of the second set, and so on. In the generated printout, there are two separate print commands, so the first one is carried out, followed by the second, causing the display for each entry to appear on two lines instead of one.

The field separator differentiating one field from another need not always be white space; it can be any discernible character. To illustrate, assume the emp_names file separated the fields with colons instead of tabs:

$ cat emp_names
46012:DULANEY:EVAN:MOBILE:AL
46013:DURHAM:JEFF:MOBILE:AL
46015:STEEN:BILL:MOBILE:AL
46017:FELDMAN:EVAN:MOBILE:AL
46018:SWIM:STEVE:UNKNOWN:AL
46019:BOGUE:ROBERT:PHOENIX:AR
46021:JUNE:MICAH:PHOENIX:AR
46022:KANE:SHERYL:UNKNOWN:AR
46024:WOOD:WILLIAM:MUNCIE:IN
46026:FERGUS:SARAH:MUNCIE:IN
46027:BUCK:SARAH:MUNCIE:IN
46029:TUTTLE:BOB:MUNCIE:IN
$

If you attempted to print the last name by specifying that you wanted the second field with

$ awk '{print $2}' emp_names

you would end up with twelve blank lines. Because there are no spaces in the file, there are no discernible fields beyond the first one. To solve the problem, AWK must be told that a character other than white space is the delimiter, and there are two methods by which to inform AWK of the new field separator: Use the command-line parameter -F, or specify the variable FS within the program. Both strategies work equally well, with one exception, as illustrated by the following example:

$ awk '{FS=":"}{print $2}' emp_names

DURHAM
STEEN
FELDMAN
SWIM
BOGUE
JUNE
KANE
WOOD
FERGUS
BUCK
TUTTLE
$

$ awk -F: '{print $2}' emp_names
DULANEY
DURHAM
STEEN
FELDMAN
SWIM
BOGUE
JUNE
KANE
WOOD
FERGUS
BUCK
TUTTLE
$

In the first command, a blank line is incorrectly returned for the very first record, while all the others work as they should. It is not until the second record is read that the field separator is recognized and properly acted on. This shortcoming can be corrected by using a BEGIN statement (more on that later, and see the example just below). The -F option works much like a BEGIN and is able to correctly read the first record and act on it as it should.
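
As a quick preview, setting FS inside a BEGIN block, before any input is read, gives the same result as -F:

$ awk 'BEGIN{FS=":"}{print $2}' emp_names

This time DULANEY is printed for the very first record as well.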

As I mentioned at the start of this article, the default display/output field separator is a blank space. This feature can be changed within the program by using the Output Field Separator (OFS) variable. For example, to read the file (separated by colons) and display it with dashes, the command would be

$ awk -F":" '{OFS="-"}{print $1,$2,$3,$4,$5}' emp_names
46012-DULANEY-EVAN-MOBILE-AL
46013-DURHAM-JEFF-MOBILE-AL
46015-STEEN-BILL-MOBILE-AL
46017-FELDMAN-EVAN-MOBILE-AL
46018-SWIM-STEVE-UNKNOWN-AL
46019-BOGUE-ROBERT-PHOENIX-AR
46021-JUNE-MICAH-PHOENIX-AR
46022-KANE-SHERYL-UNKNOWN-AR
46024-WOOD-WILLIAM-MUNCIE-IN
46026-FERGUS-SARAH-MUNCIE-IN
46027-BUCK-SARAH-MUNCIE-IN
46029-TUTTLE-BOB-MUNCIE-IN
$

FS and OFS, (input) Field Separator and Output Field Separator, are but a couple of the variables that can be used within the AWK utility. For example, to number each line as it is printed, use the NR variable in the following manner:

$ awk -F":" '{print NR,$1,$2,$3}' emp_names
1 46012 DULANEY EVAN
2 46013 DURHAM JEFF
3 46015 STEEN BILL
4 46017 FELDMAN EVAN
5 46018 SWIM STEVE
6 46019 BOGUE ROBERT
7 46021 JUNE MICAH
8 46022 KANE SHERYL
9 46024 WOOD WILLIAM
10 46026 FERGUS SARAH
11 46027 BUCK SARAH
12 46029 TUTTLE BOB
$

To find all lines with employee numbers between 46012 and 46015:

$ awk -F":" '/4601[2-5]/' emp_names
46012:DULANEY:EVAN:MOBILE:AL
46013:DURHAM:JEFF:MOBILE:AL
46015:STEEN:BILL:MOBILE:AL
$

Adding Text

Text may be added to the display in the same manner as control sequences or other characters are. For example, to change the delimiter from spaces to colons, the command could be

awk '{print $1":"$2":"$3":"$4":"$5}' emp_names > new_emp_names

In this case, a colon (:), enclosed in quotation marks, is added between each of the fields. This value within the quotation marks can be anything. For example, to create a database-like display of the employees living in Alabama:

$ awk '$5 ~ /AL/ {print "NAME: "$2", "$3"\nCITY-STATE:
  "$4", "$5"\n"}' emp_names

NAME: DULANEY, EVAN
CITY-STATE: MOBILE, AL

NAME: DURHAM, JEFF
CITY-STATE: MOBILE, AL

NAME: STEEN, BILL
CITY-STATE: MOBILE, AL

NAME: FELDMAN, EVAN
CITY-STATE: MOBILE, AL

NAME: SWIM, STEVE
CITY-STATE: UNKNOWN, AL
$

Math Operations

In addition to the textual possibilities AWK provides, it also offers a full range of arithmetic operators, including the following:

+ adds numbers together
- subtracts
* multiplies
/ divides
^ performs exponential mathematics
% gives the modulo

++ adds one to the value of a variable
+= assigns the result of an addition operation to a variable
-- subtracts one from a variable
-= assigns the result of a subtraction operation to a variable
*= assigns the result of multiplication
/= assigns the result of division
%= assigns the result of a modulo operation

For example, assume the following file exists on your machine detailing the inventory in a hardware store:

$ cat inventory
hammers 5       7.99
drills  2      29.99
punches 7       3.59
drifts  2       4.09
bits   55       1.19
saws  123      14.99
nails 800        .19
screws 80        .29
brads 100        .24
$

The first order of business is to compute the value of each item's inventory by multiplying the value of the second field (quantity) by the value of the third field (price):

$ awk '{print $1,"QTY: "$2,"PRICE: "$3,"TOTAL: "$2*$3}' inventory
hammers QTY: 5 PRICE: 7.99 TOTAL: 39.95
drills QTY: 2 PRICE: 29.99 TOTAL: 59.98
punches QTY: 7 PRICE: 3.59 TOTAL: 25.13
drifts QTY: 2 PRICE: 4.09 TOTAL: 8.18
bits QTY: 55 PRICE: 1.19 TOTAL: 65.45
saws QTY: 123 PRICE: 14.99 TOTAL: 1843.77
nails QTY: 800 PRICE: .19 TOTAL: 152
screws QTY: 80 PRICE: .29 TOTAL: 23.2
brads QTY: 100 PRICE: .24 TOTAL: 24
$

If the lines themselves are unimportant, and you want only to determine exactly how many items are in the store, you can assign a generic variable to increment by the number of items in each record:

$ awk '{x=x+$2} {print x}' inventory
5
7
14
16
71
194
994
1074
1174
$

According to this data, 1,174 items are in the store. The first time through, the variable x had no value, so it took the value of the first line's second field. The next time through, it retained the value of the first line and added the value from the second line, and so on, until it arrived at a cumulative total.

The same process can be applied to determining the total value of the inventory on hand:

$ awk '{x=x+($2*$3)} {print x}' inventory
39.95
99.93
125.06
133.24
198.69
2042.46
2194.46
2217.66
2241.66
$

Thus, the value of the 1,174 items is $2,241.66. Although this procedure is good for getting a total, it does not look at all pretty, and it would need sanitizing for an actual report. Sprucing up the display a bit can be easily accomplished with a few additions:

$ awk '{x=x+($2*$3)}{print $1,"QTY: "$2,"PRICE: "$3,"TOTAL: "$2*$3,"BAL: "x}' inventory
hammers QTY: 5 PRICE: 7.99 TOTAL: 39.95 BAL: 39.95
drills QTY: 2 PRICE: 29.99 TOTAL: 59.98 BAL: 99.93
punches QTY: 7 PRICE: 3.59 TOTAL: 25.13 BAL: 125.06
drifts QTY: 2 PRICE: 4.09 TOTAL: 8.18 BAL: 133.24
bits QTY: 55 PRICE: 1.19 TOTAL: 65.45 BAL: 198.69
saws QTY: 123 PRICE: 14.99 TOTAL: 1843.77 BAL: 2042.46
nails QTY: 800 PRICE: .19 TOTAL: 152 BAL: 2194.46
screws QTY: 80 PRICE: .29 TOTAL: 23.2 BAL: 2217.66
brads QTY: 100 PRICE: .24 TOTAL: 24 BAL: 2241.66
$

This procedure gives a listing of each record while assigning a total value to the inventory and keeping a running balance of the store's inventory.

BEGIN and END

Actions can be specified to take place prior to the actual start of processing or after it has been completed, with BEGIN and END statements respectively. BEGIN statements are most commonly used to establish variables or display a header. END statements, on the other hand, are used to carry out any final actions once all the input has been processed.

In an earlier example, a complete value of the inventory was generated with the routine

awk '{x=x+($2*$3)} {print x}' inventory

This routine provided a display for each line in the file as the running total accumulated. There was no other way to specify it, and not having it print at each line would have resulted in it never printing. With an END statement, however, this problem can be circumvented:

$ awk '{x=x+($2*$3)} END {print "Total Value of Inventory: "x}' inventory
Total Value of Inventory: 2241.66
$

The variable x is defined, and it is updated for each line; however, no display is generated until all processing has completed. While it's useful as a standalone routine, it can also be combined with the earlier listing to add even more information and produce a more complete report:

$ awk '{x=x+($2*$3)} {print $1,"QTY: "$2,"PRICE: 
    "$3,"TOTAL: "$2*$3} END {print "Total Value of Inventory: " x}' inventory

hammers QTY: 5 PRICE: 7.99 TOTAL: 39.95
drills QTY: 2 PRICE: 29.99 TOTAL: 59.98
punches QTY: 7 PRICE: 3.59 TOTAL: 25.13
drifts QTY: 2 PRICE: 4.09 TOTAL: 8.18
bits QTY: 55 PRICE: 1.19 TOTAL: 65.45
saws QTY: 123 PRICE: 14.99 TOTAL: 1843.77
nails QTY: 800 PRICE: .19 TOTAL: 152
screws QTY: 80 PRICE: .29 TOTAL: 23.2
brads QTY: 100 PRICE: .24 TOTAL: 24
Total Value of Inventory: 2241.66
$

The BEGIN command works in the same fashion as END, but it establishes items that need to be done before anything else is accomplished. The most common purpose of this procedure is to create headers for reports. The syntax for this routine would resemble

$ awk 'BEGIN {print "ITEM   QUANTITY   PRICE   TOTAL"}'
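
For instance (this example is not part of the original listing), the header can be combined with the per-line inventory report like this:

$ awk 'BEGIN {print "ITEM   QUANTITY   PRICE   TOTAL"} {print $1,$2,$3,$2*$3}' inventory

The BEGIN block prints the header once before the first record, and the body runs for every line as before.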

Input, Output, and Source Files

The AWK tool can read its input from a file, as was done in all examples up to this point, or it can take input from the output of another command. For example:

$ sort emp_names | awk '{print $3,$2}'

The input of the awk command is the output from the sort operation. In addition to sort, any other Linux command can be used — for example, grep. This procedure allows you to perform other operations on the file before pulling out selected fields.

Like the shell, AWK uses the output-redirection operators > and >> to put its output into a file rather than to standard output. The symbols react like their counterparts in the shell, so > creates the file if it doesn't exist, and >> appends to the existing file. Examine the following example:

$ awk '{ print NR, $1 > "/tmp/filez" }' emp_names
$ cat /tmp/filez
1   46012
2   46013
3   46015
4   46017
5   46018
6   46019
7   46021
8   46022
9   46024
10  46026
11  46027
12  46029
$

Examining the syntax of the statement, you can see that the output redirection is done after the print statement is complete. You must enclose the file name in quotes, or else it is simply an uninitialized AWK variable, and the combination of instructions generates an error from AWK. (If you use the redirection symbols improperly, AWK gets confused about whether the symbol means "redirection" or is a relational operator.)

Output into pipes in AWK also resembles the way the same action would be accomplished in a shell. To send the output of a print command into a pipe, follow the print command with a pipe symbol and the name of the command, as in the following:

$ awk '{ print $2 | "sort" }' emp_names
BOGUE
BUCK
DULANEY
DURHAM
FELDMAN
FERGUS
JUNE
KANE
STEEN
SWIM
TUTTLE
WOOD
$

As was the case with output redirection, you must enclose the command in quotes, and the name of the pipe is the name of the command being executed.

Commands used by AWK can come from two locations. First, they can be specified on the command line, as illustrated. Second, they can come from a source file. If such is the case, AWK is alerted to this occurrence by means of the -f option. To illustrate:

$ cat awklist

{print $3,$2}
{print $4,$5,"\n"}
$

$ awk -f awklist emp_names
EVAN DULANEY
MOBILE AL

JEFF DURHAM
MOBILE AL

BILL STEEN
MOBILE AL

EVAN FELDMAN
MOBILE AL

STEVE SWIM
UNKNOWN AL

ROBERT BOGUE
PHOENIX AR

MICAH JUNE
PHOENIX AR

SHERYL KANE
UNKNOWN AR

WILLIAM WOOD
MUNCIE IN

SARAH FERGUS
MUNCIE IN

SARAH BUCK
MUNCIE IN

BOB TUTTLE
MUNCIE IN

$

Notice that the apostrophes are not used anywhere within the source file or when calling it at the command line. They are only for use in differentiating the commands on the command line from file names.

If simple output cannot handle the intricate detail you want in your programs, try the more complex output available with the printf command, the syntax for which is

printf( format, value, value ...)

This syntax is like that of the printf function in the C language, and the format specifications are the same. You define the format by inserting a specification that defines how the value is to be printed. The format specification consists of a % followed by a letter. Like the print command, printf does not have to have its arguments enclosed in parentheses, but using them is considered good practice. (A short example follows the table below.)

The following table lists the various specifications available for the printf command.

Specification   Description
%c  Prints a single ASCII character
%d  Prints a decimal number
%e  Prints a scientific notation representation of numbers
%f  Prints a floating-point representation
%g  Prints %e or %f; whichever is shorter
%o  Prints an unsigned octal number
%s  Prints an ASCII string
%x  Prints an unsigned hexadecimal number
%%  Prints a percent sign; no conversion is performed
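
As a brief illustration (this example is not from the original article), the inventory file can be printed in aligned columns with printf:

$ awk '{printf("%-8s %5d %8.2f\n", $1, $2, $3)}' inventory

Here %-8s left-justifies the item name in eight characters, %5d right-justifies the quantity, and %8.2f prints the price with two decimal places.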

Regex Introduction

Disclaimer: This article was written by Lara Hopley and Jo van Schalkwyk and published at http://www.anaesthetist.com/mnm/perl/regex.htm. All credit for this article goes to Lara Hopley and Jo van Schalkwyk. It's posted here only for personal reference and preservation purposes.

 

What is regex?

A regular expression (or regex) is a simple, rather mindless way of matching a series of symbols to a pattern you have in mind. As we discuss elsewhere, there are certain patterns that you cannot match with regex. But mostly, if you want to find a pattern in text, regex is the way to go, and Perl regex will get you there. As the Perl manual says:

"Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information"

Regex has been around for some time - those who have struggled with computer theory (in basic computing courses at university) will know it well. Actually, it's not that bad. The basic ideas are simple, but powerful.

Basic ideas

The rules are simple:

  1. We want to know whether a text string matches a pattern. A simple 'yes' or 'no' will do nicely, thank you very much.
  2. Every pattern we want a match for, we will turn into a 'finite state machine'.
  3. We will feed the string we want to check into the machine representing our pattern, and the machine will miraculously spit out either yes or no.

And that's it. Well, not quite, for we still have to say how we're going to specify our patterns (and if we're really keen, perhaps look at the machines that are manufactured according to our specifications).

Matches

Using regex, we will at some stage want to match a particular string. Let's say we have a body of text (however long) and want to find if it contains the string "blahblah" at some point. If we find "blahblah", we will return true, otherwise we will fail. Here's the regex:

 /blahblah/ 

Not too bad, was it? All we do is place the text we wish to find in between two (forward) slashes, and Perl does the rest. Note that this regex will match each of the following:

blahblah

I'm so bored blahblah

and so blahblah am I!

As long as the string is somewhere in the text, we have a match!

We lied(!)

Okay, in the example above, we lied just a little. If we have a string in Perl, we have to store it somewhere. Let's say we have the string "ABCblahblahDEF" stored in a Perl string called $mystring - how do we test for the presence of "blahblah" in the string? Actually, it's like this:

$mystring=~/blahblah/ ;

We specify the string to be tested by writing its name, followed by =~ and then the regex. Note that the above test will return a value of true or false, so we might include the test in some actual Perl code as follows:

if ( $mystring=~/blahblah/ )
         { print "Hooray it worked!\n";
         }; 

A tiny aside: note that you can turn around the sense of the regex, so that true becomes false and vice versa, simply by saying:

$mystring !~/blahblah/;

in other words, !~ negates the regex.

Anchoring the regex

If we want to anchor the search, so that the text has to start with "blahblah", then we can say:

/^blahblah/ 

and similarly, if we insist that the text ends with "blahblah", then we put in a dollar sign at the end:

/blahblah$/ 

Okay, the makers of Perl could presumably have thought up more mnemonic symbols, but, they work (and are now time-hallowed). Get to know them.

Matching Anything

Let's say we have now become a bit more ambitious, and wish to match any one of a set of characters. We might want to match one of several words, for example the words "shot" and "shut". The regex is:

/sh[ou]t/ 

We put the several options in square brackets! The above would match shot or shut, but not, say, "shxt". Note that "shout" would NOT be matched - the square brackets select between single options! What if we want to match any single character? Try:

/sh.t/ 

The dot (period, if you wish) can be taken to match any character whatsoever! {Note: except a 'newline', but that's for later}.

Matching several characters

Let's set our sights even higher. Say we wished to 'match' several characters in the middle, for example, "shaaaaght", "shxxt", or even perhaps "sh$##$#$#@@t". How do we do this? Thus:

/sh.+t/ 

The + sign tells Perl to 'match one or more of the preceding character'. As the preceding character was "anything", Perl looks for one or more "anythings", followed by a t ! Ask yourself, what would the following match?

/sho+t/ 

Clearly, shot, shoot, and even shooooooooooot. The question you have to ask yourself is "How do I match zero or more characters?" How could we match say the text string "sht" and also "shot", "shut", and so on? (Which will clearly fail if we try something like /sh.+t/ ). The answer is:

/sh.*t/ 

You have to be careful with this * thingy. You'll find that thinking in terms of "matching absolutely no characters" is sometimes a little tricky, and if you're not careful, you could end up bashing your head quite hard against your keyboard.

Escaping confusion

By now, you're probably saying to yourself "What if I want to match one of those fancy characters you've been using - for example, ^ . / * + and so on?" A problem, but not insuperable. Let's say you want to look for the text string "a + b". If you say:

/a + b/ 

Then you will get a match to an "a" followed by one or more blanks, followed by yet another blank, and then a "b", but you certainly won't get what you want! The solution is:

/a \+ b/ 
  • we use a simple \ (backslash) to indicate that the subsequent character (here the "+") is to be regarded as something to match, and not as some fancy control character. We say that we escape the "+" character. You can do similar things with "\/", "\.", "\[" and so on.

When in doubt, it's probably best to escape. It may not look pretty, but remember that Perl uses an awful lot of characters as special controls. We will soon encounter more!

Case iNseNSItivITy, and more..

Perl is by default CASE SENSITIVE. For example:

/sensitive/ 

will return a match for "I am sensitive, dammit" but will NOT match "I am Sensitive, dammit". It is however easy to render a match case insensitive, thus:

/sensitive/i 
  • all you need do is put the modifier i after the second slash of the regex, and - Voila - case insensitivity! There are other modifiers:

    m - multiple lines (Discussed below)
    s - Treat the whole string as one line, so that even /./ will match a "newline" character.
    x - a rather complex modifier that we will (for now) avoid like the plague!

{note: look up 'locales' for more information about /i modifier; also have a note on $* = }

More matching tricks

There are several more tricks that you will encounter in Perl (nobody ever accused Perl of lacking options, did they?). Here are a few:

? - matches zero or one of the preceding character
{n} - matches n copies of the preceding character!
{n,m} - matches at least n but not more than m copies of the preceding character
{n,} - matches at least n copies of the preceding character.

I would generally avoid most of the above, except where absolutely necessary. Keep it simple.

Greedy matching

It's unfortunate that the "?" character is used to match 'one or none' of the preceding characters, for "?" has quite a distinct use. Consider the regex

/a.+b/ 

and then apply it to the string "a xxx b fjdlfkjdl b". Clearly, there is a match, but is the match with "a xxx b" or with "a xxx b fjdlfkjdl b"? Your initial answer might be "Who cares?", but there is a good reason for our obsessive questioning. We will soon discover how to pull out a matched string, and then things will get really interesting. First, let's resolve our dilemma. The answer is:

Perl by default uses 'greedy' matching

What this means is that /a.+b/ matches the whole darn string, not just "a xxx b". Perl stuffs as much as it can into the match, unless we specifically tell it to be "stingy"! How do we make Perl parsimonious? Easy, we turn off greedy matching using a

?

after the * or +, thus:

/a.+?b/ 

which will then match "a xxx b" when we feed in the above string. You can even say things like:

/a??b/ 

... which we'll leave as an exercise for you to work out! But let's now keep our promise, and tell you how to..

Extract text from the match

It's easy to extract information from part of a match. Consider the regex:

/alpha(.+)gamma/ 

The above clearly will match a string such as "xxalphazzzgamma", as well as "alpha beta gamma delta". But what do the (parentheses) achieve? The answer is simple - everything in parenthesis is put into the Perl variable $1. (If you have a second set of parentheses, the contents of this set go into $2, and so on). So after we feed "xxalphazzzgamma" into our regex, $1 becomes "zzz". Likewise, for our second example, $1 becomes " beta ".
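
A one-liner run from the shell makes this easy to try out; this is just a sketch:

$ perl -e '$s = "alpha beta gamma delta"; if ($s =~ /alpha(.+)gamma/) { print "[$1]\n"; }'
[ beta ]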

It's even possible to reuse (!) the value that goes into $1 inside the very same regex ! To do so, we use a very special convention, instead of saying "$1" within the regex, we instead say:

\1

Which translates as "the value of $1 we've just found, thank you very much". Note that we wrote a backslash followed by a one (not an ell). Let's try an example. First consider the HTML code:

"etc"

... and we wished to pull out the title (The stuff in between the tags). We might say..

/<title>(.+)<\/title>/

Okay, straightforward, isn't it? We find the opening title tag, and then the closing one, and grab the stuff in between into $1. (Incidentally, note how we escaped the "/" character, so that Perl didn't become confused and think "Aha! This is the end of the regex"). But what if we want to get a bit more fancy, and identify the start of any HTML tag, and then its closure. Consider:

"this is bold italic, so there"

We can find the opening <b> tag, and then its closure, by saying:

/<(.+?)>.+<\/\1>/

This looks rather intimidating, until we realise that we have simply used \/ as above to escape the "/", and that \1 is a reference to the value that we've previously grabbed into $1. We now have a way of matching a tag and its closure, without specifying a specific tag such as <title> !

Matching fancy characters

There are many special characters and conventions in Perl. A backslash, followed by an alphabetical character, is commonly used to match newline characters. We will present two tables, one a lot more useful than the other. But before we begin, let's note that:

/[c-q]/ 

is the same as saying

/[cdefghijklmnopq]/

and

/[a-d0-4]/

is the same as

/[abcd01234]/

We can also say "Give me anything OTHER THAN the following.." using the convention

/[^0-4]/ 

which translates as "match any character that is NOT one of [01234]".

Useful Perl characters

Character   Meaning

\n  newline (line feed)

\w  a word character [a-zA-Z0-9_]
\W  NOT a word character, that is [^a-zA-Z0-9_]
\s  white space (new line, carriage return, space, tab, form feed)
\S  NOT white space
\d  a digit [0-9]
\D  NOT a digit, i.e. [^0-9]

See how we Capitalise a special character to reverse its meaning. Now here's a really rather frightening list of other characters and conventions:

Obscure Perl special characters

\t  tab (HT, TAB)

\r  return (CR)
\f  form feed (FF)
\a  alarm (bell) (BEL)
\e  escape (think troff) (ESC)
\033    octal char (think of a PDP-11)
\x1B    hex char

\c[ control char
\l  lowercase next char (think vi)
\u  uppercase next char (think vi)
\L  lowercase till \E (think vi)
\U  uppercase till \E (think vi)
\E  end case modification (think vi)

\Q  quote (disable) pattern metacharacters till \E

The above table was swiped from the Perl monks. Don't get too intimidated by this second table. The main characters you will use will be \Q and \E (see below), and possibly \e. {"vi" is a UNIX editor, now often represented on UNIX and related systems by the excellent vim, and few even remember what a PDP-11 was}!

Yet more matching

Say you wanted to match something that is at the start or end of a word, or a string. Perl even has fancy conventions for these:

\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
\G Match only where previous m//g left off (works only with /g) 

Convenient Perl conventions

Because Perl by default uses the / character to start and end regex, any string that contains multiple slashes soon starts to look like a forest:

http://www.anaesthetist.com/icu/index.htm becomes:

/http:\/\/www\.anaesthetist\.com\/icu\/index\.htm/

... far from attractive. Perl allows us to substitute a different character for the conventional / that delimits regex. For example if we wanted to use the # character, we could say:

m#http://www\.anaesthetist\.com/icu/index\.htm#

Think of m as standing for match. Note that we still have the irritating \. escape of the period character. We can even get rid of this:

m#\Qhttp://www.anaesthetist.com/icu/index.htm\E#

We used the \Q..\E convention from our list above to quote absolutely everything from after the \Q until the \E is encountered. (By the way, this quoting automatically gets turned off when the delimiter character is encountered).

Perl Substitution

The format for substitution is simple:

s/Anne/Jim/

Which means that we want to substitute "Jim" for "Anne" wherever Anne occurs in the given string.

We lied again (!)

Okay, how do we specify the string to assault? Here it is:

$mystring=~s/Anne/Jim/ 

In other words, we simply use our standard regex convention (=~), but place an s between the =~ and the regex itself. Think s for substitute.

Tricks and traps

Note that if you use the above substitution command, only one substitution is made! You can substitute globally throughout the string using:

s/Anne/Jim/g 

where g stands for global. Can you guess what

s/Anne/Jim/ig 

does? Yes, as with the regex above, i makes things case InSENsitIvE! Note that you have to be careful, for Perl won't worry about whether a string is, for example, within a word. If you try to substitute "was" for "is" in the string "This is silly" using

s/is/was/ig 

you won't get "This was silly", you'll get "Thwas was silly".
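
The word-boundary escape \b from the table above is the usual fix; the difference is easy to see with a pair of one-liners:

$ echo "This is silly" | perl -pe 's/is/was/g'
Thwas was silly
$ echo "This is silly" | perl -pe 's/\bis\b/was/g'
This was silly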

Global matching

In Perl it's even possible to use the /g switch for pattern searching, without performing a substitution! At first viewing, this statement doesn't seem to make sense. For who cares if there is one match, or several? In fact, we should care, for it's possible to actually pull out ALL of the matches into a list! Thus:

$_ = "alpha xbetay xgammay xdeltay so on";
($first, $second, $third) = /x(.+?)y/g;

will put beta, gamma and delta into $first, $second and $third respectively! The above needs some explanation:

  1. The reason this works is because by default regex acts on the default pattern searching variable, known to its many friends as:
    $_
    We first set $_ to the string we wish to test.

  2. Perl arrays are described using parentheses, so ($first, $second, $third) is an array to be filled up with goodies.

  3. Perl understands that when we say =, it mustn't simply throw away the results of its pattern searching, but rather put each result (remember that we said /g ) into the corresponding element of the array.

You can even use a global test within a while statement thus:

while ( /x(.+?)y/g )
    { # here do something
    };

... but watch out - if you leave out the /g then the statement will loop forever!

Perl has another operator called split. This is most useful in splitting up a string into component parts, using a specified delimiter, something along the lines of:

@info = split /;/ , $fred ;           #use semicolon delimiter to split $fred

Note that the array @info is filled with the resulting components. The second 'argument' of split is the name of the string to split, in this case, $fred. If we were to say

$count = split /;/, $fred ;  

then we would get back the number of components returned, but the actual values would be thrown away! (It's possible to supply a third parameter to split - the number of elements you want returned).

The opposite of split is the join operator:

$fred = join ';', $alpha,$beta,$gamma;

You can also use an array instead of a comma-delimited list as above.
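
A quick one-liner shows split and join round-tripping a string:

$ perl -e '@info = split /;/, "alpha;beta;gamma"; print join(",", @info), "\n";'
alpha,beta,gamma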

Another use for (parenthesis)

Consider the following regex:

/A(dam|nne|ndrew)/ 

What does it do? In many computer languages, a vertical pipe ( | ) means OR, and Perl is no exception. The regex matches the names "Adam", "Anne" OR "Andrew" - any one will do. There is however a cost - because we used parentheses, $1 is created and filled with the value Adam, Anne or whatever. This is wasteful, so in the interests of efficiency, we have the following alternative convention (which doesn't create $1):

/A(?:dam|nne|ndrew)/ 

There is yet another convention (Perl is stuffed with them, isn't it) that allows us to pull text out of a string without using parentheses! If we use the "variables" $`, $& and $' just after some regex, then they will respectively contain (1) the text before the match, (2) the matched text, and (3) the text after the match. Avoid them - there's a time penalty if you use them at all. The terrible thing is that if you use any of these variables anywhere in your program, Perl will provide them for all regex! (An aside: $+ returns the most recent parenthesis-variable match).

Evaluating the result!

You can do rather complex things in Perl. For example, substitute numbers you've identified with an arithmetic modification of those numbers! For an example, see Robert's Perl tutorial. The magic switch that allows this evaluation is /e .

pdf-Tricks in Linux with pdftk

pdftk is a very useful and powerful PDF toolkit. It is available for Ubuntu in the package pdftk. Here are some practical examples of its usage.

  • Merge Two or More PDFs into a New Document:

    pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf 
  • or (Using Handles):

    pdftk A=1.pdf B=2.pdf cat A B output 12.pdf 
  • or (Using Wildcards):

    pdftk *.pdf cat output combined.pdf 
  • Split Select Pages from Multiple PDFs into a New Document:

    pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf 
  • Rotate the First Page of a PDF to 90 Degrees Clockwise:

    pdftk in.pdf cat 1E 2-end output out.pdf 
  • Rotate an Entire PDF Document's Pages to 180 Degrees:

    pdftk in.pdf cat 1-endS output out.pdf 
  • Encrypt a PDF using 128-Bit Strength (the Default) and Withhold All Permissions (the Default):

    pdftk mydoc.pdf output mydoc.128.pdf owner_pw foopass 
  • Same as Above, Except a Password is Required to Open the PDF:

    pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz 
  • Same as Above, Except Printing is Allowed (after the PDF is Open):

    pdftk mydoc.pdf output mydoc.128.pdf owner_pw foo user_pw baz allow printing 
  • Decrypt a PDF:

    pdftk secured.pdf input_pw foopass output unsecured.pdf 
  • Join Two Files, One of Which is Encrypted (the Output is Not Encrypted):

    pdftk A=secured.pdf mydoc.pdf input_pw A=foopass cat output combined.pdf 
  • Uncompress PDF Page Streams for Editing the PDF Code in a Text Editor:

    pdftk mydoc.pdf output mydoc.clear.pdf uncompress 
  • Repair a PDF's Corrupted XREF Table and Stream Lengths (If Possible):

    pdftk broken.pdf output fixed.pdf 
  • Burst a Single PDF Document into Single Pages and Report its Data to doc_data.txt:

    pdftk mydoc.pdf burst 
  • Report on PDF Document Metadata, Bookmarks and Page Labels:

    pdftk mydoc.pdf dump_data output report.txt