Homework 6. Python join

Motivation

Your software team at Amalgamated Data Associates has finished validating its scripts against your utility option usage checker. But you now have a new problem: the application doesn't fit into the limited amount of flash memory that is available in the target embedded GNU/Linux system. You don't want to modify the application, since your team has spent a huge sum of money testing and validating it. So instead you look for ways to shrink the size of the underlying GNU/Linux system. You're willing to give up some runtime performance, so long as the overall system fits into flash memory.

The overall system contains several components, and each member of your team is assigned a set of components to put on a memory diet. Your job is to shrink the GNU Core Utilities (coreutils), which implement many POSIX utilities needed by your application. As is typical in such cases, your application uses BusyBox when possible, but BusyBox lacks some of the standard utilities that your application needs, so you've been using coreutils to fill in the gaps, and those coreutils applications are taking up substantial space.

You try recompiling coreutils with various options, but that doesn't get you enough shrinkage.

Your boss then suggests that you rewrite the biggest remaining utilities yourself, from scratch, in Python. The idea is that then you need just one extra executable in flash memory (the Python interpreter itself), and all the other utilities can be small scripts that don't take up much flash memory. Your boss mentions that someone else at the company has already implemented the POSIX uniq command by inspecting the coreutils C source code for uniq and using it as inspiration to write a Python version. Your boss suggests that you try your hand at implementing one of the problematic coreutils utilities to confirm whether this is a feasible approach.

Assignment

Implement the POSIX join utility in Python. You may refer to the coreutils C source code for join, but your goal is not to write a literal translation of the C code: instead, it is to come up with a Python version that is short, simple, and easily maintainable.

You need not worry about implementing the GNU extensions to POSIX join. It is enough to implement only the standard behavior required by POSIX.

You need not worry about multibyte characters for this assignment. That is, your implementation can assume that all command-line arguments and all data are in 7-bit ASCII. Also, you may assume that each input file is a valid text file; that is, if it is nonempty then it ends in a newline character.

To turn in your assignment, submit a file join.py containing your modified version of uniq. The first line of join.py should be a comment containing your name and student ID.

Make sure that your code works with the Python 2.4b2 installation on SEASnet. The shell command python -V should output the string Python 2.4b2. If it doesn't, please prepend the directory /usr/local/python-2.4b2/bin to your PATH.

Examples

When in doubt about what your implementation of python join.py should do, just see what the standard join command does. The following examples use python join.py, but they would generate the same output if you used plain join instead.

On SEASnet the shell commands:

sort -t : -k 1 /etc/passwd >p
ls /var/mail | python join.py -t : -o 2.4,2.5,2.7 - p | (head; echo ...; tail)

should output text that looks something like this (the exact data may be different, depending on the current state of SEASnet):

132:Aaron Takeo Aoyama:/bin/csh
130:Abhishake Pathak:/usr/local/bin/tcsh
130:Luis Antonio  Jr Aceves:/bin/csh
132:Adam John Vogel:/bin/csh
130:Adil Aijaz:/bin/csh
130:Aditya Pashupati Advani:/bin/csh
130:Adriana Magana:/usr/local/bin/bash
130:Fred Agourian:/bin/csh
132:Jonathan Ahmadi:/bin/csh
131:Andrew Sung-Joon Ahn:/bin/csh
...
130:Zahra Ghofraniha:/bin/csh
130:Victor Javier Zaragoza:/bin/csh
131:Zeeshawn Syed Shameem:/bin/csh
131:Michael Zeltser:/bin/csh
131:Cai Zeng:/bin/csh
130:Zeya Myint:/bin/csh
131:Chong Da Zhao:/bin/csh
130:Derek Zhong-Dirk Chu:/bin/csh
130:Xin Kevin Zhuang:/bin/csh
132:Daniel T Zing:/bin/csh

Similarly, the commands:

sort -t : -k 2 /usr/local/lib/seasnet/xxlist >x
sort -t : -k 1 /etc/group | python join.py -t : -2 2 -o 0,1.3,2.5 - x | head

should output text that looks like this:

bm101:651:Daniel Kamei
bm101:651:Daniel Kamei
bm202:739:Chien-Pin Sun
bm202:739:Chien-Pin Sun
bm298:808:Toshikazu Hamasaki
bm298:808:Toshikazu Hamasaki
ce108:407:Hu Wei
ce137:276:Kuo-Yao Yuan
ce137l:317:Leonardo Sanchez
ce150:309:Steven Margulis

Notes

Thanks to Albert Chern for these questions.


© 2004 Paul Eggert. See copying rules.
$Id: hw6.html,v 1.5 2004/11/19 19:31:11 eggert Exp $