Your software team at Amalgamated Data Associates has finished validating its scripts against your utility option usage checker. But you now have a new problem: the application doesn't fit into the limited amount of flash memory that is available in the target embedded GNU/Linux system. You don't want to modify the application, since your team has spent a huge sum of money testing and validating it. So instead you look for ways to shrink the size of the underlying GNU/Linux system. You're willing to give up some runtime performance, so long as the overall system fits into flash memory.
The overall system contains several components, and each member of your team is assigned a set of components to put on a memory diet. Your job is to shrink the GNU Core Utilities (coreutils), which implement many POSIX utilities needed by your application. As is typical in such cases, your application uses BusyBox when possible, but BusyBox lacks some of the standard utilities that your application needs, so you've been using coreutils to fill in the gaps, and those coreutils applications are taking up substantial space.
You try recompiling coreutils with various options, but that doesn't get you enough shrinkage.
Your boss then suggests that you rewrite the biggest remaining utilities yourself, from scratch, in Python. The idea is that then you need just one extra executable in flash memory (the Python interpreter itself), and all the other utilities can be small scripts that don't take up much flash memory. Your boss mentions that someone else at the company has already implemented the POSIX uniq command by inspecting the coreutils C source code for uniq and using it as inspiration to write a Python version. Your boss suggests that you try your hand at implementing one of the problematic coreutils utilities to confirm whether this is a feasible approach.
Implement the POSIX join utility in Python. You may refer to the coreutils C source code for join, but your goal is not to write a literal translation of the C code: instead, it is to come up with a Python version that is short, simple, and easily maintainable.
You need not worry about implementing the GNU extensions to POSIX join. It is enough to implement only the standard behavior required by POSIX.
You need not worry about multibyte characters for this assignment. That is, your implementation can assume that all command-line arguments and all data are in 7-bit ASCII. Also, you may assume that each input file is a valid text file; that is, if it is nonempty then it ends in a newline character.
To turn in your assignment, submit a file join.py containing your modified version of uniq. The first line of join.py should be a comment containing your name and student ID.
Make sure that your code works with the Python 2.4b2 installation on SEASnet. The shell command python -V should output the string Python 2.4b2. If it doesn't, please prepend the directory /usr/local/python-2.4b2/bin to your PATH.
When in doubt about what your implementation of python join.py should do, just see what the standard join command does. The following examples use python join.py, but they would generate the same output if you used plain join instead.
On SEASnet the shell commands:
sort -t : -k 1 /etc/passwd >p ls /var/mail | python join.py -t : -o 2.4,2.5,2.7 - p | (head; echo ...; tail)
should output text that looks something like this (the exact data may be different, depending on the current state of SEASnet):
132:Aaron Takeo Aoyama:/bin/csh 130:Abhishake Pathak:/usr/local/bin/tcsh 130:Luis Antonio Jr Aceves:/bin/csh 132:Adam John Vogel:/bin/csh 130:Adil Aijaz:/bin/csh 130:Aditya Pashupati Advani:/bin/csh 130:Adriana Magana:/usr/local/bin/bash 130:Fred Agourian:/bin/csh 132:Jonathan Ahmadi:/bin/csh 131:Andrew Sung-Joon Ahn:/bin/csh ... 130:Zahra Ghofraniha:/bin/csh 130:Victor Javier Zaragoza:/bin/csh 131:Zeeshawn Syed Shameem:/bin/csh 131:Michael Zeltser:/bin/csh 131:Cai Zeng:/bin/csh 130:Zeya Myint:/bin/csh 131:Chong Da Zhao:/bin/csh 130:Derek Zhong-Dirk Chu:/bin/csh 130:Xin Kevin Zhuang:/bin/csh 132:Daniel T Zing:/bin/csh
Similarly, the commands:
sort -t : -k 2 /usr/local/lib/seasnet/xxlist >x sort -t : -k 1 /etc/group | python join.py -t : -2 2 -o 0,1.3,2.5 - x | head
should output text that looks like this:
bm101:651:Daniel Kamei bm101:651:Daniel Kamei bm202:739:Chien-Pin Sun bm202:739:Chien-Pin Sun bm298:808:Toshikazu Hamasaki bm298:808:Toshikazu Hamasaki ce108:407:Hu Wei ce137:276:Kuo-Yao Yuan ce137l:317:Leonardo Sanchez ce150:309:Steven Margulis
Thanks to Albert Chern for these questions.
However, what about the case where -t char is also specified? The spec states that when -t char is specified, every occurrence of char is significant. Wouldn't join then be able to detect an empty field? For example, if file 1 contains a##b and file 2 contains a#c should the command join -t# -eEMPTY 1 2 output a#EMPTY#b#c? That is what Solaris does, but my BSD machine omits the EMPTY.The -e option is only effective when used with -o because, unless specific fields are identified using -o, join is not aware of what fields might be empty.