Assignment 1
give cs2041 ass1 students.php staff.php courses.php
How to do this assignment:
UNSW finally introduced on-line evaluation of courses in second semester of 2006. This was done in the context of a course evaluation and feedback process called CATEI (Course and Teaching Evaluation and Improvement). The CATEI website provides a number of interfaces related to the setting up, completion and analysis of course and teaching evaluation.
In the setting up phase, course convenors (otherwise known as "lecturers in charge") activate one or more evaluations for their course. The possible kinds of evaluations:
Unfortunately, the process of activating evaluations provided on the current CATEI website is extremely tedious and, in fact, discourages staff from setting up evaluations. To make sure that course evaluations are done, we have decided to automate the process of setting up evaluations, so that all courses are evaluated. This requires us to automatically collect all of the data that the CATEI website currently collects from the convenor. This data is spread around the UNSW web site, so the goal of this assignment is to write some PHP scripts that can extract this information and integrate it into a usable format.
What we need:
With this information, we should be able to set up appropriate evaluations for all courses. E.g. if a course has 200 students and runs lecture classes, we'd schedule a Form A and Form B evaluation for it, with the Form B being tied to the course convenor.
The script for the first task aims to collect enrolment information from the download files that we receive each night from the NSS (myUNSW) system. We have a collection of (randomised) enrolment files for the current semester in the directory:
/home/cs2041/web/07s2/ass/1/enrolments
The data files in this directory contain enrolment data for all BINF, COMP, ENGG, and SENG courses. Each line in these files represents the enrolment of one student in one course. Some sample lines from the 07s2_COMP file:
COMP4121|3109977|Fenton, Patrick |3978/3|COMPK1 |07s2|19770609|M
COMP4181|3109977|Fenton, Patrick |3978/3|COMPK1 |07s2|19770609|M
COMP9315|3109977|Fenton, Patrick |3978/3|COMPK1 |07s2|19770609|M
COMP4181|3143372|Fenton, Shing Wiratama |3648/4|SENGA1 |07s2|19811209|M
COMP9315|3143372|Fenton, Shing Wiratama |3648/4|SENGA1 |07s2|19811209|M
COMP1921|3114178|Fernandes, Joseph Rohanraj Ki |3978/2|COMPA1 |07s2|19860326|M
COMP2041|3114178|Fernandes, Joseph Rohanraj Ki |3978/2|COMPA1 |07s2|19860326|M
COMP3222|3114178|Fernandes, Joseph Rohanraj Ki |3978/2|COMPA1 |07s2|19860326|M
COMP3511|3114178|Fernandes, Joseph Rohanraj Ki |3978/2|COMPA1 |07s2|19860326|M
COMP9031|3183896|Fernandez, Irwin |8682 |COMPCS |07s2|19771123|M
COMP9101|3183896|Fernandez, Irwin |8682 |COMPCS |07s2|19771123|M
COMP9311|3183896|Fernandez, Irwin |8682 |COMPCS |07s2|19771123|M
COMP9336|3183896|Fernandez, Irwin |8682 |COMPCS |07s2|19771123|M
COMP2121|3186917|Fernandez, Kah Ben |3715/2|COMPB1 UNDL-U|07s2|19880808|M
The files are sorted on student name, and each line contains the following fields:
Exercise: write a PHP script called students.php which scans all of the enrolment files and produces a list of student enrolments on its standard output. Each line of the output should contain a course code and a student id, separated by a single tab character. The output should be sorted by course code, and by student id within each course code.
Hints: The output of the students.php script should be identical to what you would obtain if you changed into the directory containing the enrolment data and ran the following pipeline:
cat 07s2_???? | awk -F'|' '{print $1"\t"$2}' | sort -k1 -k2
except, of course, that it is achieved by a single PHP script. (P.S. It is not an acceptable solution to write a PHP script that simply invokes the above shell pipeline; the goal is for you to manipulate the data using PHP arrays.)
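As a rough illustration of the array-based approach the hint asks for, the sketch below turns enrolment lines into sorted "course, tab, student id" strings. The function name enrolmentPairs() is hypothetical and is not part of the supplied template; the code assumes a reasonably modern PHP, that the lines have already been read from the data files, and that course codes and student ids have fixed widths, so a plain string sort orders by course and then by id.

```php
<?php
// Sketch only: enrolmentPairs() is a hypothetical helper, not part of the template.
// $lines stands in for the lines read from the 07s2_???? enrolment files.
function enrolmentPairs(array $lines)
{
    $pairs = [];
    foreach ($lines as $line) {
        $fields = explode('|', rtrim($line, "\r\n"));
        if (count($fields) < 2) {
            continue; // skip blank or malformed lines
        }
        // Field 0 is the course code, field 1 is the student id.
        $pairs[] = $fields[0] . "\t" . $fields[1];
    }
    // ASSUMPTION: fixed-width codes and ids, so a byte-wise string sort
    // orders by course code first and student id second.
    sort($pairs, SORT_STRING);
    return $pairs;
}

// Three of the sample lines shown earlier, deliberately out of order.
$sample = [
    "COMP4181|3143372|Fenton, Shing Wiratama |3648/4|SENGA1 |07s2|19811209|M",
    "COMP4181|3109977|Fenton, Patrick |3978/3|COMPK1 |07s2|19770609|M",
    "COMP4121|3109977|Fenton, Patrick |3978/3|COMPK1 |07s2|19770609|M",
];
echo implode("\n", enrolmentPairs($sample)), "\n";
```

Collecting everything in one array and sorting once at the end keeps the logic close to the shell pipeline while staying entirely inside PHP.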
Sample output for this task is available in the file
/home/cs2041/web/07s2/ass/1/enrolments/expected
# the first ten lines of the above file ...
BINF1001 3160311
BINF1001 3207818
BINF1001 3207877
BINF1001 3209887
BINF1001 3219152
BINF1001 3220978
BINF1001 3225731
BINF1001 3229158
BINF1001 3245507
BINF1001 3248375
If you save the output from your students.php script into a file (let's call it myoutput), you could check whether it's correct via the following command:
diff myoutput /home/cs2041/web/07s2/ass/1/enrolments/expected
If there is no output from this command, your script is generating the correct output.
A partially-completed template for this script is available. The template shows how you should open the enrolment data files. You should save this template to your assignment directory. If it's saved with the name students.php.txt, make sure that you change its name to students.php before you start playing with it. If you want to copy the data files to your home computer and work there, you'll need to change the definition of the $base variable to match local conditions (e.g. if you put the script in the same directory as the data files, you could set $base to the value ".").
The script for this second task aims to collect information about UNSW staff members from the UNSW Online Directory. In order to save "wear and tear" on the real UNSW Online Directory, we have collected some pages from that site and placed modified versions under the COMP2041 web directories. (Since the data is publicly available anyway, there are no privacy issues with making this data available.)
The "top level" of our copy of the Online Directory is a full list of all UNSW staff, available via:
http://www.cse.unsw.edu.au/~cs2041/07s2/ass/1/staff.html
This page contains simply a list of staff names, roughly in alphabetical order, where each name is a link to a page giving details of the staff member. The pages giving individual details contain the information that we wish to collect, except for the staff IDs. You should examine the staff list and individual staff pages via a Web browser to get a feeling for the kind of data that they contain.
Extracting data from these pages is not as simple as it was for the structured (column-based) files in Task 1. These web pages are designed for viewing, not as a data repository, and so they contain large amounts of HTML code to describe the appearance of the data. However, they do contain the data and, fortunately for us, the pages were produced automatically from a database (which we don't have access to) and so the HTML has a (mostly) regular structure.
A side note: the data in these pages shows the dangers of allowing users to enter arbitrary text values for data items that have a well-defined set of possible values. A simple example of this is the variation within email addresses. The following file contains some versions of unsw.edu.au from the staff pages:
/home/cs2041/web/07s2/ass/1/emails
The numbers are the frequency of occurrence of each variation.
It would be useful to normalise these while we are processing the data, but doing this properly requires too much hacking, and so it's not required for this exercise. However, you should at least map email addresses to all lower case to remove some variation.
A similar problem existed with titles. The file:
/home/cs2041/web/07s2/ass/1/titles
shows some of the variations in the spelling of titles before I cleaned them up. Note that the first line indicates that 117 people have no title specified.
Exercise: write a PHP script called staff.php which scans all of the individual staff web pages and produces a list of staff data on its standard output. Each line of the output should contain the staff member's ID, their full name, their title, their email address, their phone number (maybe just an extension), and the organisational unit with which they are affiliated. The components should occur in the order specified, and should be separated by a single tab character. The output should be sorted by name (i.e. using the order that staff members appear in the staff list). There should be no HTML tags in any of the strings in the output (you can use the PHP strip_tags() library function to ensure this).
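One possible way to assemble a single output line is sketched below. The function staffLine() and the array keys it uses are hypothetical names chosen for this example; the sketch assumes the six values have already been extracted from a staff page, and it treats any missing value as an empty field.

```php
<?php
// Sketch only: staffLine() and its array keys are hypothetical,
// not part of the supplied template.
function staffLine(array $rec)
{
    // Required column order: id, name, title, email, phone, unit.
    $order = ['id', 'name', 'title', 'email', 'phone', 'unit'];
    $out = [];
    foreach ($order as $key) {
        // A missing value becomes an empty field.
        $value = isset($rec[$key]) ? (string) $rec[$key] : '';
        // Remove any leftover HTML tags and surrounding whitespace.
        $value = trim(strip_tags($value));
        if ($key === 'email') {
            $value = strtolower($value); // remove case variation in emails
        }
        $out[] = $value;
    }
    return implode("\t", $out);
}
```

Calling staffLine() with no 'phone' entry yields two adjacent tabs between the email address and the organisational unit.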
Note that the pages we have supplied have no reference to staff ids. The directory uses its own internal ids to distinguish staff members (hence URLs like .../staff/50403612.html). Some of these ids are 7 digits long, others are 8 digits long. In fact, the real Online Directory provides no information about real staff ids, so we will generate fake staff ids by using the directory ids and modifying them to produce a unique 7-digit number for each staff member. The template file (see below) provides a function called idToStaffId() for this; the argument is one of the 7-digit or 8-digit directory ids, and the result is a fake staff id.
The organisation unit names in the directory pages have some quirks. For example, the School of Computer Science and Engineering appears as "Computer Science and Engineering, School of". The following list indicates all of the strange structures in organisation names that need to be transformed into something more normal:
Also, you should apply the PHP library function html_entity_decode() to the name in order to map HTML special notation into normal characters (e.g. map "&amp;" to "&").
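The two clean-up steps for organisation names could be combined along the following lines. This is a minimal sketch: normaliseUnit() is a hypothetical name, and it handles only the ", School of" pattern quoted in the example above, so any other structures that need transforming would have to be added.

```php
<?php
// Sketch only: normaliseUnit() is a hypothetical helper.
// It handles just the ", School of" suffix quoted in the text.
function normaliseUnit($raw)
{
    // Map HTML entities such as &amp; back to normal characters.
    $name = trim(html_entity_decode($raw));
    // "X, School of" becomes "School of X".
    if (preg_match('/^(.*), School of$/', $name, $m)) {
        $name = 'School of ' . $m[1];
    }
    return $name;
}
```

With this sketch, "Computer Science and Engineering, School of" becomes "School of Computer Science and Engineering", while a name that does not end in ", School of" is returned unchanged apart from entity decoding.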
Hints: starting from the full staff list page, you should use the links contained there to visit all of the pages for individual staff members. To process the pages for individual staff members, you will need to work out cues in the HTML to identify where particular pieces of information occur. Some information may be missing; in that case, simply have an empty field (i.e. two adjacent tabs in the output).
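Following the links can start with something like the sketch below, which pulls directory ids and link targets out of the staff list HTML with a regular expression. The function name staffLinks() is hypothetical, and the assumed link shape (an anchor whose href ends in a 7-digit or 8-digit id followed by .html) is a guess based on the URL pattern mentioned above; check it against the real page source before relying on it.

```php
<?php
// Sketch only: staffLinks() is a hypothetical helper.
// ASSUMPTION: each staff link looks like <a href="staff/50403612.html">Name</a>.
function staffLinks($html)
{
    $links = [];
    $pattern = '/<a\s+href="([^"]*?(\d{7,8})\.html)"[^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'url'   => $m[1],                      // link target as written in the page
                'dirId' => $m[2],                      // 7-digit or 8-digit directory id
                'name'  => trim(strip_tags($m[3])),    // link text without markup
            ];
        }
    }
    return $links; // in page order, which is the required output order
}
```

Because the matches come back in the order they appear in the page, processing them in sequence preserves the order of the staff list.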
Sample output for this task is available in the file:
/home/cs2041/web/07s2/ass/1/staff/expected
# the first ten lines of the above file ...
5341651 Debra Aarons Dr d.aarons@unsw.edu.au 53468 Linguistics Department
5404154 Peter Abakumoft Mr p.abakumoff@unsw.edu.au 0412 689 989 Banking and Finance
5080388 Hussein Abbass Dr h.abbass@adfa.edu.au 88158 School of Information Technology and Electrical Engineering-ADFA
5351683 Sofia Abdallah Mrs sofia@unsw.edu.au 54966 Environmental Studies, Institute of
5036113 Adam Abdool Mr a.abdool@unsw.edu.au 52102 School of Biotechnology and Biomolecular Sciences
5080359 Julian Abel Dr rjabel@unsw.edu.au 57091 School of Mathematics & Statistics
8000100 David Abello Mr d.abello@unsw.edu.au 57831 Social Policy Research Centre
7000007 Armin Aberle Prof a.aberle@unsw.edu.au 54031 School of Photovoltaic and Renewable Engineering
5403612 Samanthi Abeywardana Mrs samanthi@unsw.edu.au 21014 School of Medical Sciences
5392373 Tony Ablong Mr t.ablong@adfa.edu.au 88147 Information Communication and Technology Services
Warning: this file is around 550KB long and takes at least 40 seconds to generate.
A partially-completed template for this script is available. The template shows how you should open the top level staff directory. You should save this template to your assignment directory. If it's saved with the name staff.php.txt, make sure that you change its name to staff.php before you start playing with it.
The script for this third task aims to collect information about UNSW courses from the online timetable system, to build up a collection of data to drive appropriate on-line course evaluation, with no manual setup required. Since we are collecting this information from a live web site, we will not collect data for every course at UNSW, but only for the courses mentioned in the NSS downloads from exercise 1.
Details about courses and classes for each course offering at UNSW are available via the UNSW timetable site:
http://www.timetable.unsw.edu.au/2007/
The above URL is presumably not intended to be accessed directly, since it simply gives a long directory listing of the HTML files for individual courses. Determining a URL to get timetable information for a course is easy. If the course is e.g. COMP2041, then the URL for its timetable page is:
http://www.timetable.unsw.edu.au/2007/COMP2041.html
Note that this contains information about all offerings of the course in 2007. For the purposes of this exercise, we are interested only in the semester two offering.
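The URL rule described above is simple enough to capture in a one-line helper. In this sketch the name timetableUrl() is hypothetical, and the trimming and upper-casing of the course code are defensive additions rather than anything the timetable site is known to require.

```php
<?php
// Sketch only: timetableUrl() is a hypothetical helper.
function timetableUrl($courseCode)
{
    // ASSUMPTION: codes are normalised to upper case with no stray whitespace.
    $code = strtoupper(trim($courseCode));
    return 'http://www.timetable.unsw.edu.au/2007/' . $code . '.html';
}
```

The resulting string can then be handed to whatever mechanism the script uses to fetch pages.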
Exercise: write a PHP script called courses.php that scans the enrolment files from exercise 1 and the timetable pages for all courses mentioned in the enrolment files, and extracts for each course: the course code, the name of the course convenor for the 07s2 offering, the number of students enrolled in 07s2, and the types of classes offered in 07s2. The script should write one line for each course, where the code, convenor, enrolment count and class types are separated from each other by a single tab character. For the convenor, use the "Staff Contact" field in the timetable page rather than the "Instructor" field.
The first ten lines of the output should look like:
BINF1001 Mr BA Gaeta 26 Laboratory,Lecture
BINF2001 Dr ME Bain 14 Laboratory,Lecture
BINF3001 Mr BA Gaeta 29 Laboratory,Lecture,Tutorial-Laboratory
BINF4910 School Office 6 Thesis Research
BINF4911 School Office 5 Thesis Research
COMP1081 Dr GR Whale 21 Laboratory,Lecture
COMP1091 Dr AD Blair 7 Lecture,Tutorial-Laboratory
COMP1911 Dr AD Blair 241 Lecture,Tutorial-Laboratory
COMP1921 Dr M Pagnucco 323 Lecture,Tutorial-Laboratory
COMP2041 Dr JA Shepherd 202 Lecture,Tutorial-Laboratory
Note that class types should be given in alphabetical order (which is the order they appear in the timetable page), and should be comma-separated. Some classes include a sequence (e.g. "Lecture Sequence 1 of 2"); the sequence data should simply be dropped, as in the BINF[123]001 courses.
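The class-type column could be built roughly as follows. This is a sketch under assumptions: classTypes() is a hypothetical name, the raw labels are assumed to have been extracted from the timetable page already, and the same type is assumed to be able to appear more than once (for example, once per class), so duplicates are removed before sorting.

```php
<?php
// Sketch only: classTypes() is a hypothetical helper.
// $rawTypes holds class-type labels already extracted from a timetable page.
function classTypes(array $rawTypes)
{
    $types = [];
    foreach ($rawTypes as $raw) {
        // Drop a trailing "Sequence N of M", e.g. "Lecture Sequence 1 of 2".
        $type = trim(preg_replace('/\s+Sequence\s+\d+\s+of\s+\d+\s*$/i', '', $raw));
        if ($type !== '') {
            $types[$type] = true; // array keys remove duplicates
        }
    }
    $names = array_keys($types);
    sort($names, SORT_STRING); // alphabetical order
    return implode(',', $names);
}
```

For example, the labels "Lecture Sequence 1 of 2", "Lecture Sequence 2 of 2" and "Laboratory" collapse to "Laboratory,Lecture".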
You should ignore the enrolment numbers in the pages under
http://www.timetable.unsw.edu.au/2007/
For the purposes of our exercise, the "official" enrolment data is contained in the files under
http://www.cse.unsw.edu.au/~cs2041/07s2/ass/1/enrolments/
No template is provided for this exercise. Use your students.php script as the basis for your courses.php script.
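The enrolment count for each course can be derived from the same lines that students.php reads. The sketch below uses a hypothetical function name, enrolmentCounts(), and relies on the statement above that each line represents one student's enrolment in one course; sorting the result by course code is an assumption based on the order of the sample output.

```php
<?php
// Sketch only: enrolmentCounts() is a hypothetical helper.
// $lines stands in for the lines read from the enrolment files.
function enrolmentCounts(array $lines)
{
    $counts = [];
    foreach ($lines as $line) {
        $fields = explode('|', rtrim($line, "\r\n"));
        if (count($fields) < 2) {
            continue; // skip blank or malformed lines
        }
        $course = $fields[0];
        // One line is one enrolment, so each line adds one to its course.
        $counts[$course] = isset($counts[$course]) ? $counts[$course] + 1 : 1;
    }
    ksort($counts, SORT_STRING); // ASSUMPTION: output is ordered by course code
    return $counts;
}
```

The keys of the returned array also give the list of courses whose timetable pages need to be visited.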
Exercise: Write a new script courses1.php that produces the same output as courses.php, except that it also includes the staff id of the course convenor, or "???" if the staff member cannot be recognised. The staff id should become the second column. You should try to minimise the number of "???" staff ids.
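One simple matching heuristic is sketched below; it is only a starting point and will miss many cases. The function findStaffId() is hypothetical. It assumes a convenor string ends with initials followed by a surname (as in "Dr JA Shepherd"), that staff names are written given name first and surname last, and that a match requires the surname and the first initial to agree with exactly one staff member.

```php
<?php
// Sketch only: findStaffId() is a hypothetical helper using a crude heuristic.
// $staff maps staff id => full name, e.g. built from the staff.php output.
function findStaffId($convenor, array $staff)
{
    $parts = preg_split('/\s+/', trim($convenor));
    if (count($parts) < 2) {
        return '???';
    }
    // ASSUMPTION: the last word is the surname, the word before it the initials.
    $surname  = strtolower(array_pop($parts));
    $initials = strtolower(array_pop($parts));
    $hits = [];
    foreach ($staff as $id => $name) {
        $words = preg_split('/\s+/', strtolower(trim($name)));
        $last  = array_pop($words);
        $first = $words ? $words[0] : '';
        if ($last === $surname && $first !== '' && $first[0] === $initials[0]) {
            $hits[] = $id;
        }
    }
    // Accept only an unambiguous match.
    return count($hits) === 1 ? (string) $hits[0] : '???';
}
```

Entries such as "School Office" fall through to "???", as do staff whose first initial in the timetable differs from the first letter of their directory name; reducing those misses is where the real work of this exercise lies.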
Note that you can't submit this using give. If you actually attempt it, email the solution directly to jas@cse.unsw.edu.au
Submit this assignment via the command:
give cs2041 ass1 students.php staff.php courses.php
You must ensure that your .php files have no syntax errors. If I need to manually fix problems with your PHP code in order to run the testing, you will be fined via a 2 mark penalty.