Aims

This assignment aims to give you practice in

web text mining and text data manipulation using PHP

The goal is to collect data from various sources and arrange it in a form that can be used to drive a web-based course evaluation system (in Assignment 2).

Summary

Submission command: give cs2041 ass1 students.php staff.php courses.php
This assignment must be submitted before midnight on Wednesday 22nd August.
It is to be implemented and submitted individually.
It contributes 8 marks toward your total mark for this course.
Late penalty: the ceiling mark is reduced by 0.5% per hour late.

How to do this assignment:

read this specification carefully and completely before starting
familiarise yourself with the formats/content of the data files
make a private directory in which to place your solutions

Introduction

UNSW finally introduced on-line evaluation of courses in the second semester of 2006. This was done in the context of a course evaluation and feedback process called CATEI (Course and Teaching Evaluation and Improvement). The CATEI website provides a number of interfaces related to the setting up, completion and analysis of course and teaching evaluation.

In the setting up phase, course convenors (otherwise known as "lecturers in charge") activate one or more evaluations for their course. The possible kinds of evaluations:

Form A: course evaluation
Form B: teacher evaluation for large class teaching ("large" means > 30 students)
Form C: teacher evaluation for small class teaching ("small" means ≤ 30 students)
Form D: teacher evaluation for design studios

A course would typically run a Form A evaluation for the course as a whole, and then Form B/C/D evaluations as appropriate, depending on the kinds of classes run in the course.

Unfortunately, the process of activating evaluations provided on the current CATEI website is extremely tedious and, in fact, discourages staff from setting up evaluations. To make sure that course evaluations are done, we have decided to automate the process of setting up evaluations, so that all courses are evaluated. This requires us to automatically collect all of the data that the CATEI website currently collects from the convenor. This data is spread around the UNSW web site, and so the goal of this assignment is to write some PHP scripts that can extract this information and integrate it into a usable format.

What we need:

information about staff (staff id, name, phone, email, affiliation)
information about student enrolment (student id, course code)
information about courses (course code, convenor, enrolments, class types)

With this information, we should be able to set up appropriate evaluations for all courses. E.g. if a course has 200 students and runs lecture classes, we'd schedule a Form A and Form B evaluation for it, with the Form B being tied to the course convenor.

Exercises

1. students.php (1 mark)

The script for the first task aims to collect enrolment information from the download files that we receive each night from the NSS (myUNSW) system. We have a collection of (randomised) enrolment files for the current semester in the directory:

/home/cs2041/web/07s2/ass/1/enrolments

The data files in this directory contain enrolment data for all BINF, COMP, ENGG, and SENG courses. Each line in these files represents the enrolment of one student in one course. Some sample lines from the 07s2_COMP file:

COMP4121|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M
COMP4181|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M
COMP9315|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M
COMP4181|3143372|Fenton, Shing Wiratama                  |3648/4|SENGA1       |07s2|19811209|M
COMP9315|3143372|Fenton, Shing Wiratama                  |3648/4|SENGA1       |07s2|19811209|M
COMP1921|3114178|Fernandes, Joseph Rohanraj Ki           |3978/2|COMPA1       |07s2|19860326|M
COMP2041|3114178|Fernandes, Joseph Rohanraj Ki           |3978/2|COMPA1       |07s2|19860326|M
COMP3222|3114178|Fernandes, Joseph Rohanraj Ki           |3978/2|COMPA1       |07s2|19860326|M
COMP3511|3114178|Fernandes, Joseph Rohanraj Ki           |3978/2|COMPA1       |07s2|19860326|M
COMP9031|3183896|Fernandez, Irwin                        |8682  |COMPCS       |07s2|19771123|M
COMP9101|3183896|Fernandez, Irwin                        |8682  |COMPCS       |07s2|19771123|M
COMP9311|3183896|Fernandez, Irwin                        |8682  |COMPCS       |07s2|19771123|M
COMP9336|3183896|Fernandez, Irwin                        |8682  |COMPCS       |07s2|19771123|M
COMP2121|3186917|Fernandez, Kah Ben                      |3715/2|COMPB1 UNDL-U|07s2|19880808|M

The files are sorted on student name, and each line contains the following fields:

course code (e.g. COMP2041)
student ID (e.g. 3183896)
student name (e.g. Fernandes, Joseph Rohanraj Ki)
program code and stage (e.g. 3978/2)
plan codes (e.g. COMPA1 or UNDL-U)
session (always 07s2 for this enrolment data)
date of birth (e.g. 19771123, in format YYYYMMDD)
gender (M or F)

Exercise: write a PHP script called students.php which scans all of the enrolment files and produces a list of student enrolments on its standard output. Each line of the output should contain a course code and a student id, separated by a single tab character. The output should be sorted by course code, and by student id within each course code.

Hints: The output of the students.php script should be identical to what you would obtain if you changed into the directory containing the enrolment data and ran the following pipeline:

cat 07s2_???? | awk -F'|' '{print $1"\t"$2}' | sort -k1 -k2

except, of course, that it is achieved by a single PHP script. (P.S. It is not an acceptable solution to write a PHP script that simply invokes the above shell pipeline; the goal is for you to manipulate the data using PHP arrays.)
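As a rough illustration of the array-based approach (a sketch only, not the supplied template), the core of the task can be written as a function that takes enrolment lines and returns sorted "course TAB id" strings. The function name enrolmentPairs() is made up for this example, and the sample lines are copied from the 07s2_COMP excerpt above; reading the real files from the enrolments directory is left to the template.

```php
<?php
// Turn raw enrolment lines into sorted "COURSE<TAB>STUDENTID" strings.
// enrolmentPairs() is a hypothetical helper name, not part of the template.
function enrolmentPairs(array $lines): array
{
    $pairs = [];
    foreach ($lines as $line) {
        $fields = explode('|', rtrim($line, "\r\n"));
        if (count($fields) < 2) {
            continue; // skip blank or malformed lines
        }
        // Field 0 is the course code, field 1 is the student id.
        $pairs[] = $fields[0] . "\t" . $fields[1];
    }
    // Plain string sort: orders by course code, then by student id.
    sort($pairs, SORT_STRING);
    return $pairs;
}

// Sample lines taken from the excerpt of the 07s2_COMP file.
$sample = [
    'COMP9315|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M',
    'COMP4121|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M',
    'COMP4181|3143372|Fenton, Shing Wiratama                  |3648/4|SENGA1       |07s2|19811209|M',
    'COMP2041|3114178|Fernandes, Joseph Rohanraj Ki           |3978/2|COMPA1       |07s2|19860326|M',
];

foreach (enrolmentPairs($sample) as $pair) {
    echo $pair, "\n";
}
```

Sorting the combined strings works here because the tab character sorts before any letter or digit. Whether this matches the ordering of the shell pipeline exactly should still be confirmed with the diff check described below.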

Sample output for this task is available in the file

/home/cs2041/web/07s2/ass/1/enrolments/expected

# the first ten lines of the above file ...

BINF1001	3160311
BINF1001	3207818
BINF1001	3207877
BINF1001	3209887
BINF1001	3219152
BINF1001	3220978
BINF1001	3225731
BINF1001	3229158
BINF1001	3245507
BINF1001	3248375

If you save the output from your students.php script into a file (let's call it myoutput), you could check whether it's correct via the following command:

diff myoutput /home/cs2041/web/07s2/ass/1/enrolments/expected

If there is no output from this command, your script is generating the correct output.

A partially-completed template for this script is available. The template shows how you should open the enrolment data files. You should save this template to your assignment directory. If it's saved with the name students.php.txt, make sure that you change its name to students.php before you start playing with it. If you want to copy the data files to your home computer and work there, you'll need to change the definition of the $base variable to match local conditions (e.g. if you put the script in the same directory as the data files, you could set $base to the value ".").

2. staff.php (3 marks)

The script for this second task aims to collect information about UNSW staff members from the UNSW Online Directory. In order to save "wear and tear" on the real UNSW Online Directory, we have collected some pages from that site and placed modified versions under the COMP2041 web directories. (Since the data is publicly available anyway, there is no privacy issue with making this data available.)

The "top level" of our copy of the Online Directory is a full list of all UNSW staff, available via:

http://www.cse.unsw.edu.au/~cs2041/07s2/ass/1/staff.html

This page contains simply a list of staff names, roughly in alphabetical order, where each name is a link to a page giving details of the staff member. The pages giving individual details contain the information that we wish to collect, except for the staff IDs. You should examine the staff list and individual staff pages via a Web browser to get a feeling for the kind of data that they contain.

Extracting data from these pages is not as simple as it was for the structured (column-based) files in Task 1. These web pages are designed for viewing, not as a data repository, and so they contain large amounts of HTML code to describe the appearance of the data. However, they do contain the data and, fortunately for us, the pages were produced automatically from a database (which we don't have access to) and so the HTML has a (mostly) regular structure.

A side note: the data in these pages shows the dangers of allowing users to enter arbitrary text values for data items that have a well-defined set of possible values. A simple example of this is the variation within email addresses. The following file contains some versions of unsw.edu.au from the staff pages:

/home/cs2041/web/07s2/ass/1/emails

The numbers are the frequency of occurrence of each variation.

It would be useful to normalise these while we are processing the data, but doing this properly requires too much hacking, and so it's not required for this exercise. However, you should at least map email addresses to all lower case to remove some variation.
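The lower-casing step is a one-liner. The sketch below wraps it in a function whose name, normaliseEmail(), is invented for this example; trimming surrounding whitespace is an extra assumption, not something the specification asks for. The mixed-case input is an illustrative variation, not taken from the emails file.

```php
<?php
// Map an email address to lower case (hypothetical helper name).
// The trim() call is an assumption: it removes stray whitespace that
// may surround the address once it has been extracted from the HTML.
function normaliseEmail(string $email): string
{
    return strtolower(trim($email));
}

echo normaliseEmail(' D.Aarons@UNSW.EDU.AU '), "\n";
```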

A similar problem existed with titles. The file:

/home/cs2041/web/07s2/ass/1/titles

shows some of the variations in the spelling of titles before I cleaned them up. Note that the first line indicates that 117 people have no title specified.

Exercise: write a PHP script called staff.php which scans all of the individual staff web pages and produces a list of staff data on its standard output. Each line of the output should contain the staff member's ID, their full name, their title, their email address, their phone number (maybe just an extension), and the organisational unit with which they are affiliated. The components should occur in the order specified, and should be separated by a single tab character. The output should be sorted by name (i.e. using the order that staff members appear in the staff list). There should be no HTML tags in any of the strings in the output (you can use the PHP strip_tags() library function to ensure this).

Note that the pages we have supplied have no reference to staff ids. The directory uses its own internal ids to distinguish staff members (hence URLs like .../staff/50403612.html). Some of these ids are 7 digits long, others are 8 digits long. In fact, the real Online Directory provides no information about real staff ids, so we will generate fake staff ids by using the directory ids and modifying them to produce a unique 7-digit number for each staff member. The template file (see below) provides a function called idToStaffId() for this; the argument is one of the 7-digit or 8-digit directory ids, and the result is a fake staff id.

The organisation unit names in the directory pages have some quirks. For example the School of Computer Science and Engineering appears as "Computer Science and Engineering, School of". The following list indicates all of the strange structures in organisation names that need to be transformed into something more normal:

", School of"
", Department of"
", Graduate School of"
", Institute of"
", UNESCO Centre for"
", The Centre for"
", Centre for"
", The"

Also, you should apply the PHP library function html_entity_decode() to the name in order to map HTML special notation into normal characters (e.g. map "&amp;" to "&").
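One possible way to handle the trailing phrases listed above is sketched here. It assumes that "something more normal" means moving the trailing phrase to the front, as in the School of Computer Science and Engineering example; the function name normaliseOrgName() is invented for this sketch. Compare your results against the expected file, since it is the final authority on what each name should look like.

```php
<?php
// Move a trailing ", School of"-style phrase to the front of the name
// and decode HTML entities. normaliseOrgName() is a hypothetical name.
function normaliseOrgName(string $raw): string
{
    // The trailing phrases listed in the specification.
    $suffixes = [
        ', School of',
        ', Department of',
        ', Graduate School of',
        ', Institute of',
        ', UNESCO Centre for',
        ', The Centre for',
        ', Centre for',
        ', The',
    ];

    $name = trim(html_entity_decode($raw));

    foreach ($suffixes as $suffix) {
        $len = strlen($suffix);
        if (strlen($name) > $len && substr($name, -$len) === $suffix) {
            $prefix = substr($suffix, 2);               // drop the leading ", "
            $rest   = substr($name, 0, strlen($name) - $len);
            return $prefix . ' ' . $rest;
        }
    }
    return $name; // nothing to transform
}

echo normaliseOrgName('Computer Science and Engineering, School of'), "\n";
```

Because every listed phrase begins with a comma and a space, none of them can be mistaken for the tail of a longer one, so the order of the list does not matter.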

Some directory pages have missing components. If the organisation or email or phone values are not in the page, then simply treat them as empty. If the name is missing from the directory page, then use the name that appears in the full staff list. If the title is missing, and if it is available in the full staff list, use the one from there. Since the staff id is generated from the directory id (which must be available so that you can read the directory page), this will never be missing. Summary: every entry in the full staff list should generate a line in the output which contains at least a staff id and a name. If you think that the expected file contains "incorrect" output, please let me know via the MessageBoard.

Hints: starting from the full staff list page, you should use the links contained there to visit all of the pages for individual staff members. To process the pages for individual staff members, you will need to work out cues in the HTML to identify where particular pieces of information occur. Some information may be missing; in that case, simply have an empty field (i.e. two adjacent tabs in the output).
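To illustrate the idea of using "cues in the HTML", here is a sketch that pulls links out of a fragment of HTML with preg_match_all(). The markup in $html is entirely hypothetical (you must inspect the real staff list to see how its links are written), as are the function name extractStaffLinks() and the placeholder names. Only the shape of the URL (.../staff/50403612.html) comes from this specification.

```php
<?php
// Extract (url, directory id, link text) triples from anchors whose href
// ends in a 7- or 8-digit id followed by ".html".
// The pattern is an assumption about the markup, not a description of it.
function extractStaffLinks(string $html): array
{
    $entries = [];
    $pattern = '~<a\s+href="([^"]*?(\d{7,8})\.html)"[^>]*>(.*?)</a>~is';

    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $entries[] = [
                'url'   => $m[1],
                'dirId' => $m[2],
                // Remove any nested tags, then decode entities.
                'name'  => trim(html_entity_decode(strip_tags($m[3]))),
            ];
        }
    }
    return $entries;
}

// Hypothetical fragment; the names and the second id are placeholders.
$html = '<li><a href="staff/50403612.html"><b>First</b> Example</a></li>'
      . '<li><a href="staff/1234567.html">Second Example</a></li>';

foreach (extractStaffLinks($html) as $entry) {
    echo $entry['dirId'], "\t", $entry['name'], "\n";
}
```

In the real script the HTML would come from the staff list page (the template shows how to open it), and each extracted URL would then be fetched and searched in a similar way for the title, email, phone and organisation.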

Sample output for this task is available in the file:

/home/cs2041/web/07s2/ass/1/staff/expected

# the first ten lines of the above file ...

5341651	Debra Aarons	Dr	d.aarons@unsw.edu.au	53468	Linguistics Department
5404154	Peter Abakumoft	Mr	p.abakumoff@unsw.edu.au	0412 689 989	Banking and Finance
5080388	Hussein Abbass	Dr	h.abbass@adfa.edu.au	88158	School of Information Technology and Electrical Engineering-ADFA
5351683	Sofia Abdallah	Mrs	sofia@unsw.edu.au	54966	Environmental Studies, Institute of
5036113	Adam Abdool	Mr	a.abdool@unsw.edu.au	52102	School of Biotechnology and Biomolecular Sciences
5080359	Julian Abel	Dr	rjabel@unsw.edu.au	57091	School of Mathematics & Statistics
8000100	David Abello	Mr	d.abello@unsw.edu.au	57831	Social Policy Research Centre
7000007	Armin Aberle	Prof	a.aberle@unsw.edu.au	54031	School of Photovoltaic and Renewable Engineering
5403612	Samanthi Abeywardana	Mrs	samanthi@unsw.edu.au	21014	School of Medical Sciences
5392373	Tony Ablong	Mr	t.ablong@adfa.edu.au	88147	Information Communication and Technology Services

Warning: this file is around 550KB long and takes at least 40 seconds to generate.

A partially-completed template for this script is available. The template shows how you should open the top level staff directory. You should save this template to your assignment directory. If it's saved with the name staff.php.txt, make sure that you change its name to staff.php before you start playing with it.

3. courses.php (4 marks)

The script for this third task aims to collect information about UNSW courses from the online timetable system, to build up a collection of data to drive appropriate on-line course evaluation, with no manual setup required. Since we are collecting this information from a live web site, we will not collect data for every course at UNSW, but only for the courses mentioned in the NSS downloads from exercise 1.

Details about courses and classes for each course offering at UNSW are available via the UNSW timetable site:

http://www.timetable.unsw.edu.au/2007/

The above URL is presumably not intended to be accessed directly, since it simply gives a long directory listing of the HTML files for individual courses. Determining a URL to get timetable information for a course is easy. If the course is e.g. COMP2041, then the URL for its timetable page is:

http://www.timetable.unsw.edu.au/2007/COMP2041.html
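Building the URL from a course code is simple string concatenation, as the following sketch shows. The function name timetableUrl() and its $year parameter are invented for this example; upper-casing and trimming the code are defensive assumptions.

```php
<?php
// Build the timetable page URL for a course code (hypothetical helper).
function timetableUrl(string $courseCode, string $year = '2007'): string
{
    $code = strtoupper(trim($courseCode));
    return 'http://www.timetable.unsw.edu.au/' . $year . '/' . $code . '.html';
}

echo timetableUrl('COMP2041'), "\n";
// prints http://www.timetable.unsw.edu.au/2007/COMP2041.html
```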

Note that this contains information about all offerings of the course in 2007. For the purposes of this exercise, we are interested only in the semester two offering.

Exercise: write a PHP script called courses.php that scans the enrolment files from exercise 1 and the timetable pages for all courses mentioned in the enrolment files, and extracts for each course: the course code, the name of the course convenor for the 07s2 offering, the number of students enrolled in 07s2, and the types of classes offered in 07s2. The script should write one line for each course, where the code, convenor, enrolment count and class types are separated from each other by a single tab character. For the convenor, use the "Staff Contact" field in the timetable page rather than the "Instructor" field.

The first ten lines of the output should look like:

BINF1001	Mr BA Gaeta	26	Laboratory,Lecture
BINF2001	Dr ME Bain	14	Laboratory,Lecture
BINF3001	Mr BA Gaeta	29	Laboratory,Lecture,Tutorial-Laboratory
BINF4910	School Office	6	Thesis Research
BINF4911	School Office	5	Thesis Research
COMP1081	Dr GR Whale	21	Laboratory,Lecture
COMP1091	Dr AD Blair	7	Lecture,Tutorial-Laboratory
COMP1911	Dr AD Blair	241	Lecture,Tutorial-Laboratory
COMP1921	Dr M Pagnucco	323	Lecture,Tutorial-Laboratory
COMP2041	Dr JA Shepherd	202	Lecture,Tutorial-Laboratory

Note that class types should be given in alphabetical order (which is the order they appear in the timetable page), and should be comma-separated. Some classes include a sequence (e.g. "Lecture Sequence 1 of 2"); the sequence data should simply be dropped, as in the BINF[123]001 courses.
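The clean-up described above might look something like the following sketch. It assumes you have already extracted the raw class type strings for the 07s2 offering into an array; the function name classTypeList() is invented, and listing each type only once is an assumption based on the sample output.

```php
<?php
// Turn raw class type strings into a sorted, comma-separated list.
// classTypeList() is a hypothetical helper name.
function classTypeList(array $rawTypes): string
{
    $seen = [];
    foreach ($rawTypes as $raw) {
        // Drop a trailing "Sequence N of M" marker, if there is one.
        $type = trim(preg_replace('/\s+Sequence\s+\d+\s+of\s+\d+\s*$/i', '', $raw));
        if ($type !== '') {
            $seen[$type] = true; // array keys remove duplicates
        }
    }
    $types = array_keys($seen);
    sort($types, SORT_STRING);   // alphabetical order
    return implode(',', $types);
}

echo classTypeList([
    'Laboratory',
    'Lecture Sequence 1 of 2',
    'Lecture Sequence 2 of 2',
    'Tutorial-Laboratory',
]), "\n";
// prints Laboratory,Lecture,Tutorial-Laboratory
```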

You should ignore the enrolment numbers in the pages under

http://www.timetable.unsw.edu.au/2007/

For the purposes of our exercise, the "official" enrolment data is contained in the files under

http://www.cse.unsw.edu.au/~cs2041/07s2/ass/1/enrolments/

No template is provided for this exercise. Use your students.php script as the basis for your courses.php script.
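For example, the enrolment count for each course can be obtained with the same line-splitting used in students.php, but accumulating into an array keyed by course code. This is only a sketch: the function name enrolmentCounts() is invented, and it assumes that every line in the enrolment files counts as one enrolled student. The sample lines are the first five from the 07s2_COMP excerpt in exercise 1.

```php
<?php
// Count enrolment lines per course code (hypothetical helper name).
function enrolmentCounts(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        $fields = explode('|', $line);
        if (count($fields) < 2 || $fields[0] === '') {
            continue; // skip blank or malformed lines
        }
        $course = $fields[0];
        $counts[$course] = ($counts[$course] ?? 0) + 1;
    }
    ksort($counts, SORT_STRING); // order by course code
    return $counts;
}

$sample = [
    'COMP4121|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M',
    'COMP4181|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M',
    'COMP9315|3109977|Fenton, Patrick                         |3978/3|COMPK1       |07s2|19770609|M',
    'COMP4181|3143372|Fenton, Shing Wiratama                  |3648/4|SENGA1       |07s2|19811209|M',
    'COMP9315|3143372|Fenton, Shing Wiratama                  |3648/4|SENGA1       |07s2|19811209|M',
];

foreach (enrolmentCounts($sample) as $course => $count) {
    echo $course, "\t", $count, "\n";
}
```

The keys of the resulting array also give you the list of courses whose timetable pages need to be visited.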

4. (Optional) Challenge (0 marks)

Exercise: Write a new script courses1.php that produces the same output as courses.php, except that it also includes the staff id of the course convenor, or "???" if the staff member cannot be recognised. The staff id should become the second column. You should try to minimise the number of "???" staff ids.

Note that you can't submit this using give. If you actually attempt it, email the solution directly to jas@cse.unsw.edu.au

Submission/Testing

Submit this assignment via the command: give cs2041 ass1 students.php staff.php courses.php

You must ensure that your .php files have no syntax errors. If I need to manually fix problems with your PHP code in order to run the testing, you will incur a 2 mark penalty.