...

The Java code referenced on this page is publicly available, and currently (June 2014) resides in the MATSim playground southafrica. As with all MATSim code, it is released under the GNU General Public License as published by the Free Software Foundation. Refer to the MATSim website on how to get the code.

...

The Census data in South Africa is quite rich. On the one hand we have the Community Profile data, which is aggregated to a certain zonal level. You might know how many people in a subplace fall within a certain age group; how many males and females there are in the subplace; or how many households there are living in informal backyard dwellings. But the data remains aggregate at the subplace level. You do not know if a particular person, say Joe, is a working Black African man, aged 35, in a household with 6 members. And this is the level of detail we need for a MATSim population.

...

For each study area we extract two tables from the Community Profile data. The first is taken from the Dwellings database at subplace level and is used as control totals for households. Each row represents a subplace, and the columns/fields represent household income class codes. For example, here are the household control totals for the first two subplaces of Gauteng.

...

SP_CODE     i1    i2    i3    i4    i5    i6    i7    i8    i9    i10   i11   i12
760001001   717   256   336   554   764   880   741   409   92    9     2     6
760002001   472   164   279   485   669   559   373   145   28    3     1     3
...         ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

 

We see that there is a total of 4766 households (row sum) in subplace 760001001, which is Stretford in the Orange Farm township of Sedibeng, south of Johannesburg. For the individuals we use race and gender as control totals, and this data was taken from the Family database of the Community Profile. Again, using Gauteng as an example, here are the individual control totals for the first two subplaces.

...

Here we see that there are 17,381 individuals in the Stretford subplace, giving us an average of 3.65 persons per household. Each demographic code is made up of two digits. The first represents the population group: Black African (1); Coloured (2); Indian/Asian (3) and White (4). The second is gender: Male (1) and Female (2). The two tables were then joined to form a single table that contained both household and individual control totals.
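As a quick check of the arithmetic above, the following sketch sums the household control totals for Stretford, derives the average household size, and decodes a two-digit demographic code. The names `decode_demographic_code`, `POPULATION_GROUP` and `GENDER` are illustrative assumptions and do not come from the MATSim code.

```python
# Household control totals for subplace 760001001 (income classes i1..i12),
# copied from the household control totals table.
household_totals = [717, 256, 336, 554, 764, 880, 741, 409, 92, 9, 2, 6]
individuals = 17381  # persons in the subplace, as stated in the text

households = sum(household_totals)
average_household_size = individuals / households

# Lookup tables for the two digits of a demographic code (names are assumptions).
POPULATION_GROUP = {"1": "Black African", "2": "Coloured", "3": "Indian/Asian", "4": "White"}
GENDER = {"1": "Male", "2": "Female"}


def decode_demographic_code(code):
    """Split a two-digit demographic code into (population group, gender)."""
    code = str(code)
    if len(code) != 2:
        raise ValueError("expected a two-digit code, got %r" % code)
    return POPULATION_GROUP[code[0]], GENDER[code[1]]


print(households)
print(round(average_household_size, 2))
print(decode_demographic_code(12))
```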

Reference sample

We started by parsing the complete 10% sample into a MATSim population, using the class playground.southafrica.population.census2011.Census2011SampleParser. Then, for each of the study areas, we extracted those records that fell within the district of the study area, using the class playground.southafrica.population.census2011.DistrictPopulationExtractor2011 in MATSim. Finally, the population was written to a tab-delimited text file using playground.southafrica.population.census2011.IpfWriter2011. The code of those classes is well documented, and the interested reader may find more detail there if required. Here is an excerpt of the first few records of the reference sample for eThekwini:

...

Here we see the records of two households. The field HHNR represents the household's unique number; PNR is the unique person number; HHS is the number of household members; HT and MDT are the housing and main dwelling type, respectively; POP is the population group of the individual; INC is the income class for the household; PNRHH is the individual's person number within the household; AGE is the individual's age; GENDER is the individual's gender (1 represents male, and 2 represents female); REL is the individual's role in the household; EMPL is the current employment status; and SCH is the current level of school being attended.
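To illustrate how such a file might be consumed, here is a minimal sketch that reads tab-delimited records and groups them by household. It assumes the file has a header row with the field names listed above; the column order and all values in `SAMPLE` are made up for illustration and are not taken from the actual reference sample. The function names are hypothetical too.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt with two households; the values are invented.
SAMPLE = (
    "HHNR\tPNR\tHHS\tHT\tMDT\tPOP\tINC\tPNRHH\tAGE\tGENDER\tREL\tEMPL\tSCH\n"
    "1\t1\t2\t1\t1\t1\t5\t1\t41\t1\t1\t1\t0\n"
    "1\t2\t2\t1\t1\t1\t5\t2\t38\t2\t2\t1\t0\n"
    "2\t3\t1\t1\t1\t4\t7\t1\t63\t2\t1\t2\t0\n"
)


def read_households(handle):
    """Group the person records of a tab-delimited file by HHNR."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["HHNR"]].append(row)
    return dict(grouped)


def inconsistent_households(grouped):
    """Return household numbers whose record count disagrees with HHS."""
    return [
        hhnr
        for hhnr, members in grouped.items()
        if any(int(member["HHS"]) != len(members) for member in members)
    ]


households = read_households(io.StringIO(SAMPLE))
for hhnr, members in households.items():
    print(hhnr, [member["PNR"] for member in members])
print(inconsistent_households(households))
```

Checking HHS against the number of records per household is a cheap way to spot households that were split when records were extracted by district.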

...

The entropy-maximisation algorithm of Kirill Müller's MLIPF implementation was used. The implementation is in R (R Core Team, 2014), and we provide the multi-threaded script we use here, with the associated config.xml file. The script should be run from the command line, as it takes four arguments, in this order:

  1. the working directory where the script will find a folder named ../data2010/ in which the control totals and reference samples for the specific area are found;
  2. the area name on which to execute the MLIPF;
  3. the number of cores to use. The script was implemented making use of parallelisation since this is a fairly computationally burdensome exercise; and
  4. a boolean argument indicating if only the first zone should be estimated (TRUE), or all of the zones in the control totals file (FALSE). This was intended for debugging purposes, but you might find it useful for experimenting.
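For readers unfamiliar with the fitting step, the sketch below shows plain, single-level iterative proportional fitting in Python. It is not the multi-level entropy-maximisation algorithm that the R script runs, and it is not Müller's implementation; it only illustrates the basic idea of reweighting reference-sample records until their weighted totals match the control totals. The function names, record layout and toy numbers are all assumptions.

```python
from collections import defaultdict


def weighted_totals(records, weights, dimension):
    """Sum the weights of the records per category of one dimension."""
    totals = defaultdict(float)
    for weight, record in zip(weights, records):
        totals[record[dimension]] += weight
    return totals


def fit_weights(records, controls, max_iterations=100, tolerance=1e-9):
    """Return (weights, residual) after single-level IPF.

    records:  list of dicts mapping dimension name -> category
    controls: dict mapping dimension name -> {category: target total}
    """
    weights = [1.0] * len(records)
    residual = float("inf")
    for _ in range(max_iterations):
        # Scale the weights towards each dimension's control totals in turn.
        for dimension, targets in controls.items():
            totals = weighted_totals(records, weights, dimension)
            weights = [
                w * targets[r[dimension]] / totals[r[dimension]]
                if totals[r[dimension]] > 0 else 0.0
                for w, r in zip(weights, records)
            ]
        # Residual: largest absolute gap between fitted and target totals.
        residual = max(
            abs(weighted_totals(records, weights, d).get(c, 0.0) - t)
            for d, targets in controls.items()
            for c, t in targets.items()
        )
        if residual < tolerance:
            break
    return weights, residual


# Toy reference sample with two dimensions (invented numbers).
records = [
    {"gender": "1", "income": "low"},
    {"gender": "1", "income": "high"},
    {"gender": "2", "income": "low"},
    {"gender": "2", "income": "high"},
]
controls = {
    "gender": {"1": 60, "2": 40},
    "income": {"low": 70, "high": 30},
}
weights, residual = fit_weights(records, controls)
print(weights, residual)
```

The returned residual plays the same role as a final or best residual in a convergence report: a value near zero means the weighted sample reproduces every control total.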

The output is a folder containing two files per zone: one containing the final weights for each record in the reference sample, and one containing a simulated population drawn from the reference sample using those weights. We retained the weights since one may wish to generate multiple populations for evaluation purposes in future. A single, three-column file is also written out. The first field shows the subplace code, the second field indicates whether the entropy-maximisation converged successfully or not, and the third field provides the final/best residual.
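Because the weights are retained, further populations can be generated without refitting. The sketch below shows one way this could be done, by drawing records with replacement in proportion to their weights. How the R script actually draws its simulated population is not described here, so this sampling scheme, the function name `draw_population` and the toy values are assumptions.

```python
import random


def draw_population(records, weights, size, seed=None):
    """Draw `size` records with replacement, proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=size)


# Toy reference records and fitted weights (invented values).
reference_records = ["household-A", "household-B", "household-C"]
fitted_weights = [42.0, 0.0, 18.0]

# Different seeds give different, equally valid simulated populations.
population_1 = draw_population(reference_records, fitted_weights, size=60, seed=1)
population_2 = draw_population(reference_records, fitted_weights, size=60, seed=2)
print(population_1.count("household-A"), population_2.count("household-A"))
```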

...

   <object id="15">
      <attribute name="age" class="java.lang.Integer">39</attribute>
      <attribute name="gender" class="java.lang.String">Male</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="population" class="java.lang.String">Coloured</attribute>
      <attribute name="relationship" class="java.lang.String">Head</attribute>
      <attribute name="school" class="java.lang.String">None</attribute>
   </object>
   <object id="16">
      <attribute name="age" class="java.lang.Integer">35</attribute>
      <attribute name="gender" class="java.lang.String">Female</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="population" class="java.lang.String">Coloured</attribute>
      <attribute name="relationship" class="java.lang.String">Partner</attribute>
      <attribute name="school" class="java.lang.String">None</attribute>
   </object>
   <object id="17">
      <attribute name="age" class="java.lang.Integer">17</attribute>
      <attribute name="gender" class="java.lang.String">Male</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="population" class="java.lang.String">Coloured</attribute>
      <attribute name="relationship" class="java.lang.String">Child</attribute>
      <attribute name="school" class="java.lang.String">School</attribute>
   </object>
   <object id="18">
      <attribute name="age" class="java.lang.Integer">6</attribute>
      <attribute name="gender" class="java.lang.String">Female</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="population" class="java.lang.String">Coloured</attribute>
      <attribute name="relationship" class="java.lang.String">Child</attribute>
      <attribute name="school" class="java.lang.String">School</attribute>
   </object>
   <object id="19">
      <attribute name="age" class="java.lang.Integer">4</attribute>
      <attribute name="gender" class="java.lang.String">Male</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="population" class="java.lang.String">Coloured</attribute>
      <attribute name="relationship" class="java.lang.String">Child</attribute>
      <attribute name="school" class="java.lang.String">NotApplicable</attribute>
   </object>
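The attribute excerpt above can be read back with any XML parser. The following sketch uses Python's standard library to load the attributes per person and group the persons by householdId. The excerpt shows no root element, so the `objectAttributes` wrapper below is an assumption, as are the function names; only the first two objects are repeated to keep the example short.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two objects from the excerpt, wrapped in an assumed root element.
XML_TEXT = """<objectAttributes>
   <object id="15">
      <attribute name="age" class="java.lang.Integer">39</attribute>
      <attribute name="gender" class="java.lang.String">Male</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="relationship" class="java.lang.String">Head</attribute>
   </object>
   <object id="16">
      <attribute name="age" class="java.lang.Integer">35</attribute>
      <attribute name="gender" class="java.lang.String">Female</attribute>
      <attribute name="householdId" class="java.lang.String">4</attribute>
      <attribute name="relationship" class="java.lang.String">Partner</attribute>
   </object>
</objectAttributes>"""


def read_person_attributes(xml_text):
    """Return {person id: {attribute name: value}}, converting integers."""
    persons = {}
    for obj in ET.fromstring(xml_text).iter("object"):
        attributes = {}
        for attribute in obj.iter("attribute"):
            value = attribute.text
            if attribute.get("class") == "java.lang.Integer":
                value = int(value)
            attributes[attribute.get("name")] = value
        persons[obj.get("id")] = attributes
    return persons


def group_by_household(persons):
    """Return {householdId: [person ids]} in document order."""
    grouped = defaultdict(list)
    for person_id, attributes in persons.items():
        grouped[attributes["householdId"]].append(person_id)
    return dict(grouped)


persons = read_person_attributes(XML_TEXT)
households = group_by_household(persons)
print(persons["15"])
print(households)
```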

 

These richly described population files are quite sizeable, so we provide 10% samples for each area here. To ensure that the sampling retains the household structure, we executed the sampling using the class playground.southafrica.population.HouseholdSampler.
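The point of sampling by household is that a selected household keeps all of its members, whereas sampling persons independently would break households apart. The sketch below illustrates the idea in Python; it is not the HouseholdSampler implementation, and the function name, input layout and toy data are assumptions.

```python
import random
from collections import defaultdict


def sample_households(person_to_household, fraction, seed=None):
    """Sample whole households; return {household id: [person ids]}."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for person_id, household_id in person_to_household.items():
        members[household_id].append(person_id)
    household_ids = sorted(members)
    # Draw households, not persons, so every chosen household stays intact.
    sample_size = max(1, round(fraction * len(household_ids)))
    chosen = rng.sample(household_ids, sample_size)
    return {household_id: members[household_id] for household_id in chosen}


# Toy population: 20 households with 3 members each (invented data).
person_to_household = {
    "p%d_%d" % (household, member): "h%d" % household
    for household in range(20)
    for member in range(3)
}
sampled = sample_households(person_to_household, fraction=0.1, seed=7)
print(sampled)
```

Note that a 10% sample of households is only approximately a 10% sample of persons, since household sizes vary.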

...

The second subpopulation provided was not originally within the scope of the Treasury project, but we thought it appropriate to add it anyway. To understand the activity chains (plans) of commercial vehicles, we studied the detailed GPS records of more than 40,000 commercial vehicles, courtesy of Digicore's Ctrack vehicle tracking and fleet management product offering. The details of the study were published by Joubert and Axhausen (2011). Subsequently, the activity chains were used to generate a path-dependent complex network, of which an early version was written up in Joubert and Axhausen (2013). Using specific density-based clustering parameters (radius of 30 metres and a minimum number of points, pmin, of 15), we generated a national population of commercial vehicles that accounts for 10% of all registered commercial vehicles in South Africa. A brief video of the resulting population sample is available on YouTube. Here is a static view showing the national extent and intensity of commercial vehicle activities in South Africa.

...