# Dynamic mapping with Shapefiles and python

In a two days coding challenge over workers day holiday, I decided to play with maps as means to plot information. What I decided to are use are the widely available *shapefiles* and python *matplotlib* packages, using mainly *pandas* and *numpy* to analyse data.

You can find everything in this git repository.

First of all you can start familiarising with shapefiles by plotting the pure patches, as in function mP in mapPlot.py.

You need these packages:

import shapefile from matplotlib.patches import Polygon import matplotlib import numpy as np

and you start by reading the file `flnm`

in this way:

sf = shapefile.Reader(flnm) shapes = sf.shapes() Nshp = len(shapes)

records all the properties of your map, in particular the shapes attribute (records attribute contains the dataset that varies from file to file) and the length of the shapefile. And you are just need to cycle over the patches to build a Polygon array in matplotlib,

ptchs = [] for nshp in xrange(Nshp): pts = np.array(shapes[nshp].points) prt = shapes[nshp].parts par = list(prt) + [pts.shape[0]] for pij in xrange(len(prt)): ptchs.append(Polygon(pts[par[pij]:par[pij+1]]))

that can be finally plotted in the usual matplotlib way,

fig = plt.figure() ax = fig.add_subplot(111) ax.add_collection(PatchCollection(patches,facecolor='0.75', edgecolor='w', linewidths=.2)) ax.axis('auto'); ax.axis('off') plt.show()

```
```

to get the final map.

However, just juxtaposing patches is not ideal. Libraries like *basemap* have inclusive and sophisticated routines to correcting for coma, geodetic aberrations and have more interesting options regarding plotting scales, legends…etc…

One has to import a little bit more stuff,

import pandas as pd import matplotlib.pyplot as plt from mpl_toolkits.basemap import Basemap from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon from matplotlib.collections import PatchCollection from descartes import PolygonPatch import fiona

but the generalization and flexibility gained is worth it.

Use mP_Basemap_ _open the file using _fiona,_

shp = fiona.open(flnm) bds = shp.bounds

*fiona* enables to extract the bounds of map simply with .bounds command, then the bounds can be used as latitude and longitude extremes in *basemap* plotting routine

m = Basemap( projection='tmerc', lon_0=-2., lat_0=49., ellps = 'WGS84', llcrnrlon=bounds[0] - extra * (bounds[2]-bounds[0]), llcrnrlat=bounds[1] - extra + 0.01 * (bounds[3]-bounds[1]), urcrnrlon=bounds[2] + extra * (bounds[2]-bounds[0], urcrnrlat=bounds[3] + extra + 0.01 * (bounds[3]-bounds[1]), lat_ts=0, resolution='i', suppress_ticks=True)

WGS84 is the chosen geodetic coordinate, have a look on the README.md and on the script inside the `Map/`

directory for further information on how to convert coordinate systems.

After this you can build a dataframe containing not only frames, but also the data of the shapefile, for example the ward name or properties in the case the map is containing those data,

# set up a map dataframe df_map = pd.DataFrame({ 'poly': [Polygon(xy) for xy in m.map] ,'properties': [ward['name'] for ward in m.map_info] }) df_map['area_m'] = df_map['poly'].map(lambda x: x.area) df_map['area_km'] = df_map['area_m'] / 10000. # draw ward patches from polygons df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch( x, fc='#555555', ec='#787878', lw=.75, alpha=.9, zorder=4))

to plot, use `df_map['patches']`

which contains the array of patches.

You can use the ratio between `bounds`

to have a correct figure scaling, and the `ellps`

guarantees you the correct coordinate representation.

Have a look in the function files on github for the complete functions.

After having obtained the dataframe, you can start by using it. You can `print m.map_info`

to realise what type of labels are inside the map properties. More in general I used to import the dictionary keys related to `map_info`

, inside a dataframe in this fashion

for dicti in m.map_info: temp_df = temp_df.append(pd.Series(dicti),ignore_index=True)

and with `colnames = list(df.columns.values`

you can eventually get a full list of column names to elaborate and automatise your shapefile analysis.

The map that I use has postcode sectors throughout the whole England, in the column `'label'`

. It joins the DataFrame of this map with an imported DataFrame of data, which has postcode sectors in the column `'Sector'`

, keeping only the data that appears in both

i1 = temp_df.set_index('label').index i2 = df.set_index('Sector').index temp_df = temp_df[i1.isin(i2)] #Select only the part that corresonds to the imported dataframe of data df_map = pd.concat([df_map, temp_df], axis=1, join='inner')

This new map concerns only the part of the map that is also in the imported dataset.

Using Jenks natural breaks algorithm to color the map, and using the dataset of mortgages selected for postcodes over the Greater London area, the result is the following:

The data has been taken from CML in this case, the .xlsx spreadsheet has been analysed with `openpyxl`

in order to extract a DataFrame containing only rows including a name (e.g. London or Manchester) effectively isolating a geographic area together with the intersection above.

The spreadsheet contains then mortgages totals for every quarters between 2012 and 2016 Q3. One can use this data to synthesise an index indicating the market health.

With only 16 datapoints in the time series any complicated time series analysis (i.e. any type of self regression or machine learning approach based on the time-series) is a steep road. I started by considering the most elementary condition of the market: an ordinary least squares (OLS) for a linear regression.

Many of the most stable results for the best sectors and districts, were, in fact, very well approximating a growing market.

Naively, best conditions you can have is a market which is already stable and has a good capital to start with, and then grows in a stable and sustainable fashion. If an area is deviating from this regime is more risky to invest in, so if the linear fit has an error on slope an intercept these can be simple accounted by subtraction in the index.

For this reason the first index I tried to build out of these data is simply given by the parameters of the OLS: slope , intercept , and errors on slope and intercept .

The index follows the OLS, $$ x = \textrm{sign} (m - \langle m \rangle) \frac{(m - \langle m \rangle)^2}{\langle \sigma_m \rangle^2} + \textrm{sign} (q -\langle q \rangle) \frac{(q - \langle q \rangle)^2}{\langle \sigma_q \rangle^2} -\frac{(\sigma_m - \langle \sigma_m \rangle)^2}{\langle \sigma_m \rangle^2} - \frac{(\sigma_q-\langle \sigma_q \rangle)^2}{\langle \sigma_q \rangle^2} , $$ where indicates the average value of the slope across the dataset.

This model is not deprived of defects and deficiencies, but is good first approximation and try to this dataset yielding sensible results over a vast geographical area.

The final result (with a primitive scale and the correct aspect ratio enabled) for the London area is in the following figure.

Again, I remind you about the github repository for the complete and hopefully working code (I understand it needs refactoring and debugging, is only a two days challenge) and further details. Please write me for any comment or insult.

*Andrea Idini.*