Hướng dẫn spatial data python
Attention Show Finnish university students are encouraged to use the CSC Notebooks platform. Others can follow the lesson interactively using Binder. Check the rocket icon on the top of this page. In the first week, we will take a quick tour to Python’s (spatial) data science ecosystem and see how we can use some of the fundamental open source Python packages, such as:
As you can see, we won’t use any GIS software for doing the programming (such as ArcGIS/arcpy or QGIS), but focus on learning the open source packages that are independent from any specific software. These libraries form nowadays not only the core for modern spatial data science, but they are also fundamental parts of commercial applications used and developed by many companies around the world. Note If you have experience working with the Python’s spatial data science stack, this tutorial probably does not bring much new to you, but to get everyone on the same page, we will all go through this introductory tutorial. Contents:
Fundamental library: Geopandas¶In this course, the most
often used Python package that you will learn is geopandas. Geopandas makes it possible to work with geospatial data in Python in a relatively easy way. Geopandas combines the capabilities of the data analysis library pandas with other packages like shapely and
fiona for managing spatial data. The main data structures in geopandas are Reading and writing spatial data¶Next we will learn some of the basic functionalities of geopandas. We have a couple of GeoJSON files
stored in the We can read the data easily with import geopandas as gpd # Filepath fp = "data/buildings.geojson" # Read the file data = gpd.read_file(fp) # How does it look? data.head()
5 rows × 34 columns As we can see, the GeoDataFrame contains various attributes in separate columns. The
From here, we can see that our data is indeed a GeoDataFrame object with 486 entries and 34 columns. You can also get descriptive statistics of your data by calling:
In this case, we didn’t have many columns with numerical data, but typically you have numeric values in your dataset and this is a good way to get a quick view how the data look like. Naturally, as the data is spatial, we want to visualize it to understand the nature of the data better. We can do this easily with Now we can see that the data indeed represents buildings (in central Helsinki). Naturally we can also write this data to disk. Geopandas supports writing data to various data formats as well as to PostGIS which is the most widely used open source database for GIS. Let’s write the data as a Shapefile to the same # Output filepath outfp = "data/buildings_copy.shp" data.to_file(outfp) Retrieving data from OpenStreetMap¶Now we have seen how to read spatial data
from disk. OpenStreetMap (OSM) is probably the most well known and widely used spatial dataset/database in the world. Also in this course, we will use OSM data frequently. Hence, let’s see how we can retrieve data from OSM using a library called pyrosm. With Note In case you want to extract OSM data from smaller areas, e.g. using a buffer of 2 km around a specific location, we recommend using OSMnx library, which is more flexible in terms of specifying the area of interest. Let’s see how we can download and extract OSM data for Helsinki
Region using from pyrosm import OSM, get_data # Download data for Helsinki fp = get_data("helsinki") # Initialize the reader object for Helsinki osm = OSM(fp) Downloaded Protobuf data 'Helsinki.osm.pbf' (28.79 MB) to: '/tmp/pyrosm/Helsinki.osm.pbf' As a first step, we downloaded the data for “Helsinki” using the OSM is a “database of the world”, hence it contains a lot of information about different things. With
Let’s see how we can read all the buildings from Helsinki Region: buildings = osm.get_buildings()
5 rows × 40 columns Let’s check how many buildings did we get: Okay, so there are more than 150 thousand buildings in the Helsinki Region. Naturally, we would like to see them on a map. Let’s plot our data using Great! As a result we got a map that seems to look correct. Reprojecting data¶As we can see from the previous maps that we have produced, the coordinates on the x and y axis hint that our geometries are represented in decimal degrees (WGS84). In many cases, you want
to reproject your data to another CRS. Luckily, doing that is easy with
As a result, we get information about the CRS, and we can see that the data is indeed in WGS84. We can also see that the EPSG code for the CRS is 4326. We can easily reproject our data by using a method
projected = buildings.to_crs(epsg=3067) projected.crs
As we can see, now we have an orig_geom = buildings.loc[0, "geometry"] projected_geom = projected.loc[0, "geometry"] print("Orig:\n", orig_geom, "\n") print("Proj:\n", projected_geom) Orig: POLYGON ((24.8212885 60.1871792, 24.8216351 60.1871237, 24.8218626 60.1870873, 24.8218641 60.1870934, 24.8218654 60.1870987, 24.8219228 60.1870952, 24.8219186 60.1870783, 24.8219949 60.1870661, 24.8225411 60.1869791, 24.8224862 60.186896, 24.8224626 60.1868996, 24.8224423 60.1869026, 24.8223976 60.1867789, 24.8221329 60.1867998, 24.8221083 60.1867761, 24.8220836 60.1867524, 24.822336 60.186716, 24.8223039 60.186662, 24.8223248 60.1866583, 24.8223455 60.1866546, 24.8222782 60.1865828, 24.8222717 60.1865839, 24.8217847 60.1866948, 24.821782 60.1866868, 24.821718 60.1866924, 24.821721 60.1867002, 24.8217239 60.1867078, 24.8211387 60.1867604, 24.8211339 60.1867474, 24.8210732 60.1867528, 24.8210758 60.18676, 24.8210779 60.1867659, 24.8204964 60.1868181, 24.8204917 60.186805, 24.8204286 60.1868107, 24.8204316 60.1868189, 24.8204333 60.1868238, 24.8203009 60.1868357, 24.8203177 60.1868814, 24.8203355 60.18688, 24.820371 60.1869699, 24.8204695 60.186961, 24.8204728 60.18697, 24.8204764 60.18698, 24.820483 60.1869963, 24.8205879 60.1872905, 24.8206157 60.1873401, 24.8206263 60.1873572, 24.8209968 60.1872991, 24.8209637 60.1872465, 24.8209549 60.1872326, 24.821232 60.1871882, 24.8212332 60.1871951, 24.8212348 60.1872038, 24.8212923 60.1872013, 24.8212885 60.1871792)) Proj: POLYGON ((379178.3725997981 6674250.711355622, 379197.3848824766 6674243.898270451, 379209.8642355002 6674239.429592708, 379209.9698097793 6674240.105962586, 379210.0613564497 6674240.693634215, 379213.2308787974 6674240.198960315, 379212.9359337318 6674238.325163864, 379217.1213550152 6674236.827341885, 379247.08437153 6674226.142463202, 379243.7353683548 6674216.991336721, 379242.4401472288 6674217.435292421, 379241.3256829249 6674217.806414308, 379238.3930495286 6674204.116628114, 379223.7941370049 6674206.927641303, 379222.343184846 6674204.334118548, 379220.8866864759 6674201.740779295, 379234.7467383571 6674197.226635681, 379232.768672102 6674191.273522315, 379233.9138384764 6674190.823369212, 379235.0479165341 6674190.373582317, 379231.0528703418 6674182.503181034, 379230.6965310586 6674182.63753462, 379204.1032204548 6674195.875016209, 379203.9241322995 6674194.989314756, 379200.3963642038 6674195.729866023, 379200.5913512165 6674196.592752348, 379200.7800590387 6674197.433555618, 379168.5282306731 6674204.360432023, 379168.2143285879 6674202.92192481, 379164.8687999398 6674203.634204664, 379165.0394128716 6674204.431023199, 379165.1775265622 6674205.084027647, 379133.129487243 6674211.959913267, 379132.8207482077 6674210.510092845, 379129.343272631 6674211.260198221, 379129.5397465113 6674212.16761301, 379129.6520130131 6674212.710018366, 379122.355162651 6674214.277255361, 379123.4546156952 6674219.334283684, 379124.4363452866 6674219.145831282, 379126.735067006 6674229.089415575, 379132.163420909 6674227.918240647, 379132.3794672409 6674228.914170477, 379132.6158224508 6674230.020881242, 379133.0416649691 6674231.823479696, 379139.9390695336 6674264.384770593, 379141.6626842505 6674269.855854501, 379142.31322685 6674271.740195557, 379162.640842104 6674264.593717466, 379160.6123887796 6674258.79833438, 379160.0734109332 6674257.266952087, 379175.2732003523 6674251.816728769, 379175.3650882723 6674252.582710911, 379175.4857679585 6674253.548355347, 379178.6644948177 6674253.16479852, 379178.3725997981 6674250.711355622)) As we can see the coordinates that form our Polygon has changed from decimal degrees to meters. Let’s see what happens if we just call the geometries: As you can see, we can draw the geometry directly in the screen, and we can easily see the difference in the shape of the two geometries. The shapely.geometry.polygon.Polygon These shapely geometries are used as the underlying data structure in most GIS packages in Python to present geometrical information. Shapely is fundamentally a Python wrapper for GEOS which is widely used library (written in C++) under the hood of many GIS softwares such as QGIS, GDAL, GRASS, PostGIS, Google Earth etc. Currently, there is ongoing work to vectorize all the GEOS functionalities for Python and bring those eventually into Shapely which will greatly boost the performance of all geometry related operations in Python ecosystem (approaching the same efficiency as PostGIS). Some of these improvements can already be found under the hood of latest version of geopandas. Calculating area¶One thing that is quite often interesting to know when working with spatial data, is the projected["building_area"] = projected.area projected["building_area"].describe() count 153642.000000 mean 290.498895 std 951.149675 min 0.000000 25% 69.514452 50% 143.880110 75% 264.057998 max 81335.830442 Name: building_area, dtype: float64 We calculated the area by calling Spatial join¶A commonly needed GIS functionality, is to be able to merge information between two layers using location as the # Read Points of Interest using the same OSM reader object that was initialized earlier restaurants = osm.get_pois(custom_filter={"amenity": ["restaurant"]}) restaurants.plot()
As we can see, the OSM for Helsinki Region contains 1388 restaurants altogether. As you can probably guess, the OSM data is far from “perfect” in terms of the quality of the restaurant listings. This is due to the voluntary nature of adding information to the OpenStreetMap, and the fact restaurants (as well as other POI features) are highly dynamic by nature, i.e. new amenities open and close all the time, and it is challenging to keep up to date with those changes (this is a challenge even for commercial companies). Joining data from buildings to the restaurants can be done easily using # Join information from buildings to restaurants join = gpd.sjoin(restaurants, buildings) # Print column names print(join.columns) # Show rows join Index(['changeset_left', 'tags_left', 'lon', 'id_left', 'timestamp_left', 'version_left', 'lat', 'addr:city_left', 'addr:country_left', 'addr:housenumber_left', 'addr:housename_left', 'addr:postcode_left', 'addr:place_left', 'addr:street_left', 'email_left', 'name_left', 'opening_hours_left', 'operator_left', 'phone_left', 'ref_left', 'url_left', 'website_left', 'amenity_left', 'bar', 'cafe', 'internet_access_left', 'office_left', 'pub', 'restaurant', 'source_left', 'start_date_left', 'wikipedia_left', 'geometry', 'osm_type_left', 'building_left', 'building:levels_left', 'index_right', 'addr:city_right', 'addr:country_right', 'addr:full', 'addr:housenumber_right', 'addr:housename_right', 'addr:postcode_right', 'addr:place_right', 'addr:street_right', 'email_right', 'name_right', 'opening_hours_right', 'operator_right', 'phone_right', 'ref_right', 'url_right', 'website_right', 'building_right', 'amenity_right', 'building:flats', 'building:levels_right', 'building:material', 'building:min_level', 'building:use', 'craft', 'height', 'internet_access_right', 'landuse', 'levels', 'office_right', 'shop', 'source_right', 'start_date_right', 'wikipedia_right', 'id_right', 'timestamp_right', 'version_right', 'tags_right', 'osm_type_right', 'changeset_right'], dtype='object')
1368 rows × 76 columns # Visualize the data as well join.plot() As we can see from the above, now we have merged information from the buildings to restaurants. The geometries of the Selecting data using sjoin¶One handy trick and efficient trick for spatial join is to use it for selecting data. We can e.g. select all buildings that intersect with restaurants by conducting the spatial join other way around, i.e. using the buildings as the left GeoDataFrame and the restaurants as the right GeoDataFrame: # Merge information from restaurants to buildings (conducts selection at the same time) join2 = gpd.sjoin(buildings, restaurants, how="inner", op="intersects") join2.plot() As we can see (although the small building geometries are a bit poorly visible), the end result is a layer of buildings which intersected with the restaurants. This is a straightforward way to conduct simple spatial
queries. You can specify with
Plotting data with matplotlib¶Thus far, we haven’t really made any effort to make our maps visually appealing. Let’s next see how we can adjust the appearance of our map, and how we can visualize many layers on top of each other. Let’s start by visualizing the buildings that we selected earlier and adjust a bit of the
colors and figuresize. We can adjust the color of polygons with ax = join2.plot(facecolor="red", figsize=(12,12)) Index(['addr:city_left', 'addr:country_left', 'addr:full', 'addr:housenumber_left', 'addr:housename_left', 'addr:postcode_left', 'addr:place_left', 'addr:street_left', 'email_left', 'name_left', 'opening_hours_left', 'operator_left', 'phone_left', 'ref_left', 'url_left', 'website_left', 'building_left', 'amenity_left', 'building:flats', 'building:levels_left', 'building:material', 'building:min_level', 'building:use', 'craft', 'height', 'internet_access_left', 'landuse', 'levels', 'office_left', 'shop', 'source_left', 'start_date_left', 'wikipedia_left', 'id_left', 'timestamp_left', 'version_left', 'geometry', 'tags_left', 'osm_type_left', 'changeset_left', 'index_right', 'changeset_right', 'tags_right', 'lon', 'id_right', 'timestamp_right', 'version_right', 'lat', 'addr:city_right', 'addr:country_right', 'addr:housenumber_right', 'addr:housename_right', 'addr:postcode_right', 'addr:place_right', 'addr:street_right', 'email_right', 'name_right', 'opening_hours_right', 'operator_right', 'phone_right', 'ref_right', 'url_right', 'website_right', 'amenity_right', 'bar', 'cafe', 'internet_access_right', 'office_right', 'pub', 'restaurant', 'source_right', 'start_date_right', 'wikipedia_right', 'osm_type_right', 'building_right', 'building:levels_right'], dtype='object') Now with the bigger figure size, it is already a bit easier to see the selected buildings that have a restaurant inside them (according OSM). Let’s color our buildings based on the building type. Hence, each building type category will receive a different color: ax = join2.plot(column="building_left", cmap="RdYlBu", figsize=(12,12), legend=True) Now we used the parameter # Zoom into city center by specifying X and Y coordinate extent # These values should be given in the units that our data is presented (here decimal degrees) xmin, xmax = 24.92, 24.98 ymin, ymax = 60.15, 60.18 # Plot the map again ax = join2.plot(column="building_left", cmap="RdYlBu", figsize=(12,12), legend=True) # Control and set the x and y limits for the axis ax.set_xlim([xmin, xmax]) ax.set_ylim([ymin, ymax]) Now it is much easier to see how the building types are distributed in the city. To get a bit more context to our visualizaton. Let’s also add roads with our buildings. To do that we first need to extract the roads from OSM: # Get roads (retrieves walkable roads by default) roads = osm.get_network() Now we can continue and add the roads as a layer to our visualization with gray line color: # Zoom into city center by specifying X and Y coordinate extent # These values should be given in the units that our data is presented (here decimal degrees) xmin, xmax = 24.92, 24.98 ymin, ymax = 60.15, 60.18 # Plot the map again ax = join2.plot(column="building_left", cmap="RdYlBu", figsize=(12,12), legend=True) # Plot the roads into the same axis ax = roads.plot(ax=ax, edgecolor="gray", linewidth=0.75) # Control and set the x and y limits for the axis ax.set_xlim([xmin, xmax]) ax.set_ylim([ymin, ymax]) Perfect! Now it is much easier to understand our map because the roads brought much more context (assuming you know Helsinki). We ware able to add the roads to the same map by specifying the # Import matplotlib pyplot and use a dark_background theme import matplotlib.pyplot as plt plt.style.use("dark_background") # Zoom into city center by specifying X and Y coordinate extent # These values should be given in the units that our data is presented (here decimal degrees) xmin, xmax = 24.92, 24.98 ymin, ymax = 60.15, 60.18 # Plot the map again ax = join2.plot(column="building_left", cmap="RdYlBu", figsize=(12,12), legend=True) # Plot the roads into the same axis ax = roads.plot(ax=ax, edgecolor="gray", linewidth=0.75) # Control and set the x and y limits for the axis ax.set_xlim([xmin, xmax]) ax.set_ylim([ymin, ymax]) Awesome! Now we have a nice dark theme with our map. With this information you should be able to get going with Exercise 1. Further information¶For further information, we recommend checking the materials from Automating GIS Processes -course (GIS things) and Geo-Python -course (intro to Python and data analysis with pandas). In addition, we always recommend to check the latest documentation from the websites of the libraries:
|