This presentation describes a technical framework, based on open source tools, for performing point-based exploratory data analysis and analytics over a range of viewing scales, together with examples of its usefulness. The technical development was divided into two phases: first, compiling the geographic data themes into a point grid designed for encoding and accessing data; second, developing a web application that visualizes and analyzes the themes on a map and offers on-demand visual and API services.
The primary goal was to enable exploratory data analysis of geographic phenomena, specifically using the statistical package R. Despite its strong analytical capabilities (1), R is arguably underutilized by the geographic community (2).
The second goal was to define an appropriate geographic database structure and a procedure that simplifies the often complex task of combining heterogeneous data sources at a uniform spatial resolution (3).
Database: Design and Compilation Workflow
To simplify populating a PostGIS database, an automated workflow for loading and aggregating manageable units of data was developed. First, a regular grid of sampling points is defined at a specified resolution. Data are then processed according to thematic type (e.g. population density) and assigned as field values to the closest point (i.e. by polygon intersection). Finally, the data values are aggregated (averaged) into a range of resolutions (map scales).
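The grid and aggregation steps can be illustrated with a minimal Python sketch. It is not the PostGIS workflow itself: the function names `make_grid` and `aggregate`, the planar coordinates in meters, and the sample values are all assumptions made for illustration, and the polygon-intersection step that assigns values to points is left out.

```python
from collections import defaultdict
from statistics import mean


def make_grid(x_min, y_min, x_max, y_max, step):
    """Return (x, y) sampling points on a regular grid with the given spacing."""
    nx = int((x_max - x_min) // step) + 1
    ny = int((y_max - y_min) // step) + 1
    return [(x_min + i * step, y_min + j * step) for j in range(ny) for i in range(nx)]


def aggregate(points, values, origin, coarse_step):
    """Average point values into the cells of a coarser grid.

    Each point is assigned to the coarse cell that contains it. The result maps
    the lower-left corner of each cell to the mean of the values inside it.
    """
    x0, y0 = origin
    cells = defaultdict(list)
    for (x, y), value in zip(points, values):
        if value is None:  # point without data for this theme
            continue
        col = int((x - x0) // coarse_step)
        row = int((y - y0) // coarse_step)
        cells[(x0 + col * coarse_step, y0 + row * coarse_step)].append(value)
    return {cell: mean(vals) for cell, vals in cells.items()}


# Hypothetical 4 x 4 grid at 250 m spacing, aggregated to 500 m cells.
grid = make_grid(0, 0, 750, 750, 250)
density = [float(i) for i in range(len(grid))]  # made-up theme values
coarse = aggregate(grid, density, origin=(0, 0), coarse_step=500)
```

Repeating `aggregate` with larger `coarse_step` values would yield one table per map scale, which is one plausible way to store the range of resolutions mentioned above.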
Web service: Design and Architecture
A web application based on OpenCPU (4), integrating the PostGIS database and the statistical package R, was developed. OpenCPU allows R functions to be called through AJAX, effectively streaming database points, via R, to a browser client. One example is calculating the linear correlation (cor) between two data themes.
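As a rough illustration of such a call, the sketch below builds the HTTP request that OpenCPU expects for R's `cor` function in the `stats` package. Python stands in for the browser's AJAX call here, and the server address `https://example.org/ocpu`, the helper name `build_cor_request`, and the sample vectors are placeholders rather than details of the deployed system.

```python
import json
from urllib import request


def build_cor_request(base_url, x, y):
    """Build (but do not send) an OpenCPU call to R's stats::cor.

    OpenCPU exposes R functions at /library/<package>/R/<function>; the
    trailing /json asks for the return value encoded as JSON.
    """
    url = f"{base_url.rstrip('/')}/library/stats/R/cor/json"
    body = json.dumps({"x": list(x), "y": list(y)}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder server and made-up values for two data themes.
req = build_cor_request("https://example.org/ocpu", [1, 2, 3, 4], [2, 4, 5, 9])

# Sending is left commented out because the server address is a placeholder:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```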
Points are extracted at a given resolution (map scale), encoded as JSON, and rendered on a map interface. The overall load is minimized by matching data resolution to zoom level, so that an optimal number of points is processed and transferred as the user zooms the map.
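One way to coordinate resolution with zoom level is sketched below. It assumes a Web Mercator map with 256-pixel tiles, which the abstract does not state, and the function names, the list of stored resolutions, and the 8-pixel minimum spacing are illustrative choices only.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016.686  # WGS 84 equatorial circumference
TILE_SIZE_PX = 256                      # assumed tile size


def meters_per_pixel(zoom, latitude_deg=0.0):
    """Ground resolution of a Web Mercator map at a given zoom and latitude."""
    scale = TILE_SIZE_PX * 2 ** zoom
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(latitude_deg)) / scale


def resolution_for_zoom(zoom, resolutions_m, latitude_deg=0.0, min_spacing_px=8):
    """Pick the finest stored resolution whose points stay at least
    min_spacing_px apart on screen."""
    needed = meters_per_pixel(zoom, latitude_deg) * min_spacing_px
    for res in sorted(resolutions_m):
        if res >= needed:
            return res
    return max(resolutions_m)  # zoomed far out: fall back to the coarsest level


# Hypothetical resolutions stored in the database, in meters.
levels = [250, 500, 1000, 2000, 4000]
chosen = resolution_for_zoom(11, levels, latitude_deg=37.0)
```

With a rule of this kind, the number of points in the viewport stays roughly constant across zoom levels, because each zoom step halves both the visible extent and the point spacing.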
The system was deployed on a modest Google Cloud Compute Engine Instance (2 vCPUs, 7.5 GB).
Use case: Income Inequality Study
Openly available data, including census data for the state of California, were captured at a 250-meter resolution, loaded into a database, and configured as a custom OpenCPU server app.
A number of areas with a large gradient in income per capita were identified by applying a nearest neighbor filtering function at varying distances.
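The filtering function itself is not described here, so the following is only one possible interpretation: for each grid point, take the largest absolute difference in value to any neighbor within a given distance, then keep the points where that difference exceeds a threshold. The function names, the threshold, and the income figures are made up for illustration.

```python
import math


def local_gradient(points, values, radius):
    """For each point, return the largest absolute value difference to any
    neighbor within radius (brute force, O(n^2))."""
    result = []
    for i, (xi, yi) in enumerate(points):
        largest = 0.0
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                largest = max(largest, abs(values[i] - values[j]))
        result.append(largest)
    return result


def high_gradient_points(points, values, radius, threshold):
    """Return the points whose local gradient exceeds threshold."""
    gradients = local_gradient(points, values, radius)
    return [p for p, g in zip(points, gradients) if g > threshold]


# Four hypothetical grid points 250 m apart with made-up income per capita.
pts = [(0, 0), (250, 0), (500, 0), (750, 0)]
income = [30000, 32000, 80000, 82000]
flagged = high_gradient_points(pts, income, radius=250, threshold=20000)
```

The brute-force neighbor search is adequate for a small sketch; on a full grid, a spatial index or the regular grid structure itself would be the natural way to find neighbors.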
These areas were further analyzed by applying linear correlation methods to explore relationships with income, race and other variables.
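For reference, the linear (Pearson) correlation coefficient, which is the default method of R's `cor`, can be computed as in the sketch below. The helper name `pearson` and the sample values are illustrative and do not come from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up values for two themes sampled at the same grid points.
income = [30000, 32000, 80000, 82000]
other_theme = [0.9, 0.8, 0.3, 0.2]
r = pearson(income, other_theme)
```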
(1) Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied Spatial Data Analysis with R. New York: Springer.
(2) Lovelace, R., & Oldroyd, R. (2015). Teaching R as a GIS: problems, solutions and lessons learned. Geomatics Workbooks n° 12, "FOSS4G Europe Como 2015". Centre for Spatial Analysis and Policy, University of Leeds. http://geomatica.como.polimi.it/workbooks/n12/FOSS4G-eu15_submission_130...
(3) Re Calegari, G., & Celino, I. (2015). A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data. Geomatics Workbooks n° 12, "FOSS4G Europe Como 2015". CEFRIEL, Politecnico di Milano. http://geomatica.como.polimi.it/workbooks/n12/FOSS4G-eu15_submission_110...
(4) Ooms, J. (2014). The OpenCPU system: Towards a universal interface for scientific computing through separation of concerns. arXiv preprint arXiv:1406.4806. https://www.opencpu.org/