High Performance Spatial Data Analytics
Location: Canberra, ACT
If you are residing in a state other than that listed above, there may be some flexibility in the arrangement. Please contact the Business Development Manager for more information.
Duration: 5 months
Keywords: Data analytics platforms, Software development, Python, Linux commands, Bash scripting, spatial analysis, Spark
Start date: March 2018
The National Earth and Marine Observations (NEMO) branch of Geoscience Australia is responsible for maintaining a large volume and broad range of earth science information products, including collections of satellite-derived and marine data. As the size and diversity of these data grow, so does the complexity of processing and managing them effectively. The scale of the problem is such that manual processes are no longer effective, so it is necessary to explore ways of structuring and storing the data that allow for automation.
These are ‘Big Data’ problems, both for individual data sets and for collections as a whole. Geoscience Australia has a particular interest in handling large volumes of geographic point data (x, y and possibly z coordinates, often with a time attribute), as these underpin its ability to generate land and seabed surface models. It is looking for an intern to help progress new technologies for processing and analysing high volumes of point information.
Research to be Conducted
Geoscience Australia is seeking assistance with implementing a highly scalable system for performing analytics with large collections of geographic points. Currently they are working with Apache Spark, in conjunction with ESRI’s Spatial Tools for Hadoop to carry out this work. Specific examples include processing high resolution digital elevation models (DEMs) from LIDAR, and bathymetric surfaces from multibeam soundings. The intern would help to design modules to support highly scalable spatial operations.
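To illustrate the kind of per-cell aggregation such a module would distribute, the plain-Python sketch below bins (x, y, z) points into a regular grid and averages elevations per cell — the operation a Spark implementation would express as a map to (cell, z) pairs followed by a keyed reduction. The function name and grid parameters are illustrative only, not GA's actual code:

```python
from collections import defaultdict

def grid_mean_elevation(points, cell_size):
    """Bin (x, y, z) points into square cells of side cell_size and
    return the mean elevation per occupied cell, keyed by (col, row).
    A Spark job would express this as a map to (cell, z) pairs
    followed by a keyed reduction (e.g. reduceByKey)."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum of z, count]
    for x, y, z in points:
        cell = (int(x // cell_size), int(y // cell_size))
        acc = sums[cell]
        acc[0] += z
        acc[1] += 1
    return {cell: s / n for cell, (s, n) in sums.items()}

# Example: four lidar returns falling into two 10 m cells
pts = [(1.0, 2.0, 5.0), (3.0, 4.0, 7.0), (15.0, 2.0, 9.0), (16.0, 3.0, 11.0)]
dem = grid_mean_elevation(pts, cell_size=10.0)
# dem == {(0, 0): 6.0, (1, 0): 10.0}
```

Because the per-cell sums and counts combine associatively, the same logic scales to billions of soundings or lidar returns once expressed as a distributed reduction.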
There is also considerable scope for using MLlib (Spark’s machine learning library) to process the large point data sets, performing tasks such as spatial pattern detection, identifying and classifying features, or detecting space-time trends. Geoscience Australia also has a strong interest in ensuring delivery of data via web services (e.g. using GeoMesa – www.geomesa.org).
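As a toy illustration of the spatial-pattern-detection task, the following plain-Python Lloyd's-algorithm k-means groups 2-D point observations into clusters — the same computation that MLlib's KMeans runs in parallel across a Spark cluster. This is a local sketch under assumed inputs, not the distributed implementation itself:

```python
def kmeans_2d(points, centroids, iterations=10):
    """Toy Lloyd's algorithm on (x, y) points, mirroring the clustering
    that MLlib's KMeans performs in parallel across a Spark cluster.
    `centroids` are the initial guesses; returns the refined centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        buckets = [[] for _ in centroids]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            buckets[d.index(min(d))].append((x, y))
        # Update step: move each centroid to the mean of its bucket
        # (keep a centroid in place if its bucket is empty).
        centroids = [
            (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
            if b else c
            for b, c in zip(buckets, centroids)
        ]
    return centroids

# Two obvious spatial groupings of point observations
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centres = kmeans_2d(pts, centroids=[(0, 0), (10, 10)])
```

In MLlib both the assignment and update steps become distributed map and reduce stages, which is what makes the approach viable for the point volumes described above.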
Geoscience Australia has some interest in visualising high-volume point information, and experience in this area would be welcomed; however, it is not a primary focus of the project.
Geoscience Australia makes extensive use of Amazon Web Services, and can provide access to compute instances, database services, and the necessary open source software to perform scalable computing. They also have access to a specialist AWS Cloud-Enablement team, and have a number of in-house development teams working with a variety of different programming languages and software kits.
For this project, we are seeking a candidate with:
- Experience working with highly scalable data analytics platforms – specifically Apache Spark; additional experience with technologies such as Hadoop, HBase/BigTable or Zeppelin would be an asset
- Some experience with software development (essential)
- Knowledge of, or experience with, the Python programming language; when working specifically with Spark, GA also makes use of Scala and possibly Java
- Some familiarity with Linux commands and bash scripting (essential), as GA operates principally on the Linux operating system, on EC2 instances provided by Amazon Web Services (AWS)
- Experience with geospatial APIs associated with Spark (e.g. GeoMesa, GeoTrellis, GeoWave, Spatial Tools for Hadoop) would be highly desirable, as would basic familiarity with Geographic Information Systems (e.g. ArcGIS, QGIS)
- Experience with MLlib (Spark’s machine learning library) in conjunction with spatial analysis is also welcome
The principal expected outcome will be the development of software modules and workflows to support high-performance processing of point data, fulfilling the objectives described above. Reasonable documentation of the data model and software will be required (i.e. code alone would be insufficient), such that a competent developer could understand the work and continue its development.
The end result will need to be integrated with existing government business workflows and will need to align with ongoing operations. A presentation of the results to the NEMO management team would also be valuable upon completion of the project.
The intern will receive $3,000 per month of the internship, usually in the form of stipend payments.
It is expected that the intern will primarily undertake this research project during regular business hours, spending at least 80% of their time on-site with the industry partner. The intern will be expected to maintain contact with their academic mentor throughout the internship either through face-to-face or phone meetings as appropriate.
The intern and their academic mentor will have the opportunity to negotiate the project’s scope, milestones and timeline during the project planning stage.
To participate in the APR.Intern program, all applicants must satisfy the following criteria:
- Be a PhD student currently enrolled at an Australian University
- PhD candidature must be confirmed
- Applicants must have the written approval of their Principal Supervisor to undertake the internship. This approval must be submitted at the time of application
- Have Australian Citizenship or Permanent Residency
- Internships are also subject to any requirements stipulated by the student’s and the academic mentor’s university
Applications currently OPEN
INT – 0382