In [1]:
from IPython.core.display import Image

CCCma Python Group Seminar

23-Jun-2014

Doug Latornell
Earth, Ocean & Atmospheric Sciences, UBC

http://douglatornell.ca

@dlatornell

http://bit.ly/DJL-CCCma-23Jun2014

  • Version Control

  • Python Tools and Glue

  • Visualization of Model Results

Part 1 - Version Control

  • What is it?
  • Why use it?
  • What for?
  • Key Concept
  • History of Tools, and Their Pros and Cons

Git - Distributed Version Control

  • Key Disciplines
  • GUIs

Collaboration via Distributed Version Control

What is Version Control (VC)?

Use software tools to keep a running record of the changes to 1 or more files.

Why You Should Use VC

  • Lets you revert to earlier versions of your work
  • Provides a record of what changed when
  • Lets you mark significant points in time
  • Allows you to play "what-if?"
  • Facilitates organized collaboration
    • with your future self, as well as with other people
  • Sync files among computers; laptop to desktop, desktop to HPC, ...
  • "Provenance and change tracking are key to the scientific method; version control is the best actual way to do it" - Titus Brown

What You Should Use VC For

  • Model Code
  • Matlab/IDL/Python/R Scripts
  • Plotting Scripts
  • Processed Data Files & Scripts That Made Them
  • Thesis
  • Papers
  • Reports
  • ToDo List

Key Concept

  • Data differencing

  • Unix diff and patch utilities

  • Given a file, and a complete set of diffs between one state and another, any intermediate state for which there is a diff can be reconstructed.
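
A minimal sketch of the diff idea, using Python's standard library difflib module rather than the Unix utilities. The file contents are made up for illustration. It prints a unified diff, then rebuilds both states from a single delta:

```python
import difflib

# Two states of a small, made-up file, as lists of lines
old = ["wind = 5.0\n", "depth = 40\n", "steps = 100\n"]
new = ["wind = 5.0\n", "depth = 45\n", "steps = 100\n", "output = daily\n"]

# A human-readable diff, similar to what the Unix diff utility prints
unified = list(difflib.unified_diff(old, new, fromfile="old", tofile="new"))
print("".join(unified))

# An ndiff delta carries enough information to rebuild either state
delta = list(difflib.ndiff(old, new))
rebuilt_old = list(difflib.restore(delta, 1))
rebuilt_new = list(difflib.restore(delta, 2))
```

Version control tools store and apply deltas far more efficiently than this, but the principle is the same.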

Version Control Tools

http://en.wikipedia.org/wiki/Revision_control

Ad hoc:

thesis2.tex, JFM-21mar.doc, pooh.txt, ...

Mists of time...

  • SCCS
  • RCS

Proprietary:

  • Visual SourceSafe
  • Perforce
  • BitKeeper

Old School (Client/Server):

  • CVS (Concurrent Version System)
  • SVN (Subversion)

Distributed & Open Source:

  • GNU arch
  • Darcs
  • Monotone
  • Bazaar

  • Git

  • Mercurial

Pros and Cons

Ad hoc

  • Easy to do, if you think of it
  • Works best if you have a system
  • stuff1.txt, stuff2.f90, stuff4.m probably isn't a good enough system
  • Hard to provide your future self with enough metadata

Client/Server

  • Good for centrally controlled project; e.g. NEMO, ROMS, ...
  • Work required to set up and administer
  • Committing changes feels like a big deal
  • Requires network connection

Distributed

  • Almost zero set up
  • No network required
  • Every copy of a repository is a full backup
  • Scalable to big projects
  • Usable for central control

Git Commands to Start a Project

$ git init myproject

$ cd myproject

add/create some files

$ git add .

$ git commit -m"Initial commit."

Key Disciplines

Commit Early, Commit Often

  • Small incremental changes are easier to understand
  • You can't revert to a diff that doesn't exist

Make Commit Messages Informative

Git Commands to See What's Going On

Print revision history of files or whole repository:

$ git log

Show differences between revisions:

$ git diff

Show status of files (e.g. modified, added, removed, missing, not tracked):

$ git status

N.B. There are lots of options for each command. See git help <command>

Tags

Tags are symbolic names for specific revisions in the repository. Most often you assign a tag to the current revision (HEAD) to mark a significant event.

Tag the current revision as jgr_1:

$ git tag --annotate -m"1st submission to JGR." jgr_1

Print a list of the tags in the repository:

$ git tag --list

Git Commands to Work in a Shared Project

$ git clone project_repo

$ cd project

edit some files

$ git add file1 file2 file3

$ git commit -m"My changes."

$ git push

project_repo can be a path, or a URL (http, https, ssh)

Collaboration

  • User to user
  • Shared repos on a private or public server
  • Web services like Bitbucket or GitHub

Bitbucket and GitHub

https://bitbucket.org

  • Mercurial or Git
  • Free unlimited public repos
  • Free private repos with 5-8 collaborators; unlimited with educational identity
  • Per-repo issue trackers, wikis
  • Forking, pull requests
  • GUI clients
  • Getting started: Bitbucket 101

https://github.com

  • Git only
  • Free unlimited public repos
  • Monthly fee for private repos; maybe reduces with educational identity
  • Per-repo issue trackers, wikis
  • Forking, pull requests
  • GUI clients
  • More buzz in open source communities
  • Getting started: GitHub Bootcamp

Part 2 - Python Tools and Glue

  • Python
  • The scientific Python stack

Python as Glue

  • Command Processors for the SOG 1-D and Salish Sea NEMO models
  • Code repo maintenance tool for NEMO codebase
  • Automated testing of SOG 1-D model
  • Packages of functions for re-use and sharing
  • Running the SOG 1-D model in real-time forecast mode - SoG-bloomcast

Python

  • http://python.org
  • Created in 1989 by Guido van Rossum
  • Clear, readable syntax
  • General purpose language
  • Well documented, free, and cross-platform
  • Expressive
  • Dynamic execution
  • Very high level, dynamic data types
  • Extensive standard library, and ecosystem of 3rd-party packages
  • Easily extended in C and C++

Python for Science & Engineering

  • http://scipy.org
  • NumPy - N-dimensional arrays
  • SciPy - Library of fundamental scientific algorithms (in many cases just Python wrappers around time-tested Fortran and C implementations)
  • Matplotlib - 2D plotting
  • IPython Notebook - enhanced Python shell in the browser with rich text, math notation, inline plots, ...
  • Pandas - Statistical data analysis and modeling
  • The list goes on...

The Rubicon for Python in Science

  • Curated distributions - Anaconda
  • Expanded and Improved Documentation for NumPy, SciPy, and friends
  • IPython Notebook

Python as Glue - Command Processors for Models

  • Python has several command line tool frameworks

SOG 1-D Coupled Physics-Biogeochemical Model

  • Based on the argparse standard library module

  • SOG run - Runs model from 1 or more parameter value files
  • SOG batch - Execute a series of runs, possibly concurrently
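
A rough sketch of how sub-commands like these can be built on argparse. This is not the SOG command processor's actual code; the argument names (infiles, batchfile, --max-concurrent) and the file names are hypothetical:

```python
import argparse


def build_parser():
    """Build a parser with `run` and `batch` sub-commands."""
    parser = argparse.ArgumentParser(prog="SOG")
    subparsers = parser.add_subparsers(dest="command")

    # Hypothetical arguments: 1 or more parameter value files
    run = subparsers.add_parser("run", help="run the model once")
    run.add_argument("infiles", nargs="+", help="parameter value file(s)")

    # Hypothetical arguments: a batch description file and a concurrency limit
    batch = subparsers.add_parser("batch", help="execute a series of runs")
    batch.add_argument("batchfile", help="file describing the runs")
    batch.add_argument("--max-concurrent", type=int, default=1,
                       help="number of runs to execute at once")
    return parser


# Parse an example command line instead of sys.argv
args = build_parser().parse_args(["run", "base.yaml", "edits.yaml"])
print(args.command, args.infiles)
```

Each sub-command gets its own arguments and help text, so `SOG run --help` and `SOG batch --help` document themselves.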

Salish Sea NEMO 3-D Model

  • Based on the cliff command line tool package

  • salishsea run tides.yaml iodef.xml ../results/tides3

  • Evolved from:
    • salishsea prepare - Create a NEMO run directory containing files and symlinks
    • salishsea combine - Combine per-processor netCDF results files; optionally compress
    • salishsea gather - combine + move run inputs, outputs & metadata to results directory
  • salishsea get_cgrf manages CGRF atmospheric forcing file collection

Marlin - NEMO Code Repo Maintenance

  • Operates on a repo that is both a Mercurial repo and a SVN checkout of NEMO
  • Automates pulling SVN updates 1 by 1 and commits them to Mercurial
  • Merging local changes and testing done manually (for now...)

Automated Testing of SOG 1-D Model

  • Manager process:
    • Listens for changes pushed to SOG code repo
    • Has schedule of weekly tests
  • Worker processes:
    • 11 test cases distributed over 7 workstations
  • Web interface for status and on-demand test runs
  • Email notification of test failures

Packages of Functions to Re-use and Share

Python Packaging Tools

  • Aggregate functions, class definitions, etc. in modules
  • Collect modules in packages
  • Namespaces
  • Manage dependencies

SalishSeaTools Package

  • bathy_tools - Viewing and manipulation of netCDF bathymetry files
  • nc_tools - Exploring and managing the attributes of netCDF files
  • tidetools - Analysis and plotting of tidal harmonics results from NEMO
  • stormtools - Analysis of storm surge results from the Salish Sea Model
  • viz_tools - Functions to do routine tasks associated with plotting and visualization
  • hg_commands - API for Mercurial commands that other tools use
  • namelist - Parse Fortran namelist files as Python dictionary data structures

  • Publicly available in the https://bitbucket.org/salishsea/tools repo
  • Documentation at http://salishsea-meopar-tools.readthedocs.org/en/latest/SalishSeaTools/salishsea-tools.html
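
To give a flavour of the namelist idea, here is a toy parser for simple `&group ... /` blocks. It is a sketch only and not the code in the SalishSeaTools namelist module: the function names and the sample namelist are made up, and it ignores arrays, repeat counts, and `!` characters inside strings:

```python
import re


def convert(value):
    """Convert a namelist value string to a Python bool, int, float or str."""
    if value.lower() in (".true.", ".false."):
        return value.lower() == ".true."
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value.strip("'\"")


def parse_namelist(text):
    """Parse simple `&group ... /` blocks into nested dictionaries."""
    result = {}
    for group, body in re.findall(r"&(\w+)(.*?)^\s*/", text, re.S | re.M):
        items = {}
        for line in body.splitlines():
            # Drop trailing comments and separators
            line = line.split("!")[0].strip().rstrip(",")
            if "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            items[key] = convert(value)
        result[group] = items
    return result


sample = """
&namrun
   nn_it000 = 1      ! first time step
   cn_exp = 'tides'
   ln_rstart = .false.
/
"""
parsed = parse_namelist(sample)
print(parsed)
```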

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

Daily, quasi-operational forecast of the 1st spring phytoplankton bloom in the Strait of Georgia:

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
  5. Create time series and depth profile plots
  6. Render a results commentary and the plots as an HTML page via a template
  7. Push the HTML page to a web site

Do all of that while we get on with other research!

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows

Requests - HTTP for Humans

http://docs.python-requests.org/en/latest/

In [11]:
import requests

url = 'http://climate.weather.gc.ca/climateData/bulkdata_e.html'
params = {
    'stationID': 6831,
    'format': 'xml',
    'Year': 2014,
    'Month': 6,
    'Day': 1,
    'timeframe': 1,
}
response = requests.get(url, params=params)

print(response.text[:1000])
<?xml version="1.0" encoding="utf-8"?><climatedata xmlns:xsd="http://www.w3.org/TR/xmlschema-1/" xsd:schemaLocation="http://www.climate.weatheroffice.gc.ca/climateData/bulkxml/bulkschema.xsd"><lang>ENG</lang><legend>
                   <flag>
                       <symbol>M</symbol>
                       <description>Missing</description>
                   </flag>
                   <flag>
                       <symbol>E</symbol>
                       <description>Estimated</description>
                   </flag>
                   <flag>
                       <symbol>NA</symbol>
                       <description>Not Available</description>
                   </flag>
                   <flag>
                       <symbol>**</symbol>
                       <description>Partner data that is not subject to review by the National Climate Archives</description>
                   </flag>
                </legend>
<stationinformation><name>SANDHEADS CS</name><province>BRITISH COLU
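
Once the XML has arrived, the standard library can pull values out of it. A small sketch, using a hand-trimmed copy of the legend from the output above instead of a live response; with a real response, something like `ET.fromstring(response.content)` would presumably take the place of the sample string:

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample based on the legend element in the response above
sample = """<climatedata><lang>ENG</lang><legend>
<flag><symbol>M</symbol><description>Missing</description></flag>
<flag><symbol>E</symbol><description>Estimated</description></flag>
<flag><symbol>NA</symbol><description>Not Available</description></flag>
</legend></climatedata>"""

root = ET.fromstring(sample)

# Map each flag symbol to its description
flags = {
    flag.findtext("symbol"): flag.findtext("description")
    for flag in root.iter("flag")
}
print(flags)
```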

Requests - With Session Data

In []:
import time
import requests

# disclaimer_url, data_url, and params are defined elsewhere
with requests.Session() as s:
    s.post(disclaimer_url, data='I Agree')
    time.sleep(5)
    response = s.get(data_url, params=params)

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input

Data Processing & Transformation

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently

Subprocess Module

In []:
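import subprocess
import time

# Run the model at low priority, with stdin/stdout redirected by the shell
cmd = 'nice -n 19 SOG < infile > outfile 2>&1'

proc = subprocess.Popen(cmd, shell=True)

# poll() returns None for as long as the process is still running
while True:
    if proc.poll() is None:
        time.sleep(30)
    else:
        print('Done!')
        break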
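
The same pattern extends to several concurrent runs. A sketch, with short Python one-liners standing in for the model runs (an assumption made so that the example runs anywhere; in practice each command would launch a model run):

```python
import subprocess
import sys
import time

# Stand-in commands; in practice each would be a model run
cmds = [
    [sys.executable, "-c", "import time; time.sleep(0.2)"]
    for _ in range(3)
]

# Start every run without waiting for the previous one to finish
procs = [subprocess.Popen(cmd) for cmd in cmds]

# Poll until all of the processes have exited
while any(proc.poll() is None for proc in procs):
    time.sleep(0.1)

return_codes = [proc.returncode for proc in procs]
print(return_codes)
```

Popen returns as soon as the process has started, which is what lets the runs overlap.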

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds

Vector and Array Calculations

Lots of libraries for doing scientific calculations

NumPy is generally the foundation
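
A tiny illustration of the kind of array calculation involved. The bloom year-days are made up, and taking the earliest and latest ensemble members as the bounds is an assumption made here, not necessarily how SoG-bloomcast does it:

```python
import numpy as np

# Made-up bloom year-days, one per ensemble member
bloom_days = np.array([88, 92, 85, 97, 90, 94, 89])

# Median as the forecast; earliest and latest members as the bounds
median_day = np.median(bloom_days)
early_bound = bloom_days.min()
late_bound = bloom_days.max()

print(median_day, early_bound, late_bound)
```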

For specific application areas and algorithms:

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
  5. Create time series and depth profile plots

Matplotlib

In []:
import matplotlib.pyplot as plt

# nitrate and diatoms are time series objects loaded from the model results
fig, ax_left = plt.subplots(1, 1)
# Second y-axis that shares the same x-axis
ax_right = ax_left.twinx()

ax_left.plot(nitrate.time, nitrate.values, color='blue')
ax_right.plot(diatoms.time, diatoms.values, color='green')

ax_left.set_ylabel('Nitrate Concentration [uM N]')
ax_right.set_ylabel('Diatom Biomass [uM N]')
ax_left.set_xlabel('Year Day in 2014')

fig.savefig('nitrate_diatoms_timeseries.svg')

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
  5. Create time series and depth profile plots
  6. Render a results commentary and the plots as an HTML page via a template

String Interpolation & Templating

In []:
page_tmpl = """
<h1>Strait of Georgia Spring Bloom Prediction</h1>
<p>
The median bloom date calculated from a
{member_count} ensemble forecast is
{bloom_dates[median]:%Y-%m-%d}
...
</p>
"""

# Inside a format field, dictionary keys are written without quotes
page = page_tmpl.format(
    member_count=len(members),
    bloom_dates=bloom_dates,
    # ...
)

# Open in write mode so that the rendered page can be saved
with open('page.html', 'wt') as f:
    f.write(page)

String Interpolation & Templating

Templating libraries:

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
  5. Create time series and depth profile plots
  6. Render a results commentary and the plots as an HTML page via a template
  7. Push the HTML page to a web site

Subprocess (again)

rsync, scp, sftp, hg, git, ...

In []:
cmd = [
    'rsync',
    '-Rtvhz',
    '{}/./{}'.format(html_path, results_page),
    'shelob:/www/salishsea/data/'
]
subprocess.check_call(cmd)

SoG-Bloomcast - SOG 1-D Model Real-time Forecast

Daily, quasi-operational forecast of the 1st spring phytoplankton bloom in the Strait of Georgia:

  1. Get near real-time forcing data from web services
    • wind, weather, river flows
  2. Process forcing data into format for model input
  3. Run the SOG model 3 (or 30+) times concurrently
  4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
  5. Create time series and depth profile plots
  6. Render a results commentary and the plots as an HTML page via a template
  7. Push the HTML page to a web site

Do all of that while we get on with other research!

Shell Script and Cron Job

Shell script to run SoG-bloomcast:

# cron script to run SoG-bloomcast
#
# make sure that this file has mode 744
# and that MAILTO is set in crontab
VENV=/data/dlatorne/.virtualenvs/bloomcast
RUN_DIR=/data/dlatorne/SOG-projects/SoG-bloomcast/run
. $VENV/bin/activate && cd $RUN_DIR && $VENV/bin/bloomcast config.yaml

Cron entry to trigger the script daily:

MAILTO=dlatornell@eos.ubc.ca
   
BLOOMCAST_DIR=/data/dlatorne/SOG-projects/SoG-bloomcast
# m h  dom mon dow   command
  0 9   *   *   *    $BLOOMCAST_DIR/cronjob.sh

Python Tools and Glue

  • Command Processors for the SOG 1-D and Salish Sea NEMO models
  • Code repo maintenance tool for NEMO codebase
  • Automated testing of SOG 1-D model
  • Packages of functions for re-use and sharing
  • Daily, quasi-operational forecast of the 1st spring phytoplankton bloom in the Strait of Georgia:
    1. Get near real-time forcing data from web services
      • wind, weather, river flows
    2. Process forcing data into format for model input
    3. Run the SOG model 3 (or 30+) times concurrently
    4. Analyze the run results to calculate the forecast bloom date as well as early and late bounds
    5. Create time series and depth profile plots
    6. Render a results commentary and the plots as an HTML page via a template
    7. Push the HTML page to a web site

    Do all of that while we get on with other research!

Part 3 - Visualization of Model Results

In support of the Salish Sea NEMO project we are developing a collection of IPython Notebooks that provide discussion, examples, and best practices for plotting various kinds of model results from netCDF files. They include:

  • Plotting code examples
  • Examples of the use of functions from the salishsea_tools package

The notebooks so far:

The links are to static renderings of the notebooks provided by the nbviewer.ipython.org service.

The notebook sources are in the analysis_tools directory of the public https://bitbucket.org/salishsea/tools/ repo.