A meandering through data science and engineering adventures in Julia, Python, and R.
My name is Alun ap Rhisiart. My background is in zoology; I have a BSc in zoology and a DPhil (the Oxford name for a PhD) in behavioural ecology and evolution theory. My research was around communication, predator-prey behaviour, and spider web building. As part of that I did a lot of work with computers, including building genetic algorithm models of the rules spiders may use to build orb webs. The predator-prey part was filmed for David Attenborough’s The Trials of Life series, and the genetic algorithm work was presented in Richard Dawkins’s book Climbing Mount Improbable.
Eventually, after some years of post-doc research at the universities of Oxford and Basel, I left academia to take up a career as a software engineer and trainer. I worked for IBM, Computer Science International, Toyota, and London Processing Centre, among others. I wrote the first XML course for IBM Learning Services and presented it internationally. I then went on to work for 15 years at JP Morgan Chase on the prime record system for the investment bank. During this time, I worked in Smalltalk, Object Pascal, Objective-C, Swift, Ruby, and Python, amongst others.
Then I made a final transition (and wish I had done it years earlier) into Data Science, Machine Learning and Statistics, and Data Engineering in the education sector. These days I work in Python, R, Julia, and SQL. I spend a lot of time collating data from both internal and public sources, and putting it together in Databricks in a Data Vault 2.0 architecture using dbt and Prefect. I build models using scikit-learn, XGBoost, and Hugging Face Transformers, as well as statistical models in R, and track experiments using Neptune and MLflow.
Along the way I have inevitably run into blocks and had to find ways around them. I hope to share some of the solutions here to aid others hitting the same issues, in partial recompense for all the great support I have received over the years.