
What a tool…my journey learning Python

5 min read

I think I can….

I’ve recently been working on getting my Data Science Certificate from NAIT, a local polytechnic school. In one of our classes, our instructor gave us the Titanic competition from Kaggle so we could experience what an end-to-end machine learning project is like. It has been frustrating, challenging, sometimes daunting, and yet super helpful. Even now, a week after the class, I am still working on my model, trying to push the accuracy higher and improve my ranking on the Leaderboard. And still wondering how on earth so many people got 100% accuracy?

As I went through the data wrangling process, I couldn’t help but get frustrated with how much easier (for me) it would be to do some of the transformations in Excel. Now, I’m not saying it’s actually easier or better in Excel because, let’s be honest, the Titanic training data set is only around 800 rows. This is not exactly “big” data. What I am saying is that my skills in Excel are so much better than my skills in Python that it was hard for me not to switch to a tool with which I have more experience. But I did my first two submissions to Kaggle without giving in to that impulse.

Why would I choose the harder route, you ask? Because the whole purpose of taking the course was to improve my Python skills, not show off my Excel skills. And as hard as it was to struggle and constantly hit barriers – ones I knew exactly how to overcome using Excel – it was also rewarding when I overcame many of them using Python. Now, truth be told, I still haven’t figured out how to impute the missing age values in exactly the way I want, but I’m still working on that.

However….

Once I had made my first two submissions to Kaggle using only my coding skills (which meant excluding the Age feature from my predictive models), I was curious to see whether I could have built a better model using Excel. In particular, I wanted to see if my idea for imputing age would have made a difference. The idea was to create a new feature column, derived from the passenger name, indicating whether the person was married. I would then impute the missing age values with the mean age for each combination of the married and Pclass features. I uploaded this fully wrangled file and applied the same code for my predictive models that I used in my code-only work. And you know what happened? My Excel-wrangled data performed slightly better on the training data set but slightly worse on the test data set. Of course there are lots of potential reasons for this, and it might be different with the next data set, but it got me thinking about some things.
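For what it’s worth, that imputation idea is also fairly compact in pandas. This is a hypothetical sketch, not my actual notebook: the column names follow the Kaggle Titanic schema, the tiny data frame is made-up sample data, and the title-based “married” heuristic is rough.

```python
import pandas as pd

# Made-up sample rows in the Kaggle Titanic schema (Name, Pclass, Age).
df = pd.DataFrame({
    "Name": [
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
        "Moran, Mr. James",
    ],
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, None, 26.0, None],
})

# Rough heuristic: a title like "Mrs." or "Mme." in the Name column
# suggests a married woman. Not a perfect rule, just a derived feature.
df["Married"] = df["Name"].str.contains(r"Mrs\.|Mme\.").astype(int)

# Fill each missing Age with the mean Age of its (Married, Pclass) group.
df["Age"] = df["Age"].fillna(
    df.groupby(["Married", "Pclass"])["Age"].transform("mean")
)
```

With the sample rows above, the missing first-class Mrs. gets the mean age of the other first-class Mrs. entries, and the third-class Mr. gets the third-class unmarried mean.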

So now….

Throughout my career I have tried to be tool or technology agnostic. What does this mean? It means that I will use pretty much whatever tool I need to get the job done. And this can make a difference because cost is often the determining factor, and you won’t always get to use the same tools at different companies. At the end of the day it’s about the results of the work, not being able to brag about using so-and-so tool. (Of course, there are caveats. Depending on what you are trying to achieve, or if you’re building a repeatable process, the subset of tools from which you should choose might matter more.) And my Python vs Excel struggle continued to prove this point.

I am very comfortable in Excel but not comfortable (at all, in truth) with Python….yet. However, my experiment confirmed my need to be tool agnostic. The more I learn Python, the more I will be able to do in it, and the more genuinely “tool agnostic” I can be instead of just reaching for the tool I know better. Right now Python is definitely in the top three most-used tools in data science, and plenty of sites and surveys back up its popularity. I know it sounds like I’m jumping on the bandwagon, but really, the more tools I know how to use, the more tool agnostic I can become. I can choose the best tool for the job rather than be a hammer looking for a nail, as the cliché goes. The Titanic data set was easily wrangled in Excel, but what about a data set with over a million rows? Probably not as easy, in which case I’d want to know a tool like Python.

It’s still a long path, and most days I feel like I’ll never be able to learn fast enough given how quickly the field of data science is progressing, especially with all the data science libraries that work with Python. But to borrow a very powerful phrase that keeps me going when I feel like giving up: “Nevertheless, she persisted.” And so, off I go to keep trying different ways to impute age using Python, and maybe a random forest will be my next predictive model. But then it’s off to the next competition, the next data set, and a new challenge.
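If random forest does end up being the next model, the scikit-learn version is only a few lines. This is a generic sketch under assumptions, not my Titanic pipeline – synthetic features stand in for the real wrangled columns like Pclass, Sex, and the imputed Age.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a wrangled feature matrix (roughly Titanic-sized).
X, y = make_classification(n_samples=800, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A random forest averages many decision trees trained on bootstrap samples,
# which often beats a single tree without much tuning.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

Swapping in the real training frame (and the Kaggle test file for predictions) is mostly a matter of replacing the synthetic `X` and `y`.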