As part of the research for re-submitting the CREATE grant I submitted last year, I have spent quite a bit of time talking to people who work in data science (mostly former astronomers), looking at job ads for “data scientists”, reading blog posts and articles, and so forth. The goal is to figure out what can be added to astronomy graduate students’ training to make them more accomplished users of data in astronomy, and more hireable outside of astronomy should they choose to go that route. My samples of job ads and interviewees are fairly small, but there are enough similarities in what I hear from the different directions that I’m reasonably confident in the results.

I looked at about 10 job ads on LinkedIn for “data scientists” in the Toronto area. The expected educational background varied quite a bit. Some specified a background in computer science or statistics, others expanded this to include mathematics or applied math, or further to engineering or a physical science. One just specified “STEM”. Degrees in machine learning were specified in several ads, which is interesting since there aren’t that many such programs. The University of Alberta has a graduate program and the University of Toronto at Scarborough an undergrad one, but I couldn’t find any others with machine learning in the title.

The number one skill, listed in every job ad, was “excellent communications skills.” An example of this requirement that I thought was particularly clear, and something that echoed what I heard from interviewing people in industry, was “ability to communicate complex quantitative analysis in a clear, precise, and actionable manner.” Technical skills that appeared in most of the ads included SQL, Python, and statistics or machine learning. Many of the ads included mention of R, HDFS/Hadoop, and Spark. Mentioned less frequently were data visualization, MATLAB, Excel, and cloud computing.

The people I interviewed also mentioned Python, SQL, and statistics/machine learning frequently. The other technical skills mentioned above also came up in a handful of interviews. Several interviewees were of the opinion that job ad skill lists tend to be aspirational rather than realistic; possession of at least some of them, plus the proven ability to learn new skills quickly (which we might hope that a graduate education provides), was sufficient in their opinion. One skill that wasn’t mentioned in the job ads but was in most interviews was software engineering, including the ability to work in teams on a computational project, using version control (especially git and GitHub), and writing re-usable code. (One interviewee told me that people in industry write lousy code “just like in academia”, so this wasn’t universal.) What did seem pretty universal was the idea that astronomers in particular need to get more used to the idea of applying existing tools rather than writing their own; our tradition of re-inventing the wheel for our own private use is no longer sustainable.

What do physicists or astronomers bring to data science? Experience of working with large amounts of real data, some of which may be noisy or unusable, is vital. Solving large, open-ended problems using quantitative methods is probably the most important thing. The big differences are that in industrial data science the timescales are usually faster, and the solutions don’t have to be as perfect, but they do have to be actionable. Of course problem-solving and working with data are not unique to physics or astronomy; lots of other areas of science provide similar backgrounds.

For further reading, the profiles of Jessica Kirkpatrick, Sean Farrell, and Adam Hill all have interesting insights.