Data science has come a long way from the early days of the Knowledge Discovery in Databases (KDD) and Very Large Data Bases (VLDB) conferences. The software engineers who handled databases in the 1980s and 90s evolved into specialized database engineers in the 2000s. Meanwhile, pockets of computer scientists in smaller research labs experimented with machine learning and artificial intelligence. Big data and smart algorithms collided in a Cambrian explosion in the 2010s, which gave us “Data Scientist: The Sexiest Job of the 21st Century”. That brings us to a decade later, post-pandemic 2022, asking the question, “Is Data Scientist Still the Sexiest Job of the 21st Century?”

Why are you writing this article?

Pardon the short cutaway, but this article is written in conjunction with the 2022 Noonie Awards. HackerNoon’s 2022 Noonie Awards celebrate the technical writers sharing their best and brightest insights on all things tech.

A formal introduction:

Hi, I’m Liling. By day, I am an applied scientist at Amazon; after work, I code open source and write tech articles on natural language processing, and sometimes articles on gaming pop culture.

It is a joy and honour to be nominated for HackerNoon Contributor of the Year in the Natural Language Processing (NLP) category. If you have enjoyed the NLP or Machine Translation content that I’ve been sharing, help smash the vote button at https://www.noonies.tech/2022/programming/2022-hackernoon-contributor-of-the-year-natural-language-processing

As a tech writer, I love to share emerging technologies in machine learning, and I have a particular soft spot for language and translation related technologies. To celebrate the nomination, I’m writing up this article in an “Ask Me Anything” questions and answers format. Learn more about my thoughts and opinions on “what kind of a scientist am I?” in the tech industry in the following sections.

Back to the “Sexiest Job in the 21st Century”

Nowadays, job descriptions for “data scientists” come in different forms, and they fall broadly under these titles:

Data Scientist
Research Scientist
Applied Scientist
Data Engineer
Research Engineer
Machine Learning (ML) Engineer

If you ask anyone about the difference between the roles and responsibilities of the different job titles, you will most probably end up with a vague line that delineates each of them. In reality, it is usually a fuzzy, overlapping scope of work that differs based on the company’s and team’s role definitions. The major difference usually lies between the “Scientist” and “Engineer” roles: the scientist is usually expected to focus more on the data and model quality side of things, while the engineer focuses more on model integrity and service reliability.

Q: What is data or model quality?

This is usually the responsibility of the “scientists”. In the industry, this is specific to the different tasks and applications that the team supports and/or develops. It is similar to academic researchers building machine learning models, but in the industry the practicality of whether the final model is usable usually trumps the need to beat state-of-the-art results.

Data quality tasks usually involve:
- What open source data can you use to train/improve the model?
- Who owns internal data sources that you can use to train/improve the model?
- How to extract, transform, store and load the data to fit the model?
- How to improve the quality and size of the data?
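To make the extract, transform, store and load question concrete, here is a minimal sketch using only the Python standard library. The inline CSV, the `examples` table and the cleaning rules (lowercasing, whitespace normalization, dropping empty rows and duplicates) are all assumptions made for illustration; real pipelines depend heavily on the task and on where the data lives.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; in practice this would come from a file
# or an internal data source owned by another team.
RAW_CSV = """text,label
Hello world ,greeting
hello world,greeting
,greeting
Good bye,farewell
"""


def extract(raw: str) -> list:
    """Read the CSV rows into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list) -> list:
    """Normalize the text, drop empty rows and remove duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Lowercase and collapse repeated/trailing whitespace.
        text = " ".join(row["text"].lower().split())
        if not text:
            continue  # skip rows with no usable text
        record = (text, row["label"])
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def load(records: list, conn: sqlite3.Connection) -> None:
    """Store the cleaned records in a (hypothetical) `examples` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS examples (text TEXT, label TEXT)")
    conn.executemany("INSERT INTO examples VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
records = transform(extract(RAW_CSV))
load(records, conn)
print(records)
```

Of the four raw rows, only two survive the cleaning step, which is a small reminder that data size and data quality often pull in opposite directions.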

Model quality tasks usually involve:
- Finding the right algorithm or network architecture to use to solve the task
- Defining/Refining the evaluation framework used to evaluate the task/application
- Improving the model performance based on a defined evaluation metric/framework
- Optimizing the speed and performance tradeoff for the algorithm to make the model usable in production
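As a rough illustration of weighing model quality against speed, the sketch below scores two toy rule-based “models” on a tiny labelled set and picks the fastest one that clears an accuracy bar. The evaluation set, both models and the `MIN_ACCURACY` threshold are hypothetical stand-ins; a real evaluation would use the task’s own data and metric.

```python
import time

# Toy labelled data (hypothetical): text paired with a sentiment label.
EVAL_SET = [
    ("great product", "pos"),
    ("really great", "pos"),
    ("bad service", "neg"),
    ("not great", "neg"),
]


def keyword_model(text: str) -> str:
    """Cheapest baseline: positive if the word 'great' appears anywhere."""
    return "pos" if "great" in text else "neg"


def negation_model(text: str) -> str:
    """Slightly more careful baseline that also handles 'not'."""
    tokens = text.split()
    if "not" in tokens:
        return "neg"
    return "pos" if "great" in tokens else "neg"


def evaluate(model, data):
    """Return (accuracy, average seconds per example) for a model."""
    start = time.perf_counter()
    predictions = [model(text) for text, _ in data]
    elapsed = time.perf_counter() - start
    correct = sum(pred == gold for pred, (_, gold) in zip(predictions, data))
    return correct / len(data), elapsed / len(data)


MODELS = {"keyword_model": keyword_model, "negation_model": negation_model}
MIN_ACCURACY = 0.9  # hypothetical bar agreed with stakeholders

results = {name: evaluate(fn, EVAL_SET) for name, fn in MODELS.items()}
# Keep only the models that are accurate enough, then prefer the fastest.
usable = {name: res for name, res in results.items() if res[0] >= MIN_ACCURACY}
best = min(usable, key=lambda name: usable[name][1]) if usable else None

for name, (accuracy, latency) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, latency={latency:.2e}s/example")
print("selected:", best)
```

Filtering on quality first and then choosing by latency is only one possible policy; some teams fix a latency budget first and maximize quality within it.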

Q: What is model integrity and service reliability?

This is usually the responsibility of the “engineers”. Reliability is critical to any modern machine learning application. It is important to make sure that the scientists’ carbon-emitting efforts to produce the best model for the customers/users deliver the expected performance in production.

A scientist’s “it works on my laptop” statement is unacceptable in the industry and engineers help to make “it works, anywhere” a dream come true.

Model integrity tasks usually involve:
- Building and maintaining the framework to automate model training and deployment
- Making sure features/improvements made in experimental projects are available in production models
- Incrementally automating experimental setups to reduce/eliminate manual steps in bringing scientists’ models to production

Service reliability tasks usually involve:
- Setting up alerts and monitoring users’ application usage, and detecting if/when the machine learning model fails/breaks
- Specifying and limiting users’ access to the model to comply with internal/national/regional regulations
- Scaling the service to handle a growing number of users and load
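The alerting item can be sketched as a sliding-window error-rate check. This is only an illustration: the `ErrorRateMonitor` class, its window size and its threshold are hypothetical, and a production setup would normally rely on a dedicated monitoring stack rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the outcome of the most recent requests and flag high error rates."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # A deque with maxlen silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        """Record whether a single model request succeeded."""
        self.outcomes.append(success)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        """True when the error rate is strictly above the threshold."""
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for success in [True] * 8 + [False] * 2:
    monitor.record(success)
alert_before = monitor.should_alert()  # 2 failures out of 10 is not above 0.2

monitor.record(False)  # the oldest success drops out of the window
alert_after = monitor.should_alert()  # 3 failures out of 10 is above 0.2
print(alert_before, alert_after)
```

Using a fixed-size window keeps the check focused on recent behaviour, so an old burst of failures does not keep the alert firing forever.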

Nowadays, these engineering responsibilities are sometimes known as Machine Learning Operations (MLOps). Chip Huyen has a good blog post that gives an overview of MLOps for aspiring ML/Data/Research engineers.

There are many other definitions of what machine learning, data, applied, research scientists/engineers do but the above is from my personal industry experience.

Q: Should I go for Scientist or Engineer?

It depends! As discussed earlier, it varies from company to company and team to team, so always ask the hiring manager about the expected responsibilities during the job application process.

A good scientist should be able to do some engineering tasks. Conversely, a good engineer should be able to build some machine learning models.

Personally, as a scientist, this is the advice that I give to aspiring/new scientists:

- Knowing some backend/frontend engineering helps
- Know what’s possible, what’s easy and what’s hard for the engineers
- Learn from engineers (Docker, databases, cloud, app design/dev)
- And let engineers learn what you do

And a final note that I always try to remind myself of:

P/S: An engineer might train a better model than a scientist does.

Q: Let’s talk practical, is there a difference between Data, Research or Applied Scientist?

In terms of roles and responsibilities, they are similar, but in practice some companies might have a clear demarcation between the different scientist positions. So always ask the human resources (HR) personnel or the hiring manager whether it’s possible to share the “role guidelines” specific to the position you are applying for. It is especially important to understand what is expected of your role once you have joined the company and team.

Q: Yeah, that’s all nice and good about tech and career, but tell me more about the dough ($$$ difference in practical terms) for data, research or applied scientists!

I’m personally a “practicalist” in most cases, but when it comes to “the dough”, https://www.levels.fyi/ and asking friends/seniors in the companies are your best bets for learning more about a company and its compensation.

My personal opinion:

“Don’t do it for the money” is over-rated. Do it for the love of doing it. I enjoy looking at numbers and the language data, thus NLP. But remember to get paid enough for doing it =)

Onwards from the career discussion, now the tech part!

I’ve discussed the differences between scientists and engineers in the machine learning field, and now I’ll try to answer a pressing question that almost all scientists would ask:

Q: I have problem X, which tool/method Y should I use to solve it?

This is usually the worst form of Stack Overflow question as per the “How to ask a good question” guide, but I think it is something that the community should try to answer whenever we can.

My personal opinion:

None of these practical questions is a “bad” question or one that “needs more focus”. But they do inevitably sometimes attract malicious product/tech advertising.

Here’s my 10-step approach to answering “X problem, Y approach” as a “scientist”:

1. Literature review
   - The more you read, the more tools you have at hand
   - But limit your time to avoid rabbit holes, maybe try “Paper-Blitzing” =)
2. Know what datasets are available and what’s in them (noise, quirks, etc.)
3. Find which evaluation metric task X is usually evaluated on
4. Track the oldest relevant citation of the task, read that paper
5. Find the highest cited paper for the task, use that as your baseline
   - Whenever possible, hunt down the datasets in that highest cited paper and the latest shiniest paper
6. Define your success criteria for the task industrially (it might not be the standard eval metric for the task)
7. Try to replicate or reimplement the baseline
8. Communicate your model/libraries to engineers. Can your engineer productionize it?
9. Did the baseline meet the success criteria? Ask the business/project stakeholder whether it’s sufficient
10. Build it, test it, break it, repeat!

Q: Wait a minute, does that mean that there is no “one true algorithm/tool Y” that I can learn to solve task X?

That’s right, there isn’t.

From personal experience, the tool/model that makes it into your customers’ hands usually depends heavily on Steps 6 to 9 of the approach described above.

Q: What’s next in Machine Learning and NLP (that you’re personally excited about)?

At the moment, I’m spending my free time learning about Hugging Face 🤗, not just how to use the different components of the library but more so understanding what features make it a success and what X-factor made it gain traction in the machine learning community.

And the next thing that I would invest my time into is quantum ML, if I have even more time =)

So long and thank you for the fish!

I hope the above Qs and As give you some insight into “what kind of a scientist I am”. And if there are more burning questions you want to ask, feel free to leave a comment under the post.

Finally, I want to give a huge thanks to the HackerNoon community, staff and sponsors for the Noonie Awards nomination. If you enjoyed this article, help smash the vote button at https://www.noonies.tech/2022/programming/2022-hackernoon-contributor-of-the-year-natural-language-processing

What Kind of Scientist Are You?