The honest truth about what it means to work as a data scientist


Learn what it's really like to work as a data scientist and how to make it more interesting.


Data science has been called the sexiest job of the 21st century by some people, who have obviously never seen the inside of a fire station.

And sure, working as a data scientist can be extremely rewarding. You tinker with shiny new machine learning algorithms, think deeply about interesting problems, and your work has a direct impact on the company’s bottom line. By discovering new revenue sources, optimizing costs and using analytics, you can boost and accelerate your organization’s growth.

But boy oh boy, does data science look different on the frontline.

Let’s take a peek behind the curtain to see what the ‘average Wednesday’ of a data scientist actually looks like.

1. Data is messy. Like, really messy.

Once you move from the world of academia and Kaggle competitions into industry, something strikes you hard: no one prepares you for how messy real-life data can be.

Seasoned practitioners have been grinding their teeth over this problem for years. Yet by commonly cited estimates, data scientists still spend around 80% of their time gathering and cleaning data.

Why is this?

Well, there are a couple of reasons:

Machine learning algorithms need specific data formatting. When preparing data for machine learning algorithms (versus reporting in Business Intelligence), you need to structure data in a specific way. If you don’t, the model cannot accept the raw input, let alone learn from it.
Shuffling responsibility to the last mile. ETL pipelines process the data from its raw form to its usable form. During an ETL, you Extract data from different sources, Transform (clean) the data, and Load it into a database. But unless we build in constraints and tests at each stage of the pipeline, we will end up with messy data. It’s left to the frowning data scientists, who are the last ones to touch it, to clean the data if they want to use it.
Data cleaning is a long process. Quite frankly, there are a lot of things that you need to check when cleaning data. From missing values and type coercions to measures of centrality and dispersion, the list is extensive. It takes a long time to go through it all if you want to ensure high data quality, especially if you do not automate the process and have to repeat it by hand every time.
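To make that checklist concrete, here is a minimal sketch of what automating a few of those checks could look like in plain Python. The function names (`to_float`, `profile_column`) and the sample column are made up for illustration, and a real pipeline would likely cover many more checks than these.

```python
import statistics
from typing import Any, List, Optional


def to_float(value: Any) -> Optional[float]:
    """Try to coerce a raw value to float; return None if it is missing or unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "":
        return None
    try:
        return float(text)
    except ValueError:
        return None


def profile_column(raw_values: List[Any]) -> dict:
    """Summarize one numeric column: missing values, failed coercions, mean, and spread."""
    missing = sum(1 for v in raw_values if v is None or str(v).strip() == "")
    numbers = [x for x in (to_float(v) for v in raw_values) if x is not None]
    return {
        "rows": len(raw_values),
        "missing": missing,
        # Present but not convertible to a number, e.g. "n/a"
        "unparseable": len(raw_values) - missing - len(numbers),
        "mean": statistics.mean(numbers) if numbers else None,
        "stdev": statistics.stdev(numbers) if len(numbers) > 1 else None,
    }


# Hypothetical raw column, as it might arrive from a CSV export.
prices = ["10.0", " 12.0", "", None, "n/a", "14.0"]
report = profile_column(prices)
print(report)
```

Running a report like this at each pipeline stage, rather than only at the very end, is one way to stop the cleaning work from piling up in the last mile.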

2. Getting usable data takes longer than expected.

Messy data is just one part of the equation. Gathering data is the other side of the 80% coin.

And here, we’re not just talking about designing the ETL pipeline that helps you to obtain the necessary data for your work. Sure, some data scientists are lucky enough to have data engineers who build ETL pipelines for them, but that’s not the case for everyone.

Even when the ETL is set up, there are other challenges involved in getting usable data. On some level, this is expected: in the data science hierarchy of needs, collecting data sits at the base of the pyramid, so it naturally takes up a bigger chunk of your time.

On the other hand, there are data governance challenges which are difficult to overcome:

Locating data. The majority of companies have data dispersed across different silos. Oftentimes, a data science task starts by talking to multiple stakeholders to figure out where the data is even located within the company. Time is wasted locating data sources because data is not centralized into a single system.
Joining data. When working with multiple data sources, data scientists need to perform the data-joining ritual, which entails sacrificing a… no, it doesn’t. But it does take an awfully long time to join disparate data sources, decide on the data model which is ultimately needed, and work out which data is duplicated and needs to be removed.
Understanding data. Whenever you work with a new dataset, there are column names and values which are unclear. Unless your company deploys a data catalog, with explanations for each field and what its values mean, you will spend a lot of time scratching your chin trying to work out the difference between the columns “acquisition_date” and “date_of_acquisition”.
Tracing lineage. For many tasks, it’s important to understand where data comes from and where it goes. GDPR compliance is just one example. However, lineage is often not traced, and it is seldom recorded exhaustively. Consequently, a lot of time goes into figuring out the data journey through your company.
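To give a feel for the joining step, here is a small sketch in plain Python that removes exact duplicates from one source and then left-joins it to another. The two "silos", their field names, and the helper functions are all hypothetical, and the join assumes the key is unique in the right-hand source.

```python
from typing import Dict, List

# Hypothetical records from two silos; the field names are made up for illustration.
crm_rows = [
    {"customer_id": "C1", "email": "ana@example.com"},
    {"customer_id": "C2", "email": "bo@example.com"},
    {"customer_id": "C2", "email": "bo@example.com"},  # exact duplicate
]
billing_rows = [
    {"customer_id": "C1", "plan": "pro"},
    {"customer_id": "C3", "plan": "free"},
]


def deduplicate(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop exact duplicate rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def left_join(left: List[Dict[str, str]], right: List[Dict[str, str]], key: str) -> List[Dict[str, str]]:
    """Keep every left row and add the right-hand fields when the key matches."""
    lookup = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    return [{**lookup.get(row[key], {}), **row} for row in left]


customers = left_join(deduplicate(crm_rows), billing_rows, key="customer_id")
# Rows that found no partner in the billing data are worth a closer look.
unmatched = [row["customer_id"] for row in customers if "plan" not in row]
print(customers)
print(unmatched)
```

Even in this toy case there are decisions to make: whether near-duplicates count as duplicates, and what to do with customers that exist in one system but not the other.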

3. Rinse and repeat. Rinse and repeat. Rinse and…

All repetitive work and no play makes Jack a dull data scientist.

Unfortunately, running the entire data science pipeline (ETL > data understanding and gathering > data cleaning > (finally) modeling) is a laborious process. This is a horrible pain, given that a lot of data science is based on exploration and experimentation.

Unlike practitioners in some other fields of engineering, data scientists find the best algorithm for the job by running multiple variations, tuning hyperparameters, experimenting with new features, and tinkering in general.
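As a toy illustration of that loop, the sketch below tries a handful of values for one hyperparameter and keeps the one with the lowest validation error. The model (a single-slope ridge regression with a closed-form fit), the data, and the grid of values are all invented for the example. Real experiments involve far bigger models and datasets, which is exactly why slow cycles hurt.

```python
from typing import List


def fit_ridge_slope(xs: List[float], ys: List[float], lam: float) -> float:
    """Closed-form slope for y ≈ w * x with an L2 penalty of lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of the prediction w * x against y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the training targets are a little inflated compared to validation.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.4, 4.3, 6.5, 8.6]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

# One "experiment" per candidate value of the regularization strength.
results = {}
for lam in [0.0, 1.0, 2.0, 5.0]:
    w = fit_ridge_slope(train_x, train_y, lam)
    results[lam] = mse(w, val_x, val_y)

best_lam = min(results, key=results.get)
print(results)
print("best lam:", best_lam)
```

Each pass through the loop is cheap here. When every pass first has to wait for data to be gathered and cleaned, the same loop can take days.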

Imagine if a structural engineer built bridges by experimenting and building multiple versions until the best one was found. We probably wouldn’t be so happy to finance those projects. And yet the processes and tools given to data scientists push them exactly in this direction - spending long cycles waiting for data before they can run the experiments needed to reach viable business conclusions.

4. Developer tools are not ready for data scientists (let alone for data scientists working on a project together)

Since its inception, data science has been the love child of three fields: mathematics, software engineering, and domain expertise.

The intersectional nature allows data scientists to see beyond the blind spots of each field and come up with creative ideas to solve business challenges. But as with any family, there are bound to be some conflicts.

Data scientists often use tools that work great for software developers, such as Git for versioning and collaboration. But anyone who has ever pushed a Jupyter Notebook to master will know that Git is a poor choice for versioning and sharing notebooks (try reading this blame message).
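Part of the reason is that a notebook file is one big JSON document that stores cell outputs and execution counters next to the code, so rerunning a cell changes the file even when the code stays the same. A common workaround is to strip outputs before committing. Below is a minimal sketch of the idea, assuming the standard nbformat 4 layout; the `strip_outputs` helper and the tiny notebook are made up for illustration, and dedicated tools such as nbstripout handle this far more thoroughly.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook with outputs and execution counts cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook, standing in for a real .ipynb file loaded with json.load().
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["df.describe()"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["...lots of output..."]}
            ],
        },
    ],
}

cleaned = strip_outputs(notebook)
print(json.dumps(cleaned, indent=1, sort_keys=True))
```

This makes diffs readable, but it also throws away the results, which leads straight to the next problem.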

Even when we find other ways of collaborating on Jupyter Notebooks, the crucial problem remains: there are no smart tools for sharing data.

Why does this matter? Because sharing data is the cornerstone of doing data science work. Sharing data allows you to:

Rerun another data scientist’s work to check if it reproduces as expected.
Share the results of a complex analysis without having to rerun the entire analysis.
Use the previous work (data) as a baseline for you to continue and improve upon.
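For the first point, one lightweight way to check that a rerun reproduces a colleague's result is to compare fingerprints of the output data rather than eyeballing it. The sketch below hashes rows in a canonical, order-independent form. The `fingerprint` helper and the sample rows are hypothetical, and exact hashing assumes the values match bit for bit, so floating-point results may need rounding first.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(rows: List[Dict[str, object]]) -> str:
    """Hash rows in a canonical form so the same data gives the same digest in any row order."""
    serialized = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()


# Hypothetical results: the original run, a faithful rerun, and a rerun that drifted.
original = [{"id": 1, "segment": "a"}, {"id": 2, "segment": "b"}]
rerun = [{"segment": "b", "id": 2}, {"segment": "a", "id": 1}]
drifted = [{"id": 1, "segment": "a"}, {"id": 2, "segment": "c"}]

print(fingerprint(original) == fingerprint(rerun))    # same data, different order
print(fingerprint(original) == fingerprint(drifted))  # one value changed
```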

This is especially painful during times when remote work is becoming the rule rather than the exception. Having tools and processes which support collaboration is paramount for a data scientist.

Is there a bright side?

Before you start doom prepping and looking at job boards for a career change (orange picking in Sardinia sure does sound fun), let’s look at things in a different way. Every problem we discussed is a challenge that needs solving.

Being aware of these problems puts you ahead of them. You can start solving them.

And this is the approach you should take:

Centralize all of your data into a single platform, so that you don’t have to waste time looking for it.
Leverage a data catalog to help you understand the data and its lineage.
Automate cleaning tasks, so that you can set and forget them.
Opt for a data platform which allows you to work in parallel with your colleagues and share data.

If you'd like to start implementing the above approach immediately, you are welcome to create a free account with Keboola.
