What does a Data Scientist Do?
Data scientists play a multifaceted role and have a profound impact on industries in the digital age. They work out which questions need answering and where to find the relevant data. However, executives and colleagues do not always understand what data scientists can and should do. Here we delve into data scientists’ day-to-day tasks: diagnosing business needs, collaborating with data and IT professionals, handling data, and developing models for production. Through three day-in-the-life stories, we explore how data scientists extract valuable insights and develop cutting-edge solutions, making them indispensable in today’s businesses.
Responsibility of Data Scientists
In a data team, the key responsibility of data scientists is to leverage critical and statistical thinking to explore and distil business value from data. Data scientists are expected to understand the business need, sort out the often messy and ill-structured data, and turn the data into valuable and actionable insights. Instead of focusing only on specific technical steps, data scientists often need to consider the bigger picture: “What business problems can data help resolve?”, “Are our data fit for purpose? Can we find more data?”, “What analyses can we do with the data? What are the costs, benefits, and risks of the resulting business problem solutions?”
With the bigger picture in mind, data scientists often play an overarching role that spans from discovering business use cases, finding and analysing relevant data, to reporting insights and productionising solutions. Usually, data scientists’ main duties are to explore and prepare data, and to experiment with solutions. They are expected to understand different data sources and the corresponding technologies needed to handle the data. They are tasked with finding the suitable statistical and machine learning tools to explain past data and predict future patterns. And they are responsible for presenting the resulting insights and solution ideas to business stakeholders, clients, and internal tech teams.
Therefore, data scientists usually work closely with business stakeholders, data analysts, data engineers, and machine learning engineers in each step of crafting a data science solution. It is not to say that data scientists are responsible for getting hands-on in everything along a data mission. Rather, data scientists are expected to be mindful of all the steps from data to solutions, and are responsible for aligning their analytical tasks at hand to both upstream and downstream needs.
In the next section, let’s dive deeper into what a data scientist might actually do during a day of work.
A Day as Data Scientist
Story 1: A consulting, client-facing data scientist focused on insight analysis
Meet Charlie – Charlie is a consulting data scientist who has recently been assigned to a new client. The client is a large energy company that wants to improve its forecasting model for electricity demand, which it uses to inform its production schedule strategy. The company has been collecting new data about its customers that it believes will help improve the model’s accuracy. The project will involve determining whether this new data can be used to better predict electricity demand, so that the company knows whether further investment in this data collection will reduce the costs caused by production schedule inefficiencies.
9 AM – 9:30 AM
Prepare for a client meeting: Charlie starts his day off much like many other professionals – catching up on emails and messages from his team and clients, and answering anything that requires an urgent response. He spends the rest of the half hour getting ready for the upcoming client meeting, going over the project brief and reading up on the client’s business.
9:30 AM – 12 PM
Attend a project kickoff meeting: This is a project kickoff meeting – so Charlie makes sure to introduce himself to the rest of the team and any key stakeholders. During the meeting, Charlie aims to achieve the following:
Understand the value that this project is bringing to the client: In this case, Charlie learns that a large amount of money is lost due to operational inefficiencies caused by inaccurate forecasts of electricity demand.
Understand the deliverables and end users: Charlie makes sure that the team and client agree on the final deliverables, which include a detailed assessment report on the model improvement potential and ROI of the new data source. End users of the deliverables are identified to be the main planning team.
Determine any limitations that may affect the reliability of any insights: He learns of some potential issues with missing / inaccurate data due to issues the client had in the past with faulty equipment.
Determine the timeline of the project: The client provides a timeline of 12 weeks, with sample data being provided today and full access to the data to be organised over the next week. Charlie arranges a quick chat with the client’s technical team to work out how the sample will be provided.
Understand data architecture: To understand the data he will be receiving, Charlie begins reading documentation provided by the client on their databases. He familiarises himself with the meaning of important attributes, and makes sure to understand how they were collected and any caveats that come with the data’s accuracy. He also works with the data engineers on his team to understand how he will be accessing the data – in this case, he learns the client stores their data in a cloud-based data warehousing solution, which he will need to gain access to.
Preliminary analysis: Charlie receives an email from the client containing data in the form of a collection of .csv files. This is the sample data containing historical electricity demand values, as well as other related attributes. For an initial look into the data, Charlie writes code inside a ‘notebook’ which allows him to develop and run chunks of his analysis in sequence (for this project, he is using Python inside a Jupyter notebook environment). He begins working on the following tasks:
Data Ingestion: Charlie creates a data ingestion pipeline that imports the various .csv files into the notebook environment.
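Such an ingestion step might look roughly like the following — a minimal sketch assuming pandas, with the directory path and provenance column being illustrative assumptions (the client’s actual file layout isn’t described):

```python
from pathlib import Path

import pandas as pd


def ingest_csv_files(data_dir: str) -> pd.DataFrame:
    """Read every .csv file in a directory and stack them into one frame."""
    frames = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        df = pd.read_csv(csv_path)
        df["source_file"] = csv_path.name  # keep provenance for later checks
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Keeping ingestion in one function makes it easy to re-run the whole pipeline when the client later supplies the full dataset.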
Data Quality Assessment (DQA): Charlie performs an initial check into the quality of the data. The goal here is to ensure that the data is accurate, complete, and consistent. This involves checking for missing values, outliers, and inconsistencies. Charlie finds a few issues that can be fixed by imputing missing values, standardising data formats, and removing irrelevant data. He then transforms these steps into encapsulated functions and adds them into the data pipeline.
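The encapsulated cleaning functions might look something like this — a sketch with hypothetical column names (`timestamp`, `demand_mw`) and simple median imputation; the actual fixes would depend on the issues Charlie found:

```python
import pandas as pd


def standardise_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    """Parse timestamp strings into datetimes; drop rows that cannot be parsed."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    return df.dropna(subset=["timestamp"])


def impute_missing_demand(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing demand readings with the column median."""
    df = df.copy()
    df["demand_mw"] = df["demand_mw"].fillna(df["demand_mw"].median())
    return df


def run_dqa_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Apply each encapsulated cleaning step in sequence."""
    for step in (standardise_timestamps, impute_missing_demand):
        df = step(df)
    return df
```

Because each step is a pure function, the same steps can later be dropped into the production data pipeline unchanged.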
Exploratory Data Analysis (EDA): Charlie performs a preliminary analysis of the data to gain a better understanding of its structure and characteristics. This involves computing basic statistics, creating data visualisations, and identifying patterns and trends. By plotting electricity demand, Charlie gets a better understanding of the behaviours expected (he sees the shape that a daily demand curve follows, as well as the presence of several large outliers) and sees a few obvious correlations between demand and other attributes such as public holidays, location and weather. Overall, this step helps Charlie make informed decisions about how to proceed with his analysis in the future and ensures that he understands the underlying processes that are generating the data.
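Two of the EDA checks described above — the shape of the daily demand curve and the presence of large outliers — could be sketched like this, again assuming the hypothetical `timestamp` and `demand_mw` columns:

```python
import pandas as pd


def daily_demand_profile(df: pd.DataFrame) -> pd.Series:
    """Average demand by hour of day, revealing the daily curve shape."""
    return df.groupby(df["timestamp"].dt.hour)["demand_mw"].mean()


def flag_outliers(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.Series:
    """Mark readings more than z_thresh standard deviations from the mean."""
    z = (df["demand_mw"] - df["demand_mw"].mean()) / df["demand_mw"].std()
    return z.abs() > z_thresh
```

In a notebook these summaries would typically be plotted (e.g. with matplotlib) and eyeballed alongside attributes such as weather and public holidays.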
Client update and next steps: Towards the end of the day, Charlie writes a short email detailing a list of questions he uncovered about the data and sends it off to the client. He also figures out a list of next steps:
- Organise access to the full dataset. Charlie will need to figure out whether he can query the data in its entirety, or whether he first needs to write SQL queries to transform the data into a form suitable for analysis.
- Functionalise pre-processing steps and add them into the data pipeline.
- Develop hypotheses to test, based on SME advice and the results of the EDA.
- Perform statistical techniques to test these hypotheses. Charlie might iterate on his hypotheses several times to generate useful business insights regarding electricity demand prediction.
- Interpret the results and communicate them to the client. This might include writing a report and creating a slide deck.
Charlie’s project will continue for weeks. Based on past experience, he factors in buffer time for potential data access delay. He also anticipates multiple rounds of data analysis and client discussion meetings, which are commonly needed to address evolving stakeholder concerns and requests. The project will be concluded with Charlie and the team presenting a final assessment report.
Story 2: A data scientist developing and delivering an NLP product service
Jessica is a data scientist working with her company’s natural language processing (NLP) product team. The team builds custom API services for extracting information, such as names, addresses, and contact details, from scanned documents. The team has developed the core services and has been serving a few client companies. Jessica and her product/ML engineer teammates are now delivering a receipt-handling solution tailored to a new client. It is now mid-Week 6 of a 12-week engagement with the client. The solution requires identifying and replacing personal information (DE-PII) in the documents, and developing a custom machine learning model that extracts key fields from the receipts.
9 AM – 9:15 AM
Work with clients to get data access: Jessica kicks off the day happy to see that, after weeks of negotiation and waiting for the client’s legal approval, the client has finally provided access to the raw receipt PDF files. Today is the day to apply the DE-PII module to clean the raw documents, so that Jessica can retrieve the redacted files and start training the machine learning model.
9:15 AM – 10:30 AM
Data transformation and pre-processing: Jessica logs into the client company’s system, and executes her deployed Python DE-PII module that detects and replaces personal information in the receipt documents. The module works fine and produces redacted files. However, she finds that certain address fields are not replaced properly due to an unexpected special document layout. After investigating the module’s source code and some experimentation, Jessica identifies a few potential quick fixes. She documents her findings to discuss with her product engineer teammates.
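Jessica’s actual module isn’t shown, but in spirit a DE-PII redaction step replaces detected spans with labelled placeholders. The sketch below uses two illustrative regexes; a production module would combine a trained NER model with far more robust rules:

```python
import re

# Illustrative patterns only -- a real DE-PII module would use a trained
# NER model plus much more robust rules than these two regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact_pii(text: str) -> str:
    """Replace each detected PII span with a labelled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

A layout-dependent failure like the one Jessica hit would typically show up here as a pattern that never fires on text extracted from the unusual documents.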
10:30 AM – 12 PM
Collaborate with engineers to implement data processing modules: In the daily stand-up, Jessica calls out the DE-PII module issues. Nathan, the product engineer on the team, agrees to work with her to resolve them right after the call. Jessica and Nathan jump into an hour-long session to fix the issues. They walk through Jessica’s experiment findings and discuss the new feature requirements. Nathan agrees that he needs to update a post-processing component in the DE-PII module and then re-deploy it. They also agree on the format of the updated module’s output.
1 PM – 2 PM
Design machine learning pipeline and experiments: While Nathan is implementing the fixes, Jessica turns her focus to preparing the machine learning model training pipeline. She configures the team’s established ML training package, to enable up-sampling of receipt types that are under-represented in the available data set. Also, she plans to run multiple ML experiments, with or without the receipts of the special document layout, to assess the impacts on model performance.
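Jessica’s team’s training package isn’t shown, but the up-sampling she configures might conceptually look like this pandas sketch, with `receipt_type` as a hypothetical label column:

```python
import pandas as pd


def upsample_receipt_types(df: pd.DataFrame, label_col: str = "receipt_type",
                           random_state: int = 42) -> pd.DataFrame:
    """Resample each under-represented receipt type (with replacement)
    until every type matches the size of the largest one."""
    target = df[label_col].value_counts().max()
    balanced = [
        group.sample(n=target, replace=True, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(balanced, ignore_index=True)
```

Running the experiments with and without the special-layout receipts, as planned, would then show whether the layout helps or hurts model performance.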
2 PM – 5 PM
Engage with client stakeholders: Jessica meets the client’s data analytics team, which has been newly engaged in this project. It is a scheduled workshop to explain the technical details of the solution, and to socialise ideas to gain trust and collect feedback. The workshop helps the analytics team form reasonable expectations of the NLP service, and sparks discussion about potential new business use cases.
Train and evaluate machine learning models: Not long after the client workshop, Nathan tells Jessica that the DE-PII tool has been re-deployed and is ready to go! Jessica runs the tool over the entire receipt data set, and is happy to see that most documents are now accurately de-identified. She sends the redacted receipts to the ML development cloud environment, and completes the data cleaning and preprocessing tasks there. Finally, she executes the planned model training pipelines. Five model experiments are kicked off at the same time, each taking several hours. Jessica can’t wait to check the training outcomes and evaluate model performance and behaviour the next morning.
Over the remaining weeks of the engagement, Jessica will experiment with various ML modelling and feature engineering ideas to boost model performance (extracting information from receipts with high accuracy). Luckily, she has access to an established cloud-based ML experiment tracking pipeline, which is a great productivity booster. In the final week, Jessica will provide a live demo and train the client’s tech team to start using the deployed NLP service.
Story 3: Data Scientist working with Machine Learning Engineer (MLE) in productionising a solution
Tom is a data scientist who works for an E-commerce company. For the past couple of months, he has been developing a model that will be used to deliver personalised offerings to individual customers. The model is based on customer data such as past purchases and browsing behaviour. He has recently wrapped up his model development process, with the final model reaching a satisfactory level of performance. The next stage of the project is to deploy the model into production. The company wants to use A/B testing to determine whether the personalised offerings actually result in a higher add-to-cart rate compared to the existing website.
9 AM – 10 AM
Containerise model: Although Tom has already finalised the input features, architecture and hyperparameters of his model, he still needs to package it into a container to ensure it can run across different environments and be ready for deployment. To do this, he first pulls all the code related to preprocessing and model prediction out of the Jupyter notebook he has been working in and moves it into separate Python scripts. He packages the model code and its library dependencies into a container image that allows other developers to easily run the model (this is achieved using the tool Docker). He makes this image available to the MLE team for deployment.
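The notebook-to-script refactor might yield a module along these lines — a sketch in which the feature names (`last_purchase`, `total_spend`), the `model.pkl` path, and the `as_of` parameter are all illustrative assumptions rather than Tom’s actual code:

```python
import pickle

import pandas as pd

MODEL_PATH = "model.pkl"  # hypothetical artifact baked into the image


def preprocess(raw: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Reproduce the notebook's feature engineering exactly, so that
    serving-time inputs match what the model saw during training."""
    feats = raw.copy()
    feats["days_since_last_purchase"] = (
        pd.to_datetime(as_of) - pd.to_datetime(feats["last_purchase"])
    ).dt.days
    return feats[["days_since_last_purchase", "total_spend"]]


def predict(raw: pd.DataFrame, as_of: str):
    """Load the trained model from the image and score preprocessed rows."""
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    return model.predict(preprocess(raw, as_of))
```

Keeping preprocessing in its own function is what makes the container reproducible: the same transformation runs identically wherever the image is deployed.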
10 AM – 12 PM
Design A/B test: Tom now needs to design an A/B test for the personalised offering model. This is an experiment where the new personalised offerings will be served to a random selection of website users, whereas the other users will receive offerings as they are currently. He performs the following:
Identify the goal: Before anything else, Tom defines the goal of this project as increasing the average number of purchase completions per user session.
Define the metrics to be measured: For this A/B test, Tom defines that the primary metric to measure will be the ‘add-to-cart rate’. This is the percentage of visitors that add at least one item to their cart during a session. The hypothesis is that the new personalised offerings will result in a higher add-to-cart rate.
Figure out the sample size: From discussions with the marketing team, Tom determines how long the experiment needs to run in order to reach the sample size required for confidence in the results. He then documents his A/B test execution plan in a report that he will deliver to the engineering and marketing teams, as well as other key stakeholders.
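The sample size per arm can be approximated with the standard two-proportion power formula — a sketch assuming SciPy, where the baseline and expected add-to-cart rates are hypothetical figures (Tom’s would come from the website’s historical data):

```python
from math import ceil

from scipy.stats import norm


def sample_size_per_group(p_baseline: float, p_expected: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect a lift in add-to-cart
    rate from p_baseline to p_expected (two-sided two-proportion z-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for significance level
    z_beta = norm.ppf(power)           # critical value for desired power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)
```

Dividing this figure by the site’s daily eligible traffic gives the experiment duration Tom would take to the marketing team.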
1 PM – 5 PM
Work with MLE to create a production pipeline: Tom hops into a meeting with the MLE team to design a production pipeline. This involves creating a scalable, reliable, and efficient system that can serve the model results to the web application. In the meeting, they discuss the requirements of the architecture:
- It must be able to generate near-real time predictions during a user’s session.
- It needs to be served only to a selection of users to satisfy the A/B testing requirements.
- The model needs to be automatically retrained daily on new data.
- Performance needs to be continuously tracked through monitoring/logging.
In their meeting, they decide on the tools/frameworks that they will use to satisfy the above requirements. They need to take into account the existing technology stack being used at their organisation. They draw out architecture diagrams and write documentation to justify and detail their choices.
Tom and the MLE team will present their recommended plan to the leadership team (both technical and business domain) to get sign-offs. The deployment and A/B testing will take several months to complete. Once the testing results are available and analysed, Tom will present the results to leadership and discuss if the new personalisation model should be rolled out at a larger scale.
The many faces of Data Scientists
We can see that our three data scientists share an underlying skillset; however, the way it is applied can differ significantly depending on context (e.g. industry, organisation and project). Some of the key differences are outlined below:
In-house vs. consulting DS
It is clear that there are some significant differences in the type of work done by in-house (Tom) versus consulting (Charlie, Jessica) data scientists.
Stakeholder management: We can see that Charlie needs to spend time meeting with his client, understanding the client’s needs, clarifying the problem and communicating his ideas. In contrast, Tom already has a much deeper understanding of what is needed at his company. Although he still needs to collaborate with other departments, there is less of an emphasis on interpersonal skills compared to consulting data scientists. Because of this, in-house data scientists like Tom can often spend more of their time on deep technical work than consultants can.
Problem domains: Consultants often have to adapt to new problem domains. Charlie had to quickly learn about the electricity sector to gain the context required for his project. On the other hand, Tom has worked at his E-commerce company long enough to develop a familiarity with the kinds of problems encountered in that space.
Technologies: In a similar way, consultants often have to deal with new technology stacks when they start a project. Charlie has to spend some time familiarising himself with the technologies used by his client. This exposes him to a wide range of tools used in the data world, whereas Tom has the opportunity to become a deep expert in the technologies used at his company.
Project duration: While an in-house Data Scientist like Tom can often see a project go all the way from inception to the productionisation of the model, Charlie’s projects are on average going to be shorter, and have a more defined scope with specific deliverables and deadlines to meet. For instance, Charlie’s project is split into several phases – with the current phase focused on insights as a precursor to future model development phases.
Insight analysis vs. Product development
Another contrast is between data scientists who are focused on insight analysis (Charlie) versus those who are developing a product (Jessica).
Generating insights requires a data scientist to dig deep into the data, uncovering patterns and trends that can help inform business decisions. The nature of this work is largely exploratory – we saw Charlie spend his time getting familiar with the data by working in a Jupyter notebook. In fact, sometimes a data scientist may not even know exactly what kind of insights can be extracted from the data before working with it. To understand what kind of insights would actually be valuable to the business, Charlie had to work closely with the stakeholders to understand the business requirements and define research questions. On a technical side, this sort of work emphasises the statistical analysis, visualisation and communication aspects of data science.
A data scientist assigned to developing a product is responsible for building a data-driven product or feature that meets the needs of customers or end-users. Integrating new features into the larger product requires some expertise in engineering and software development practices such as CI/CD. For example, to deliver the DE-PII feature for her company’s NLP product, Jessica needed to work closely with engineers such as Nathan. Compared to Charlie’s, Jessica’s work is often more well-defined, with clear targets and requirements, rather than the freedom to explore the data.
Differences between data engineer, data scientist and MLE
The above stories highlight the different roles that a data engineer, data scientist and MLE play in a project, as well as where they may interact and collaborate on a problem. We saw that a data scientist is more likely to interact with data engineers in the earlier stages of the project. For example, Charlie worked with data engineers to figure out how he would be accessing the data for model development, and is likely to work with them to implement any transformations he used during his modelling process into the data pipeline.
On the other hand, the MLE plays an important role after the data scientist has completed their model development work. We saw Tom begin to work closely with the MLE team to figure out the best way to get the model he developed into production. It is important for the data scientist to remain involved in the project, as they can provide key information about the features that the model is expecting, as well as any performance thresholds that the model should be meeting. For example, Tom may advise the MLE on the kind of retraining strategy that should be implemented.
Overall, each role has an important part to play in the lifecycle of a data-driven project.
- Data engineers build and maintain the infrastructure needed to store and process data
- Data scientists analyse and model data to derive insight
- MLEs deploy and optimise machine learning models in production environments.
It is common to expect a significant amount of interaction between the roles, as well as overlap in skillsets.
We have explored the typical duties and responsibilities that might be assigned to a data scientist. Throughout our three data scientists’ days, we have seen examples of the following key tasks:
- Data collection
- Cleaning and preprocessing
- Exploratory data analysis
- Data visualisation
- Statistical analysis
- Model development
- Communication and collaboration
- and much more.
Although the exact focus and emphasis of these tasks varied, each story demonstrated how the data scientist applied this core skill set to deliver value to their organisation. We have also seen how data scientists often work closely with other roles (data engineers, MLEs) to produce a quality outcome.
Ultimately, it is the ability to use data effectively to drive insights and inform decision-making that ties all data science roles together. By leveraging data science techniques and tools, businesses can gain a competitive edge and drive growth and profitability. With the amount of data collected around the globe growing rapidly every day, the importance of data scientists will only continue to increase.