The Data Science Process
May 11, 2023
DataSciencePursuit
What is the Data Science Process?
The data science process outlines the standard steps that data scientists follow when tackling problems. It is also known as the data science life cycle since each step informs and guides the next. The data science process is typically iterative, meaning that data scientists may go back and forth between some steps as needed. This unit summarizes these steps.
Before we dive in, let’s think back to the previous unit: What is Data Science? You may have realized that data scientists primarily analyze and prepare data for machine learning models. They are like teachers in that they understand the subject (domain knowledge), prepare instructional material (prepare data), teach (train the model), and test students (evaluate the model). If all test results are poor, then the teacher may need to revisit their teaching methods or material.
The Data Science Lifecycle
Note: there is no need to memorize the details of each step. This summary will be here for future reference as you go through courses or projects.
1. Problem Definition
This is a crucial first step, the planning stage of the process. Data scientists work with customers to understand the business problem, identify relevant data sources, and outline the project goal(s).
Customers have different perspectives and challenges. Some are not sure what they want or what data science can do. Without clarity, data scientists risk answering the wrong question. Domain knowledge is valuable here, but if the data scientist is new to this business, communication skills and curiosity come in handy too. Data scientists need to ask the right questions to understand their customers’ needs.
2. Data Acquisition
The next step is to get the data needed to solve the problem. Depending on the data sources, data extraction can be simple or complex. If the required data comes from multiple sources, the various datasets may also need to be aggregated or joined. Examples of data sources include MS Excel files, text documents, databases (SQL or NoSQL), websites, and images.
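As a rough sketch of what joining datasets from multiple sources can look like, here is a minimal example using pandas (a Python library assumed here purely for illustration; the column names and values are made up):

```python
import pandas as pd

# Two hypothetical sources: one with student ages, one with addresses.
students = pd.DataFrame({"id": [1, 2, 3], "age": [16, 18, 16]})
addresses = pd.DataFrame({"id": [1, 2, 3], "address": ["Urban", "Rural", "Rural"]})

# Join the two datasets on the shared "id" column into one table.
combined = pd.merge(students, addresses, on="id")
print(combined)
```

Real extraction may instead involve SQL queries, API calls, or web scraping, but the idea of combining sources on a shared key is the same.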
To keep things simple, the data science fundamentals learning path will cover data extraction from a CSV file (a comma-separated, table-like format with rows and columns).
Here is an example of the data you will be working with:
| School | Age | Address |
|--------|-----|---------|
| GP     | 16  | Urban   |
| MHS    | 18  | Rural   |
| GP     | 16  | Rural   |
In data science, a row (excluding the table headers) is referred to as an observation or record, and a column is a variable or feature.
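A table like the one above could be loaded and inspected as follows, using pandas (assumed here for illustration). Each row of the loaded table is an observation, and each column is a feature:

```python
import io
import pandas as pd

# An inline CSV standing in for a file on disk,
# e.g. pd.read_csv("students.csv").
csv_data = io.StringIO(
    "School,Age,Address\n"
    "GP,16,Urban\n"
    "MHS,18,Rural\n"
    "GP,16,Rural\n"
)
df = pd.read_csv(csv_data)

print(df.shape)           # (rows, columns): 3 observations, 3 features
print(list(df.columns))   # the feature names
```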
3. Exploratory Data Analysis (EDA) and Data Cleaning
We combined these steps because data scientists may go back and forth between EDA and data cleaning as each step influences the other. Moreover, although it is typical to start with EDA, there are situations where it may be necessary to start with data cleaning.
Exploratory Data Analysis
At this step, data scientists gain an understanding of the data by using statistics (data summarizations) and visualizations. It is important to be curious and ask the right questions to sufficiently examine the data. If anything seems interesting or concerning, it is crucial to question why or how it could have happened and determine ways to conduct further investigation.
EDA mainly involves:
- Finding any data quality issues such as errors, missing values, duplicates, or inconsistencies, and determining or planning the best way to deal with them.
- Identifying preliminary patterns or insights that may add value when modeling.
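A minimal EDA pass on a toy dataset might look like this (pandas is assumed for illustration; the data is invented and deliberately contains one missing value and one duplicated row):

```python
import numpy as np
import pandas as pd

# A toy dataset with a missing age and a fully duplicated row.
df = pd.DataFrame({
    "school": ["GP", "MHS", "GP", "GP"],
    "age": [16, 18, np.nan, 16],
    "address": ["Urban", "Rural", "Rural", "Urban"],
})

print(df.describe())          # summary statistics for numeric columns
print(df.isna().sum())        # count of missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
```

Visualizations (histograms, scatter plots, box plots) would typically accompany these summaries; the statistics alone are just the starting point.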
Data Cleaning
During EDA we uncover data quality issues and make a plan to resolve them. In the data cleaning step, we implement that plan to improve data quality. Machine learning model success heavily depends on data quality, hence the popular phrase, ‘garbage in, garbage out.’ Using unclean data can lead to unreliable insights. Since most real-world data is unclean, data cleaning is crucial to obtain the best results.
Note: I generally explore data to find data quality issues first, clean the data, and then explore it again to find patterns.
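Continuing the toy example, the cleaning plan from EDA (drop the duplicate row, impute the missing age) might be implemented like this. The dataset and the choice of median imputation are illustrative assumptions, not a universal recipe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "school": ["GP", "MHS", "GP", "GP"],
    "age": [16, 18, np.nan, 16],
    "address": ["Urban", "Rural", "Rural", "Urban"],
})

# Remove exact duplicate rows.
clean = df.drop_duplicates()

# Impute the missing age with the median of the remaining ages.
clean = clean.fillna({"age": clean["age"].median()})
print(clean)
```

How to handle each issue (drop vs. impute, mean vs. median, and so on) depends on the data and the problem, which is why the plan is made during EDA.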
4. Modeling
This is the fun part; data scientists train machine learning models and evaluate their performance.
Modeling involves:
- Feature engineering: Create new variables from existing ones that could enhance the model. Creating good features generally requires domain knowledge.
- Data partitioning: Split the data into training and testing data sets.
- Model training: Apply appropriate models on the training data and tune the models to get better results.
- Model evaluation: Assess model performance on the testing set and interpret model results.
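The partitioning, training, and evaluation steps above can be sketched end to end. This example uses scikit-learn and synthetic data purely for illustration; the model choice (a decision tree) is an assumption, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic feature matrix and labels standing in for a cleaned dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Data partitioning: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model training on the training partition only.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Model evaluation on the unseen testing partition.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on held-out data, rather than the training data, is what reveals whether the model generalizes, much like testing students on questions they have not seen before.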
5. Presentation
To gain acceptance, data scientists need to communicate results in a way that their customers can understand. They must also be able to show the value of their analysis. It is best to tell an engaging story with the data by using visualizations.
6. Deployment
The final step is to implement the model in the real world, where it can be used to make decisions or predictions. Deployment may also involve monitoring the model's performance over time.
Flexibility in the Data Science Process: Adapting to the Real World
The data science process does not always look the same. It can be flexible depending on the type of data analysis (analytics) and the organizational structure.
Organizational Structure
In smaller teams or organizations, data scientists likely perform all the steps of the data science process. They may even perform tasks beyond the data science process.
However, in larger teams or organizations with more specialized roles, data scientists may only perform certain steps in the data science process. For example, data engineers can handle data extraction and cleaning, while data scientists focus on analysis and modeling. Similarly, business analysts can bring their domain knowledge to help define the problem and interpret the results. Effective collaboration among data teams ensures that the data science life cycle is properly followed.
Type of Analytics
There are four main types of analytics that data scientists perform:
1. Descriptive Analytics involves finding patterns in data to report what happened in the past. This type of analytics skips the modeling step.
2. Diagnostic Analytics goes a step further to understand and explain why something happened. This type of analytics, and the types that follow, involves all steps in the data science process.
3. Predictive Analytics goes a step further to predict the future.
4. Prescriptive Analytics goes a step further to recommend the best course of action for the predicted outcomes.
Conclusion
The data science process outlines several steps that help data scientists extract insights from large, complex datasets. From defining the problem to presenting the results, each step is critical to the project’s success. It is important to remember that the process may require revisiting and revising each step. Additionally, different analytics and organizations may require different steps or variations to the process.
Most of these steps are optimally done using programming, so the natural next step is to learn to program. But first, we will explore what programming is.