Understanding Data Science

June 27, 2023

DataSciencePursuit

This page summarizes the key takeaways from our Introduction to Data Science course.

What is Data Science?

Data science is a field that focuses on extracting knowledge and insights from data to solve problems.

Data scientists do this using a combination of computer science, mathematics, and knowledge of the business area they are analyzing. They also need good communication skills.

Using data, data scientists can:

  1. Tell you what happened in the past (Descriptive Analytics).
  2. Explain why it happened (Diagnostic Analytics).
  3. Predict future outcomes (Predictive Analytics).
  4. Advise the best course of action based on the predicted outcomes (Prescriptive Analytics).

These four types of analytics build on each other and increase in complexity.
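To make the difference between the first and third types concrete, here is a minimal Python sketch. The yearly departure counts are made up for illustration, and the straight-line trend is just one simple way to make a prediction, not a recommended model.

```python
from statistics import mean

# Hypothetical data: number of employees who left in each year (made up).
departures_by_year = {2019: 12, 2020: 15, 2021: 18, 2022: 21}

years = list(departures_by_year.keys())
counts = list(departures_by_year.values())

# Descriptive analytics: summarize what happened in the past.
average_departures = mean(counts)

# Predictive analytics: fit a straight line with least squares
# and extend it one year into the future.
year_mean = mean(years)
count_mean = mean(counts)
slope = sum(
    (y - year_mean) * (c - count_mean) for y, c in zip(years, counts)
) / sum((y - year_mean) ** 2 for y in years)
intercept = count_mean - slope * year_mean
predicted_2023 = slope * 2023 + intercept

print(f"Average departures per year: {average_departures}")
print(f"Predicted departures in 2023: {predicted_2023:.0f}")
```

The descriptive step only summarizes existing data, while the predictive step builds on that same data to estimate a value that has not been observed yet.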

Data Science Lifecycle: with Example

One application of data science is employee retention analysis, which examines how well a company keeps its employees. We will use this example to explain the data science lifecycle (the steps data scientists take when solving a problem).

Let’s call the company Company X. The lifecycle has six steps:

  1. Problem Definition: work with the customer to understand the business problem and define clear project goals. Domain expertise and curiosity can help data scientists avoid answering the wrong question. In our example, this involves determining what Company X is mainly interested in. Do they want to know their current or past retention levels, or what causes employee turnover in their company? Do they instead want to predict turnover or get recommendations on how they could retain their employees? The data scientists will also ask where they can get data to do their analysis. This leads us to the next step.
  2. Data Acquisition: collect data needed for the analysis. For our example, this may include employee demographics, job titles, departments, years of service, performance scores, salary, employee engagement survey responses, and exit interview data. Additionally, we may consider gathering external data, such as industry benchmarks or economic indicators, if relevant.
  3. Exploratory Data Analysis (EDA) and Data Cleaning: analyze the data to identify and resolve data quality issues and find preliminary patterns. Cleaning involves addressing data quality issues like duplicates and missing values so that the results are reliable. In our example, we clean the data, calculate the retention rate for Company X overall and for each department, and look for patterns. For instance, we may discover that certain departments have higher turnover rates than others. This step provides descriptive analytics, giving us a clear picture of employee retention. If necessary, we then proceed to more advanced analytics in the next step.
  4. Modeling: build machine learning models that can diagnose problems, make predictions, and prescribe the best next steps. For Company X, models could tell us why the company or different departments have been losing or keeping employees, predict future retention rates, and determine what the company could do to improve employee retention. Once we have our insights, we share them in the next step.
  5. Presentation and Visualization: present findings or recommendations in a way that the client can understand. Effective communication will help Company X implement a plan to improve employee retention.
  6. Deployment: implement the model or model recommendations and monitor outcomes. In our example, we may deploy the predictive model to predict whether future employees are likely to leave or stay. Based on the insights gained, Company X may also implement the prescribed strategies. This could involve adjusting compensation and benefits or improving safety and work-life balance.
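The cleaning and descriptive analytics in step 3 can be sketched in a few lines of Python. This is only an illustration: the records, the field names (employee_id, department, left_company), and the helper functions are hypothetical, and dropping incomplete rows is just one of several ways to handle missing values.

```python
from collections import defaultdict

# Hypothetical employee records for Company X (made up for illustration).
records = [
    {"employee_id": 1, "department": "Sales", "left_company": False},
    {"employee_id": 2, "department": "Sales", "left_company": True},
    {"employee_id": 2, "department": "Sales", "left_company": True},  # duplicate row
    {"employee_id": 3, "department": "Engineering", "left_company": False},
    {"employee_id": 4, "department": "Engineering", "left_company": False},
    {"employee_id": 5, "department": None, "left_company": False},  # missing department
    {"employee_id": 6, "department": "Sales", "left_company": True},
]


def clean(rows):
    """Drop rows with a missing department and repeated employee IDs."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["department"] is None or row["employee_id"] in seen_ids:
            continue
        seen_ids.add(row["employee_id"])
        cleaned.append(row)
    return cleaned


def retention_rate_by_department(rows):
    """Return the share of employees in each department who stayed."""
    stayed = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["department"]] += 1
        if not row["left_company"]:
            stayed[row["department"]] += 1
    return {dept: stayed[dept] / total[dept] for dept in total}


cleaned_records = clean(records)
rates = retention_rate_by_department(cleaned_records)
for department, rate in sorted(rates.items()):
    print(f"{department}: {rate:.0%} retained")
```

With this sample data, the script reports 100% retention for Engineering and 33% for Sales, which is the kind of department-level difference the EDA step is meant to surface. On real datasets, a data analysis library such as pandas is a common choice for the same operations.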

Programming

When working with large amounts of data, most of these steps are best done with code. Data scientists most commonly use the Python or R programming languages.

If you are a beginner, we advise you to focus on learning one programming language. Once you know one, it will be easier to learn the other later if needed. If you are unsure which programming language to pick, stay tuned for a future article on R vs. Python for data science.

Next Steps

We have come to the end of our Introduction to Data Science course. We hope you now have a good understanding of what data science is about. Next, you can take a programming course.