What is Data Science?
May 1, 2023
DataSciencePursuit
The Definition Of Data Science
Have you ever heard the quote “knowledge is power”? Good-quality data is one of the most precious resources because you can learn a lot from it through analyses.
Data science is a field that focuses on analyzing and drawing valuable insights from data to solve different problems. Data scientists can recognize patterns and make predictions based on past data to inform data-driven decisions.
Core Skills Data Scientists Need
- Domain knowledge is knowledge of the specific subject, field, or business area the data scientist is working on. This knowledge is gained through education or experience, curiosity, and asking the right questions. Domain knowledge helps data scientists understand the problems they are trying to solve. They need to understand the problem to know where to start and to avoid answering the wrong question. Domain knowledge also helps data scientists make sense of the findings from their analysis.
- Statistics is a branch of mathematics focused on effective ways to collect, analyze, and present data. It is a good chunk of the data science process. The discipline is much older than data science, so why not borrow techniques instead of reinventing the wheel?
- Probability tells us how likely it is that something will happen. This is important when making predictions. A higher probability would give people more confidence in the results.
- Programming is the act of giving commands/instructions to a computer for it to complete a task. Vast amounts of data are available through the internet, satellite images, surveys, customer data that companies collect, etc. Computer science, particularly programming, is necessary to analyze this data at scale. We will cover what programming is, in detail, in the last section of this course.
- Machine learning models are computer programs that allow us to use data to “train/teach” computers to identify insights/patterns. Once trained, the models can be used to make predictions. Computer science, together with statistics and probability, produced this great innovation. Data scientists use their knowledge to prepare and provide good-quality data for the machine learning model, just like teachers who give the best information to their students to help them succeed. We will learn more about machine learning in future courses.
- Communication skills are important because data scientists generally need to share their findings with technical or non-technical people. If data scientists cannot communicate their findings well, then no one may be able to use those findings to make any decisions. All that work will go to waste.
Examples of Data Science Projects
Data scientists can work in a variety of industries. Here are a few examples of data science projects:
- Spam filters: to automatically filter out unwanted emails.
- Fraud detection, for example, financial fraud like stolen credit cards or false insurance claims.
- Content recommendation, for example, Netflix shows and YouTube videos you may like.
- Customer or employee retention analysis to help understand why a company may be losing or keeping its customers or employees.
- Car insurance claim prediction.
- Optical character recognition: convert handwritten and scanned files or images to text.
- Object detection: how self-driving cars can identify objects like other cars, stop signs, and pedestrians. This includes knowing what the objects are and where they are located.
Hopefully, you were excited or intrigued after reading these examples. For those who may have felt discouraged, let’s expand on some projects to demystify them. We will first discuss how anyone could solve these problems on a small scale through ‘simple’ analysis. This is a good foundation for a data scientist and something to remember: the way we solve some day-to-day problems is often similar to the way data scientists work. Then we will emphasize why data scientists may need more tools and knowledge on top of that base skill. We will also explore these projects’ positive impact on businesses and people, which should show you why data science is valuable.
Without further ado, let’s expand on three of these examples.
Spam Filters
If you have an email account, you likely get spam. Last year my sister got a spam email that she didn’t know was spam. She clicked a link, put her password in, and got hacked. After she got her account back, I showed her what to look for. For those who do not know, here is my two cents (trust me, this will be relevant):
- A general salutation like “Dear Friend, madam, or sir” (they may also use the username part of your email)
- Misspellings and grammatical errors
- Creation of a sense of urgency with various scenarios from immigration to low memory left on your email account
- Anything to do with money or prize winnings, for example, requests for money from strangers or notification that you have won the lottery or some item
- Emails identified as spam before, etc.
We can figure out all these rules or tells by analyzing emails. Analysis can sound intimidating, but it can be as simple as just reading the emails and spotting any differences between spam and not-spam emails (also known as ham). It helps, of course, if you know which ones are spam and which are ham. We help email companies by reporting emails we believe are spam.
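The tells listed above can be written down as simple checks. This is a minimal sketch, not how any email company actually filters spam: the phrase lists, the function names, and the threshold of two tells are all assumptions made up for illustration.

```python
# Hypothetical phrase lists, chosen only to illustrate the tells described above.
GENERIC_GREETINGS = ("dear friend", "dear sir", "dear madam")
URGENCY_PHRASES = ("urgent", "immediately", "act now")
MONEY_PHRASES = ("lottery", "prize", "you have won")


def spam_tells(email_text):
    """Return the names of the tells found in an email."""
    text = email_text.lower()
    tells = []
    if any(phrase in text for phrase in GENERIC_GREETINGS):
        tells.append("generic salutation")
    if any(phrase in text for phrase in URGENCY_PHRASES):
        tells.append("sense of urgency")
    if any(phrase in text for phrase in MONEY_PHRASES):
        tells.append("money or prize")
    return tells


def looks_like_spam(email_text, threshold=2):
    """Flag an email once it shows at least `threshold` tells (an assumed cutoff)."""
    return len(spam_tells(email_text)) >= threshold


suspicious = "Dear Friend, you have won the lottery! Reply immediately to claim your prize."
ordinary = "Hi Sam, are we still meeting for lunch on Friday?"

print(spam_tells(suspicious))
print(looks_like_spam(ordinary))
```

Hand-written rules like these are easy to read, but someone has to keep the phrase lists up to date as spam changes.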
Data Scientists
Data scientists approach the problem similarly. They try to spot the differences between spam and ham emails so they can flag the spam. The difference is how they do this. If there were only a few emails, manually spotting the differences would be good enough, and many more people could do the job. However, according to Google, over 300 billion emails were sent and received daily in 2021. Going through them manually would take either a long time or a lot of people, which is costly. So, programming knowledge is required. Data scientists train machine learning models to identify these rules and filter emails for users almost instantly.
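To make "training a model" a little more concrete, here is a rough sketch of one classic technique, a naive Bayes classifier, written from scratch. It counts how often each word appears in emails already labeled as spam or ham, then uses those counts to guess the label of a new email. The four training emails are invented for illustration, and real spam filters are far more sophisticated than this.

```python
import math
from collections import Counter


def tokenize(text):
    """Split an email into lowercase words."""
    return text.lower().split()


def train(labeled_emails):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for text, label in labeled_emails:
        email_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, email_counts


def predict(model, text):
    """Return the label ("spam" or "ham") with the higher naive Bayes score."""
    word_counts, email_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(email_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from how common this label is overall.
        # (Assumes the training data contains at least one email of each label.)
        score = math.log(email_counts[label] / total_emails)
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not rule a label out.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training examples; a real filter learns from a very large number of emails.
training_emails = [
    ("win a free prize now", "spam"),
    ("claim your free lottery money", "spam"),
    ("lunch meeting on friday", "ham"),
    ("your project report is attached", "ham"),
]
model = train(training_emails)

print(predict(model, "free prize money"))   # spam
print(predict(model, "meeting on friday"))  # ham
```

Notice that nobody typed in a list of suspicious words here. The program worked them out from the labeled examples, which is why reporting spam helps.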
So, what positive impact do spam filters have? I’ve gotten 152 spam emails and 60 ham in the past month. Having users filter all those emails by themselves makes for a poor customer experience. The happier the customer, the more they will use email and the more money email companies make.
Content Recommendation
If a friend asks you to recommend movies or TV shows, what would you do if you wanted to give them good recommendations? If you don’t already know, you may ask them what genre of shows they like and have liked in the past. From this, you can give them a likely personalized recommendation. Or you tell them you have no idea because you do not like the same stuff.
You can also refer them to someone who likes the same shows they like. This is a good analysis process.
Data Scientists
Now let’s say you were a data scientist at Netflix, and you needed to do this for its many customers. This would make it easy for people to find shows they like, so they would keep using Netflix, which means Netflix continues to make a profit. You would think about this problem in a similar way to the smaller-scale example above. You need a process to go through the large amount of data that Netflix has and develop a working recommendation system relatively quickly. Statistics and computer science can help.
Machines are not smarter than us, but they are much faster when it comes to large data. Using statistical methods, we can teach machines to look at what shows and movies Netflix customers have watched and then use that vast amount of data to recommend shows.
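The "refer them to someone who likes the same shows" idea can be sketched in a few lines. This is only an illustration and not Netflix's actual system: the viewers and show titles are made up, and the similarity measure used here (the share of titles two viewers have in common, known as Jaccard similarity) is just one simple choice.

```python
def jaccard(a, b):
    """Share of titles two viewers have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories):
    """Suggest titles watched by the viewer most similar to `target`."""
    watched = histories[target]
    others = [name for name in histories if name != target]
    most_similar = max(others, key=lambda name: jaccard(watched, histories[name]))
    # Recommend only what the target has not already seen.
    return sorted(histories[most_similar] - watched)


# Hypothetical viewing histories.
histories = {
    "ana": {"Space Saga", "Robot Wars", "Star Lab"},
    "ben": {"Space Saga", "Robot Wars", "Moon Base"},
    "cat": {"Baking Duel", "Garden Rescue"},
}

print(recommend("ana", histories))  # ['Moon Base']
```

Ana and Ben share two shows, so Ben's remaining show is recommended to Ana. With three viewers you could do this in your head. With millions, you need a computer.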
Optical Character Recognition
We are taught to read as kids (if able). We learn by seeing lots of examples. If there are similarities between letters, we spot the differences. In other words, we are looking for patterns. As we get older, we may do this subconsciously.
Data Scientists
The idea is to train/teach a computer to recognize words and letters in a similar way to how people do. This requires knowledge of how computers work, of logic, mathematics, and statistics, and a good understanding of how people think.
How computers work: images are made of tiny squares/dots called pixels. See the image below; the pixels are easier to see on curves. Since pixels are small, I needed to zoom in on the image quite a bit to see them (for this image, I used Microsoft Paint).

The colors in pixels can be converted to something a computer can understand. Computers can then look at the distribution of colors across the pixels and detect what shape is in the image. With this knowledge, we can tell the computer what letter the shape represents to train it to identify those letters in other images. The more examples we give it, the better it gets at converting images to text. Luckily machine learning models have already been made to direct computers on how to solve this problem. Data scientists will need to prepare and provide good data (in this case, images) for the model.
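As a toy illustration of the idea, the sketch below stores each image as a tiny 3x3 grid of pixels (1 for dark, 0 for light) and recognizes a new image by finding the labeled example whose pixels differ the least. This is plain nearest-example matching on invented data, not a real machine learning model for optical character recognition, but it shows how pixels turned into numbers can be compared.

```python
# Hypothetical labeled examples: each letter is a 3x3 grid, 1 = dark pixel, 0 = light pixel.
EXAMPLES = {
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
    "I": ["010",
          "010",
          "010"],
}


def to_pixels(rows):
    """Flatten the rows of an image into one list of pixel values."""
    return [int(ch) for row in rows for ch in row]


def pixel_distance(a, b):
    """Count the positions where two images have different pixels."""
    return sum(x != y for x, y in zip(a, b))


def recognize(rows):
    """Return the letter whose example image is closest to the given image."""
    pixels = to_pixels(rows)
    return min(
        EXAMPLES,
        key=lambda letter: pixel_distance(pixels, to_pixels(EXAMPLES[letter])),
    )


# A slightly smudged "L": the bottom-right pixel is missing.
smudged_l = ["100",
             "100",
             "110"]

print(recognize(smudged_l))  # L
```

Real images have far more pixels and far more variation in handwriting, which is why many examples and a proper model are needed.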
Once trained, computers can essentially read documents. This means a lot of manual processes can be automated, which can save a lot of time. Optical Character Recognition, combined with text-to-speech technology, can enable computers to read labels, menus, and books to the visually impaired.
Conclusion
Data scientists extract valuable insights from data to solve various problems. They do this using a combination of computer science, statistics, and knowledge of the area they are analyzing.
Next, let’s look at an overview of The Data Science Process.