Factors Essential To Creating a Data-Set
This is my first article on Medium; I am really excited! I started working on this after my team and I decided to explore web scraping and data extraction.
I started researching the factors essential to creating a proper data-set, so the data I scraped would be insightful. I couldn't find an article that addressed this issue directly, so I wrote one.
Special thanks to Olaoluwakiitan Olabiyi, who guided me through with her sharp writing skills, breaking complex ideas into simple words. You can check out the data I extracted on Kaggle.
Ensure you leave a clap if you find this article insightful 😊.
So, let’s properly introduce the subject.
Introduction
Data to machine learning is like water to life. The world of machine learning (and artificial intelligence at large) requires data to find the patterns within a specified domain. What we call learning!
This data is stored and used for training and evaluating machine learning models, resulting in a model capable of performing a narrow (single) task seamlessly.
This article explains the essential factors every beginner data engineer should examine when creating data-sets.
The Purpose of Data-Sets
One could say a data-set is simply a collection of related data (source: Wikipedia). This data can be passed through learning algorithms to draw conclusions/hypotheses about a domain problem.
As humans, we immediately process most of the information we attain during our daily activities. Our eyes, nose, ears, and skin act as input signals to the brain, performing functions on all the information obtained. Thus, our reactions are the output of the processed data.
Is this the same with machine learning systems? Certainly not!
The data extracted for a machine learning program must be stored first (as a data-set), and then processed to create a model that can generalize to new instances. Only then can a machine learning system mimic what we achieve in an instant.
Bottom line, the fundamental purpose of every data-set is to collect data that can be used to address a domain problem.
Let’s have a look at the factors to consider when creating our data-set.
Factors to Consider When Creating a Data-Set
It is important to note that web scraping and other data extraction methods exist to create data-sets. Thus, the goal isn’t just to extract data, but to extract data that addresses a domain problem.
If you scrape data for educational purposes, that’s fine, but it should still follow these pointers if the data is to be considered useful.
1. Address a Problem
The very first question to ask when creating a data-set is: what problem am I addressing?
A valuable data-set is not the one with the largest volume of data, but the one that can address a problem. Volume only matters once a problem has been addressed. How do you know a data-set is addressing a problem?
Well, ask questions!
What machine learning models hope to achieve with data-sets is to find patterns in the data. Consider the House Prices — Advanced Regression Techniques competition on Kaggle. This data-set aims to address the price estimation of houses. It does this by describing the various factors responsible for the cost of a house (the feature columns).
This data-set is as useful as the problem it addresses. Due to this, it becomes possible to explain the major factors responsible for houses being expensive or cheap. As a result, we can generalize to new instances given the feature columns presented.
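To make this concrete, here is a minimal sketch of the idea behind a house-price data-set: each row links feature values to a price, and a model learns the pattern between them so it can generalize to new instances. The numbers below are made up for illustration, not taken from the actual Kaggle data.

```python
# Hypothetical feature column (living area in sq ft) and target (price).
areas = [1200, 1500, 1800, 2100, 2400]
prices = [150000, 185000, 220000, 255000, 290000]

# Fit a simple least-squares line: price ≈ slope * area + intercept.
n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
slope_den = sum((x - mean_x) ** 2 for x in areas)
slope = slope_num / slope_den
intercept = mean_y - slope * mean_x

# Generalize to a new instance: estimate the price of a 2000 sq ft house.
estimate = slope * 2000 + intercept
print(round(estimate))  # 243333
```

A real model would of course use many feature columns and a proper library, but the principle is the same: the data-set exists so that a relationship like this can be learned.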
By knowing the problem you want to solve, you would have already covered half of the task because every other step listed below depends on this one.
2. Identify the Problem’s Modeling Task
Is the problem a regression, classification, or an intuitive task? Classification is about predicting a label, while regression predicts a quantity (source: Machine Learning Mastery).
In most cases, the task you want to address will be either regression or classification, because most tasks fall under one of these two categories. Clustering tasks, for their part, are just classification tasks without labeled data (unsupervised learning).
An intuitive task aims to extract understanding from data. It combines qualities of regression and classification, using statistics to find reasons within the data.
Most surveys are intuitive in nature. They collect data from customers in an attempt to understand how effective a business is and what can be done to improve it.
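One rough way to think about the regression/classification split is to look at the target column: a numeric target suggests predicting a quantity, a categorical one suggests predicting a label. The helper below is a hypothetical heuristic for illustration, not a rule from any library.

```python
def modeling_task(target_values):
    """Guess whether a target column suggests regression or classification."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    )
    return "regression" if numeric else "classification"

print(modeling_task([150000, 185000, 220000]))  # numeric target -> regression
print(modeling_task(["spam", "ham", "spam"]))   # label target  -> classification
```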
At this step, you also decide what kind of data best addresses the problem: quantitative or qualitative.
• Quantitative data are usually numeric and graphical data interpreted using statistical methods.
• Qualitative data are usually categorical data (words/characters) and are analyzed through interpretations and categorizations.
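The two kinds of data call for different summaries, which a short sketch can show. The survey-style values below are made up for illustration.

```python
import statistics
from collections import Counter

# Quantitative: numeric data, summarized with statistical methods.
ages = [23, 31, 27, 45, 31]
print(statistics.mean(ages))    # 31.4
print(statistics.median(ages))  # 31

# Qualitative: categorical data, analyzed by categorizing and counting.
satisfaction = ["happy", "neutral", "happy", "unhappy", "happy"]
print(Counter(satisfaction).most_common(1))  # [('happy', 3)]
```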
At the end of this step, you would have given your data-set a definite purpose in line with the problem you wish to address.
3. Source for Information
This is a simple but essential and time-consuming process. It’s at this step that you engage in research about the problem you intend to address.
This is where you build more intuition about the problem, getting specifics about the kinds of data that would best address the problem. It is also where you can best determine the data collection/extraction method that would be efficient.
After sourcing for information, you will have built enough intuition about the problem and what kind of data to collect, as well as where or how to source it.
4. Structure the Data-Set
You might feel ready to get the data now, but take a moment to step back and consider how the data should be structured for effective usage.
The structure here refers to how the various feature columns should be ordered. Consider whether tabular data is the best way to organize the data, or whether it would be better structured as a folder of files, etc.
In most cases, you would have already figured this out when sourcing the information from the previous step.
If the structural quality of the data-set is poor, the data it contains will be hard to work with, and all the effort from the earlier steps will be to no avail.
So it is important to take some time to figure out how the data will be organized.
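For tabular data, the structure decision often comes down to fixing the feature columns and their order before any extraction begins. Below is a small sketch using Python's standard `csv` module; the column names are hypothetical, and the in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Decide the feature columns (and their order) up front.
columns = ["house_id", "area_sqft", "bedrooms", "price"]
rows = [
    {"house_id": 1, "area_sqft": 1200, "bedrooms": 2, "price": 150000},
    {"house_id": 2, "area_sqft": 1800, "bedrooms": 3, "price": 220000},
]

# Write to an in-memory buffer; in practice this would be a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Fixing the schema early means every extracted record lands in a predictable shape, which pays off during cleaning and modeling.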
5. Extract the Data
There are two forms of data gathering: Data Collection and Data Extraction. Both attempt to gather data, but they differ in the methods they use.
Data Collection is gathering and measuring information, most often with software. Data Extraction is where data sources (like databases or websites) are crawled and analyzed to retrieve relevant information in a specific pattern.
For Data Collection, the major methods are:
1. Interviews
2. Questionnaires and surveys
3. Observations
4. Documents and records
5. Focus groups
6. Oral histories
There are numerous tools, methods, and types of Data Extraction. Analyzing them is beyond the scope of this lesson; it would probably warrant a lesson of its own.
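Still, a tiny taste of extraction helps make the idea concrete. The sketch below pulls values out of a made-up HTML snippet using only Python's standard library; real scraping projects commonly reach for libraries such as BeautifulSoup or Scrapy instead, and the markup here is purely illustrative.

```python
from html.parser import HTMLParser

# A made-up stand-in for a page you might scrape.
html = """
<ul>
  <li class="price">150000</li>
  <li class="price">220000</li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collect the text of every <li class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(int(data.strip()))
            self.in_price = False

parser = PriceParser()
parser.feed(html)
print(parser.prices)  # [150000, 220000]
```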
The extraction tool you need should have been identified while sourcing for information on the domain problem.
6. Clean the Data-Set
This is the final step. After successfully extracting the data, you focus on data cleaning.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. (source: Tableau)
During data cleaning, you want to ensure every form of duplicated, incorrectly formatted, or incomplete data is cleaned properly.
This way, they are well suited for machine learning models to effectively address the problem domain.
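A basic cleaning pass can be sketched in a few lines: drop incomplete rows, normalize formatting, and remove duplicates. The records below are made up, and real pipelines typically use a library like pandas for this.

```python
raw_rows = [
    {"name": " Alice ", "age": "31"},
    {"name": "Bob", "age": None},      # incomplete: missing age
    {"name": " Alice ", "age": "31"},  # duplicate of the first row
    {"name": "Carol", "age": "27"},
]

cleaned = []
seen = set()
for row in raw_rows:
    if any(v is None for v in row.values()):
        continue  # drop incomplete rows
    # Fix formatting: strip whitespace, convert age to a number.
    fixed = {"name": row["name"].strip(), "age": int(row["age"])}
    key = (fixed["name"], fixed["age"])
    if key in seen:
        continue  # drop duplicates
    seen.add(key)
    cleaned.append(fixed)

print(cleaned)  # [{'name': 'Alice', 'age': 31}, {'name': 'Carol', 'age': 27}]
```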
Conclusion
We started by introducing the importance of data-sets to machine learning and their purpose in helping us mimic the human learning process.
Furthermore, we looked at how to create a data-set. Here is a quick recap on the things to consider:
1. Addressing the problem
2. Identifying the problem’s modeling task
3. Sourcing for information on the data to be extracted
4. Structuring the format of the data-set
5. Extracting the data and then
6. Cleaning it up.
Finally, we talked about how these factors are essential for creating a data-set. With these steps, we’ll have a data-set that is easily interpretable by machine learning engineers and machine learning models. Below are links to two resources to help you get started with the data-set collection process.