Write down any three differences between data acquisition and data exploration.
Answers
Answer:
Data acquisition: Getting the data either from a primary source (i.e. collecting the data yourself, e.g. by tracking custom events in your app, conducting a survey, or running an experiment) or from a secondary source (e.g. purchasing a data set from Bloomberg or downloading it from Kaggle).
Data cleaning: A subset of data preparation. First, fixing any issues related to data type (e.g. casting data to the correct data type, removing leading and trailing spaces, checking for unwanted truncations and encoding issues). Second, detecting and removing duplicates (which might be fuzzy, i.e. not exactly identical) and finding potentially inconsistent values. Inconsistencies can be found by defining and applying business rules, e.g. shipping_weight is in the range (0.1, 5), or by applying outlier detection methods (statistical or machine learning approaches) - here is a blog post with a simple explanation of outlier detection.
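As a rough illustration of the cleaning steps above, here is a minimal sketch in plain Python. The record layout, the `sku` field and the function name `clean_records` are assumptions made up for this example; only `shipping_weight` and its range (0.1, 5) come from the answer. It handles exact duplicates only, so fuzzy duplicates would need extra matching logic.

```python
def clean_records(rows, min_weight=0.1, max_weight=5.0):
    """Return (clean_rows, rejected_rows) after basic cleaning steps.

    Hypothetical helper: each row is a dict with string values for
    "sku" and "shipping_weight".
    """
    seen = set()
    clean, rejected = [], []
    for row in rows:
        # Step 1: fix type-related issues (trim spaces, cast weight to float).
        sku = row["sku"].strip()
        try:
            weight = float(row["shipping_weight"].strip())
        except ValueError:
            rejected.append(row)  # value cannot be cast to a number
            continue
        # Step 2: drop exact duplicates (compared after trimming and casting).
        key = (sku, weight)
        if key in seen:
            continue
        seen.add(key)
        # Step 3: apply the business rule 0.1 < shipping_weight < 5.
        if min_weight < weight < max_weight:
            clean.append({"sku": sku, "shipping_weight": weight})
        else:
            rejected.append(row)
    return clean, rejected


raw = [
    {"sku": " A1 ", "shipping_weight": "1.5"},
    {"sku": "A1", "shipping_weight": " 1.5"},   # duplicate once trimmed
    {"sku": "B2", "shipping_weight": "12.0"},   # violates the business rule
    {"sku": "C3", "shipping_weight": "n/a"},    # not a number
]
clean, rejected = clean_records(raw)
print(clean)
print(len(rejected))
```

Rejected rows are returned instead of being thrown away, so they can be reviewed by hand. Whether to drop, fix or flag such rows depends on the project.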
Data preparation: Includes data cleaning and data integration. Data integration means combining different data sets, potentially coming from different sources. One challenge here is finding keys for combining the data sets (i.e. attributes that can be used to match the records). Another challenge is that those keys may differ slightly between sources if you do not have a unique ID (e.g. if you match via FirstName+LastName, one table could include a second first name while the other doesn’t). Here is a blog post explaining how to perform data preparation in Excel. For statistical analysis, I would add the process of generating and encoding variables to that definition of data preparation.
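The FirstName+LastName matching problem can be sketched as follows. This is only one possible approach, assuming that keeping the first given name is enough to line the two tables up; the table contents and the helper names `name_key` and `integrate` are invented for the example, and names are assumed to be non-empty.

```python
def name_key(first_name, last_name):
    """Build a match key from the first given name and the last name.

    Assumption: a second first name ("Anna Maria") can be ignored.
    """
    first = first_name.strip().split()[0].lower()
    return (first, last_name.strip().lower())


def integrate(customers, orders):
    """Inner-join orders onto customers using the normalised name key."""
    by_key = {name_key(c["first_name"], c["last_name"]): c for c in customers}
    merged = []
    for order in orders:
        customer = by_key.get(name_key(order["first_name"], order["last_name"]))
        if customer is not None:
            # Combine the customer attributes with the order amount.
            merged.append({**customer, "order_total": order["order_total"]})
    return merged


# Two hypothetical sources: one stores a second first name, the other does not.
customers = [{"first_name": "Anna Maria", "last_name": "Schmidt", "city": "Berlin"}]
orders = [{"first_name": "Anna", "last_name": "Schmidt", "order_total": 42.0}]

merged = integrate(customers, orders)
print(merged)
```

A key like this can wrongly merge two different people who share a first and last name, which is why a unique ID is preferable whenever one exists.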
Answer:
Data Acquisition:
As the term suggests, this stage is about acquiring data for the project. Let us first understand what data is. Data can be a piece of information, or facts and statistics collected together for reference or analysis. Whenever we want an AI project to be able to predict an output, we need to train it first using data. For example, if you want to make an Artificially Intelligent system which can predict the salary of any employee based on his previous salaries, you would feed the data of his previous salaries into the machine. This is the data with which the machine can be trained. Once it is ready, it can predict his next salary. The previous salary data here is known as the Training Data, while the data set used to check the next-salary prediction is known as the Testing Data.
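To make the salary example concrete, here is a minimal sketch that trains on earlier salaries and checks the prediction against a held-out one. The salary figures are invented, and a straight-line (least-squares) fit is only an assumed stand-in for whatever model a real project would use.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical salary history of one employee (year index, salary).
years = [1, 2, 3, 4]
salaries = [30000, 32000, 34000, 36000]

# Training data: the earlier salaries the model learns from.
train_years, train_salaries = years[:3], salaries[:3]
# Testing data: a salary kept aside to check the prediction.
test_year, test_salary = years[3], salaries[3]

slope, intercept = fit_line(train_years, train_salaries)
predicted = slope * test_year + intercept
print(f"predicted={predicted:.0f}, actual={test_salary}")
```

Keeping the last salary out of training is what makes it usable as a check: the model has never seen it, so the gap between predicted and actual values says something about how well it generalises.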
Data Exploration:
While acquiring data, you must have noticed that data is a complex entity: it is full of numbers, and anyone who wants to make sense of it has to work out the patterns in it. For example, if you go to the library and pick up a random book, you first go through its content quickly by turning the pages and reading the description before borrowing it, because this helps you understand whether the book suits your needs and interests.
Thus, to analyse the data, you need to visualise it in some user-friendly format so that you can:
● Quickly get a sense of the trends, relationships and patterns contained within the data.
● Define a strategy for which model to use at a later stage.
● Communicate the same to others effectively.
To visualise data, we can use various types of visual representations.
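As a small illustration of exploring data before modelling, the sketch below prints summary statistics and a text bar chart using only the Python standard library. The monthly sales figures and the helper names `summarise` and `text_bar_chart` are made up for the example, and values are assumed to be positive. In practice a plotting library would normally be used instead of text bars.

```python
import statistics


def summarise(values):
    """Return a few basic statistics that give a first feel for the data."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def text_bar_chart(labels, values, width=20):
    """Return one line per label, with a bar scaled to the largest value."""
    largest = max(values)
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<6}{bar} {value}")
    return lines


# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [10, 20, 40, 30]

summary = summarise(sales)
chart = text_bar_chart(months, sales)
print(summary)
for line in chart:
    print(line)
```

Even this crude chart makes the peak in March visible at a glance, which is the kind of quick pattern-spotting the list above describes.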