MrGrey
Just saw this on social media, exactly like you said: data cleaning is a big part of the work.
What I am learning now is that most of a data scientist's time is spent on preprocessing/cleaning the data. Getting the right data into the right format is the main challenge: you might have to go through hundreds of tables to select the data you need, and in most cases the features will be in different formats.
Take missing values as an example: dropping everything with a missing value is not always an option if that would mean significant data loss. There are many ways to deal with them, and one is to replace missing numerical fields with a sentinel value like -999.
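For instance, a minimal pandas sketch of the sentinel approach (toy data, illustrative only):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with missing numeric values
df = pd.DataFrame({"age": [25, np.nan, 31],
                   "income": [50000, 62000, np.nan]})

# Instead of dropping rows (data loss), fill missing numeric
# fields with a sentinel like -999 so models can treat
# "missing" as its own signal
filled = df.fillna(-999)
```

Whether a sentinel is appropriate depends on the model; tree-based models tolerate it well, while linear models usually prefer imputation.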
Similarly, duplicate data, categorical data, etc. each need a different approach depending on the problem.
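Two common cases sketched in pandas (made-up data; the right treatment is always problem-specific):

```python
import pandas as pd

# Hypothetical table with an exact duplicate row and a categorical column
df = pd.DataFrame({"city": ["NY", "NY", "LA"],
                   "sales": [10, 10, 7]})

# Drop exact duplicate rows
deduped = df.drop_duplicates()

# One-hot encode the categorical column so models
# that need numeric input can use it
encoded = pd.get_dummies(deduped, columns=["city"])
```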
You can do the cleaning in SQL and then import into Python/R, or you can import the raw data and clean it in Python/R, depending on your preference. Personally I like to do it in Python/R, as I find it easier.
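The "import raw, clean in Python" route might look like this (sqlite stands in for the real database, and the table is invented for the example):

```python
import sqlite3
import pandas as pd

# Stand-in database with one NULL value in the raw data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", None)])

# Pull the raw rows into pandas, then clean there
raw = pd.read_sql("SELECT * FROM sales", conn)
clean = raw.dropna()
```

The alternative is to push the `WHERE amount IS NOT NULL` filtering into the SQL query itself; which side does the cleaning is purely a workflow choice.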
Once all the data is cleaned and each feature engineered, the features are merged together one by one. This step is also critical: if the features are not engineered consistently (e.g. the keys don't line up), the merge will fail.
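A small sketch of that one-by-one merge, with hypothetical feature tables keyed by `customer_id`:

```python
import pandas as pd

# Two engineered feature tables sharing a key (invented data)
a = pd.DataFrame({"customer_id": [1, 2, 3],
                  "avg_spend": [10.0, 20.0, 15.0]})
b = pd.DataFrame({"customer_id": [1, 2, 3],
                  "n_orders": [3, 5, 2]})

# Merge on the shared key; validate="one_to_one" makes pandas
# raise immediately if a table was engineered with duplicate
# keys, surfacing the problem at merge time
features = a.merge(b, on="customer_id", validate="one_to_one")
```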
All the fun parts, visualization and prediction, come after this stage.
Probably 70-80% of a data scientist's time is spent getting the data ready, and this is the hardest part.