Daily Dose of Data Science – Day 2 – EDA made easy with Pandas Profiling
Welcome to day 2 of the Daily Dose of Data Science series! In this post we will discuss a very useful framework called Pandas Profiling for performing a thorough exploratory data analysis in Python!
I believe Exploratory Data Analysis (EDA) is one of the most important steps of the data science process. It is all about knowing your data, forming the right hypotheses that help you model your data correctly, and then extracting the necessary insights in the most useful way. EDA can be challenging, and since the process has so many aspects, it is very easy to miss a key step and thereby overlook some key information about the dataset itself! But don't worry: Pandas Profiling has got your back!
Pandas Profiling is an open-source Python framework that automatically generates a report after performing a thorough Exploratory Data Analysis on your dataset. It tries to cover almost every aspect involved in a rigorous EDA. So instead of focusing on how to perform the EDA process, data scientists can focus on consuming the information in the generated report, forming hypotheses, and using them to derive key insights or to model the dataset for the underlying task. Before doing a deep dive, I would strongly recommend visiting the GitHub page of Pandas Profiling to get more details than what I have included in this post.
The setup!
The setup is very simple using pip; nothing else is needed, as pip automatically takes care of all the dependencies required by the package:
pip install pandas-profiling
What does it do?
The framework can generate a rich and detailed EDA report from any pandas dataframe. You just need to call df.profile_report() and the framework takes care of the rest!
Here is the information that the framework can seamlessly extract:
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values.
- Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range.
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
- Most frequent values.
- Histograms.
- Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices.
- Missing values: matrix, count, heatmap and dendrogram of missing values.
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
So from detecting missing values to applying statistical methods on structured as well as unstructured data, it does it all for you!
Isn’t it neat? Yes, absolutely! But there is one small drawback I have found while using this package: it takes a lot of CPU memory and time to process larger datasets. Since it does all the manual work for you, though, that is a very small trade-off and a small sacrifice! Please visit my GitHub repository or the official website of this framework to get your hands dirty and try it out yourself!
Finally save your EDA reports!
You can even save the automatically generated report as an interactive HTML page using one simple command:
profile.to_file("D3S_Day2.html")
Well, that’s all for today! I would strongly recommend trying out Pandas Profiling on your dataset, and comment below if it helped speed up your data science project, or if you found any issues or drawbacks while using it! Visit again for another daily dose of data science, and please feel free to like, share, comment and subscribe to my posts if you find them helpful!