The value of data profiling in one line of code
EXPLORE YOUR DATA AND EXPLAIN A PHENOMENON QUICKLY
Data Profiling has traditionally judged the value of data by its quality (e.g., How complete is it? Does it have serious integrity issues? Can statistics be generated from the existing data to explain something of interest or importance?). However, now that modelling has become commonplace, the value of data can also be measured by how well something of interest can be modeled or explained, which on many occasions is the reason for “EDA” in the first place. In that sense, Data Profiling has advanced a bit as it relates to Data Science.
Exploratory Data Analysis, or “EDA”, is standard practice prior to modelling. In fact, CRISP-DM, or the “Cross-industry standard process for data mining”, includes “Data Understanding” and “Data Preparation” as key phases, and EDA crosses over into both. In simple terms, EDA consists of Univariate and Multivariate analysis, where “Uni-” covers a single variable and “Multi-” covers more than one variable crossed in some way. Each can be represented graphically or non-graphically, that is, as a chart or as a table.
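To make the four combinations concrete, here is a minimal sketch in base R. It uses the built-in mtcars data purely as a stand-in (it is not the data discussed in this article), and the choice of variables is arbitrary.

```r
# Illustration only: mtcars ships with R and is not the article's data.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Univariate, non-graphical: summary of a single numeric variable
mpg_summary <- summary(cars$mpg)
print(mpg_summary)

# Univariate, non-graphical: frequency table of a single categorical variable
am_counts <- table(cars$am)
print(am_counts)

# Multivariate, non-graphical: mean of one variable within groups of another
mpg_by_am <- aggregate(mpg ~ am, data = cars, FUN = mean)
print(mpg_by_am)

# Univariate, graphical: distribution of a single variable
hist(cars$mpg, main = "Distribution of mpg", xlab = "mpg")

# Multivariate, graphical: one variable crossed with another
boxplot(mpg ~ am, data = cars, main = "mpg by transmission type")
```

Doing this by hand for every variable and every pairing is exactly the repetitive work an automated profiling report takes over.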
I came across the DataExplorer package for R through a contact who is a fellow consultant and Data Scientist. At first I was skeptical, but I soon realized that this is the best package ever created for experienced data people. By experienced, I mean professionals who have done countless EDAs by hand. Why experienced? DataExplorer produces an automated Data Profiling report that includes the key EDA one should perform prior to modelling. However, if you haven’t learned why one would run different types of plots and summaries on certain types of variables (e.g., numeric vs. categorical) and how to interpret the outputs, the report is hard to read and somewhat dangerous in the wrong (inexperienced) hands.
IBM EXAMPLE DATA
The data set is called “Sales Win Loss” and is described as a way to “Understand your sales pipeline and uncover what can lead to successful sales opportunities and better anticipate performance gaps”. It is available here.
With the sample data set and the goal of predicting whether a sales pipeline opportunity ends in a “WIN” or a “LOSS”, a single line of code can tell us a lot. We can call create_report() and indicate the outcome/target/dependent variable/other miscellaneous term for what we are attempting to explain (“y”, here “WIN/LOSS”). This gives us a detailed report on the data that is built with the thing we are trying to explain in mind.
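A minimal sketch of that call follows. The data frame here is synthetic, and its column names (deal_amount, days_in_stage, region, result) and the output file name are made up for illustration; they are not the fields of the IBM sample file. It also assumes DataExplorer, rmarkdown and pandoc are installed, since the report is rendered to HTML.

```r
library(DataExplorer)

# Synthetic stand-in for the sales data. Column names and values are
# invented for illustration and do NOT match the IBM sample file.
set.seed(42)
n <- 200
sales <- data.frame(
  deal_amount   = round(runif(n, 1000, 100000)),
  days_in_stage = rpois(n, 20),
  region        = sample(c("East", "West", "Central"), n, replace = TRUE),
  result        = rep(c("WIN", "LOSS"), times = n / 2)
)

# The one line of profiling: `y` names the response column, which
# create_report() passes on to the plotting functions that can use it.
create_report(sales, y = "result", output_file = "sales_profile.html")
```

With real data, only the last line is needed once the data frame is loaded: swap in your own data frame and the name of your outcome column.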
QUICK TAKEAWAYS FROM THE EXAMPLE DATA
The Data Profiling report provides a number of different outputs. Some, such as the correlation matrix, deserve a closer look (though it can be difficult to read when there are many features and categories), but the most valuable in this context are the bivariate analyses, which involve two variables. To put numbers to the difference between a “WIN” and a “LOSS” and confirm the takeaways, I added summary statistics by “y” using describeBy(), available in the psych package, which is often used for psychometrics.
Compared with losses, wins:
are of smaller deal size
have lower opportunity amounts (in dollars)
come from clients with more revenue collected over the past two years, suggesting wins are more likely to come from return clients
change sales stages more often
spend fewer days in sales stages from identified/validating to gained agreement/closing (less time in the sales pipeline)
spend fewer days in “Siebel Stages” from identified/validating to qualified/gaining agreement (less time in the sales pipeline)
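The grouped summary statistics behind takeaways like these can be sketched as follows. As before, the data frame is synthetic and its column names are hypothetical, so the numbers it prints say nothing about the real sales data; the sketch only shows the shape of the describeBy() call.

```r
library(psych)

# Synthetic stand-in data. Column names and values are invented for
# illustration and do NOT match the IBM sample file.
set.seed(42)
n <- 200
sales <- data.frame(
  deal_amount   = round(runif(n, 1000, 100000)),
  days_in_stage = rpois(n, 20),
  result        = rep(c("WIN", "LOSS"), times = n / 2)
)

# Descriptive statistics (n, mean, sd, median, ...) for the numeric
# columns, computed separately for each value of the outcome.
by_result <- describeBy(
  sales[, c("deal_amount", "days_in_stage")],
  group = sales$result
)
print(by_result)
```

Reading the two blocks of output side by side is what lets you say, for example, whether the mean deal size really is lower for one outcome than for the other.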
IN SUMMARY, THE VALUE
In summary, DataExplorer can give you rich insights very quickly, especially if you are interested in explaining a certain outcome (e.g., wins/losses in a sales pipeline).