The value of data profiling in one line of code

EXPLORE YOUR DATA AND EXPLAIN A PHENOMENON QUICKLY


FULL DATA PROFILING REPORT AVAILABLE HERE ON SHINY

DATA PROFILING

Data Profiling has traditionally measured the value of data by its quality (i.e., How complete is it? Does it have serious integrity issues? Can statistics be generated from the existing data to explain something of interest or importance?). However, now that modelling has become commonplace, the value of data can also be measured by how well something of interest can be modeled or explained, which on many occasions is the reason for “EDA”. Hence, Data Profiling has advanced a bit as it relates to Data Science.

EDA

Exploratory Data Analysis or “EDA” is a standard step prior to modelling. In fact, CRISP-DM, or the “Cross-industry standard process for data mining”, includes “Data Understanding” and “Data Preparation” as key phases, and EDA crosses over into both. In simple terms, EDA consists of Univariate and Multivariate analysis, where “Uni-” is a single variable and “Multi-” includes more than one variable crossed in some way. Additionally, EDA can be graphical or non-graphical, which translates to a chart or a table.
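To make the four combinations concrete (univariate or multivariate, graphical or non-graphical), here is a minimal sketch in base R. It uses the built-in mtcars data purely for illustration, not the sales data discussed later, and the relabelled transmission column is my own convenience.

```r
# Built-in example data (NOT the sales data discussed in this post)
data(mtcars)
df <- mtcars
# Treat transmission as a categorical variable for illustration
df$am <- factor(df$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Univariate, non-graphical: summary statistics for one numeric variable
summary(df$mpg)

# Univariate, non-graphical: frequency table for one categorical variable
am_counts <- table(df$am)
am_counts

# Univariate, graphical: histogram of one numeric variable
hist(df$mpg, main = "Distribution of mpg", xlab = "mpg")

# Multivariate, non-graphical: mean of a numeric variable per category
mpg_by_am <- aggregate(mpg ~ am, data = df, FUN = mean)
mpg_by_am

# Multivariate, graphical: boxplot of a numeric variable per category
boxplot(mpg ~ am, data = df, main = "mpg by transmission type")
```

Which summary or plot is appropriate depends on whether each variable is numeric or categorical, which is exactly the judgment that manual EDA experience builds.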

DATAEXPLORER

I came across this package through a contact who is a fellow consultant and Data Scientist. At first I was skeptical, but I soon realized that this is the best package ever created for experienced data people. By experienced, I mean professionals who have performed EDA manually countless times. Why experienced? Well, DataExplorer produces an automated Data Profiling report that includes the key EDA one should perform prior to modelling. However, if you haven’t learned why one would produce different types of plots and summaries for certain types of variables (e.g., numeric vs. categorical) and how to interpret these outputs, then the report is not easy to interpret and is somewhat dangerous in the wrong (inexperienced) hands.
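For readers who want to see the kind of building blocks such a report automates, here is a rough sketch that calls a few DataExplorer functions one at a time. It assumes the DataExplorer package is installed and uses the built-in iris data as a stand-in; the selection of functions is mine and is not a complete list of what the report contains.

```r
library(DataExplorer)

# Built-in example data (NOT the sales data discussed in this post)
data(iris)

# Basic profile: row/column counts, discrete vs. continuous columns, missing values
intro <- introduce(iris)
intro

# Share of missing values per column
plot_missing(iris)

# Univariate views: histograms for numeric columns, bar charts for categorical ones
plot_histogram(iris)
plot_bar(iris)

# Multivariate views: correlation heatmap and boxplots split by a categorical column
plot_correlation(iris)
plot_boxplot(iris, by = "Species")
```

Knowing which of these views applies to which variable type is what makes the automated report readable.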

IBM EXAMPLE DATA

The data set is called “Sales Win Loss”, and its stated purpose is to: “Understand your sales pipeline and uncover what can lead to successful sales opportunities and better anticipate performance gaps”. It is available here.

ENTER DATAEXPLORER

Using the sample data set, with the goal of understanding what leads to a sales pipeline “WIN” or “LOSS”, a single line of code can tell us a lot. In fact, we can call create_report() and indicate the outcome/target/dependent variable (or whichever other term you prefer for what we are attempting to explain) through “y”, here “WIN/LOSS”. This gives us a detailed report on the data we are studying, built with what we are trying to explain in mind.
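As a minimal sketch of that one-liner: the data frame below is simulated and its column names are hypothetical stand-ins, not the actual fields of the IBM file, and the output file name is my own choice. It assumes DataExplorer, rmarkdown, and pandoc are installed.

```r
library(DataExplorer)

# Simulated stand-in data; column names are hypothetical, NOT the IBM fields
set.seed(42)
n <- 200
sales <- data.frame(
  opportunity_amount = round(runif(n, 1000, 500000)),
  days_in_pipeline   = rpois(n, 40),
  region             = sample(c("West", "East", "Central"), n, replace = TRUE),
  result             = sample(c("WIN", "LOSS"), n, replace = TRUE)
)

# The one line: profile every column and relate it to the outcome named in `y`
create_report(sales, y = "result",
              output_file = "sales_profile.html", output_dir = tempdir())
```

With the real data, you would read the file into a data frame and pass the name of its win/loss column as y instead.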

QUICK TAKEAWAYS FROM THE EXAMPLE DATA

The Data Profiling report provides a number of different outputs. While there are several outputs that we would want to look at in more detail, such as the correlation matrix (which can be difficult to read when there are many features and categories), the most valuable in this context are the bivariate (involving two variables) analyses. To put some numbers to the difference between a “WIN” and a “LOSS” and confirm the takeaways, I added summary statistics by “y” using describeBy(), available in the psych package, a package often used for psychometrics.
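A grouped summary of this kind can be sketched with psych::describeBy() (note the capital B). The data is again simulated with hypothetical column names, so the numbers it prints say nothing about the real sales data. It assumes the psych package is installed.

```r
library(psych)

# Simulated stand-in data; column names are hypothetical, NOT the IBM fields
set.seed(42)
n <- 200
sales <- data.frame(
  opportunity_amount = round(runif(n, 1000, 500000)),
  days_in_pipeline   = rpois(n, 40),
  result             = sample(c("WIN", "LOSS"), n, replace = TRUE)
)

# Summary statistics (n, mean, sd, median, ...) for the numeric columns,
# computed separately for each level of the outcome
by_result <- describeBy(sales[, c("opportunity_amount", "days_in_pipeline")],
                        group = sales$result)
by_result
```

Comparing the per-group means and medians side by side is what lets you attach numbers to the patterns seen in the report's bivariate plots.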

Quick takeaways:

  • Wins:

    • are of smaller deal size

    • have lower opportunity amounts (in dollars)

    • come from clients with higher revenue collected in the past two years, suggesting wins are more likely to come from return clients

    • change sales stages more often

    • spend fewer days in sales stages from identified/validating to gained agreement/closing (less time in sales pipeline)

    • spend fewer days in “Siebel Stages” from identified/validating to qualified/gaining agreement (less time in sales pipeline)

IN SUMMARY, THE VALUE

In summary, DataExplorer can give you some rich insights very quickly, especially if you are interested in explaining a certain outcome (e.g., wins/losses in a sales pipeline).

GitHub: Los-Angeles-Data-Analytics

satRday 2019: April 6, 2019 at UCLA

GitHub/Los-Angeles-Data-Analytics and My "VisualResume"
