Your Data Strategy Session

Machine Learning: Data Analysis and Discovery

Jun 9

Written By Amir Vastani

Excerpts from Qlik AutoML training: key points to work through before running any machine learning (ML) analysis.

Data analysis and discovery

Explore the data

How many observations have you collected?

Is this enough to predict on?
Are there gaps or nulls in key data points?
Do you need to reassess your data collection?
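
The questions above can be checked quickly with pandas. This is a minimal sketch, not part of the Qlik training material; the column names and sample values are hypothetical.

```python
import pandas as pd

def summarize_coverage(df: pd.DataFrame, key_columns: list[str]) -> pd.DataFrame:
    """Report the row count plus null count and null share for each key column."""
    nulls = df[key_columns].isna().sum()
    return pd.DataFrame({
        "rows": len(df),               # total observations collected
        "null_count": nulls,           # missing values per key column
        "null_share": nulls / len(df), # fraction of rows that are missing
    })

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [52000.0, None, 61000.0, None],
    "churned": ["yes", "no", "no", "yes"],
})
coverage = summarize_coverage(df, ["income", "churned"])
print(coverage)
```

A high null share in a key column is a prompt to revisit how that data is collected before modeling.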

Do you understand the distribution of your data?

Do you have a normal distribution?
Is it skewed? Are there outliers?
What are the range, mean, and median?
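
One way to answer the distribution questions above is sketched below. The 1.5 x IQR rule for flagging outliers is a common convention assumed here, not something the training prescribes, and the sample values are made up.

```python
import pandas as pd

def describe_distribution(values: pd.Series) -> dict:
    """Summarize range, center, skew, and IQR-based outlier count."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Assumption: a point beyond 1.5 * IQR from the quartiles counts as an outlier.
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return {
        "min": values.min(),
        "max": values.max(),
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),  # > 0 means a longer right tail
        "n_outliers": int(is_outlier.sum()),
    }

# Hypothetical order values with one extreme observation.
order_value = pd.Series([10, 12, 11, 13, 12, 11, 95])
summary = describe_distribution(order_value)
print(summary)
```

A mean well above the median, together with positive skew, suggests a right tail worth inspecting.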

Which data is time-variant and which is non-time-variant?

For time-variant data, is it timestamped so that it can be aggregated appropriately?
Will it be available at the time of the prediction?
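
A rough sketch of aggregating time-variant data so that only records available before the prediction is made are used. The table layout, column names, and cutoff date are assumptions for illustration.

```python
import pandas as pd

def aggregate_before(events: pd.DataFrame, prediction_point: pd.Timestamp,
                     id_col: str = "customer_id", ts_col: str = "timestamp",
                     value_col: str = "amount") -> pd.DataFrame:
    """Aggregate per entity using only events stamped before the prediction point."""
    known = events[events[ts_col] < prediction_point]
    return known.groupby(id_col)[value_col].agg(["count", "sum"])

# Hypothetical transaction log.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-20", "2024-01-15", "2024-03-25"]
    ),
    "amount": [20.0, 30.0, 50.0, 10.0, 40.0],
})
features = aggregate_before(events, pd.Timestamp("2024-03-01"))
print(features)
```

Without timestamps this filter is impossible, which is why untimestamped time-variant data is hard to use safely.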

Review correlations - get early insights into the data in order to refine your hypothesis

Target correlation

Check whether the target stratifies by, or correlates with, some of the features

Check for signals

Are the directional patterns in the feature-to-target relationships intuitive?

Correlation matrix

Features that are highly correlated with one another may be redundant and add noise rather than additional signal
Consider selecting a single feature from groups that appear to capture the same behaviors in the data
Otherwise, determine whether a single underlying feature is driving both
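
A small sketch of scanning a correlation matrix for redundant feature pairs. The 0.9 threshold and the column names are assumptions; pick a cutoff that suits your data.

```python
from itertools import combinations

import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list[tuple]:
    """Return (feature_a, feature_b, |correlation|) for pairs above the threshold."""
    corr = df.corr().abs()
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if corr.loc[a, b] > threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Hypothetical features: tenure is recorded twice in different units.
months = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({
    "tenure_months": months,
    "tenure_years": [m / 12 for m in months],
    "support_calls": [3, 1, 4, 1, 5, 2],
})
redundant = correlated_pairs(df)
print(redundant)
```

Here the two tenure columns carry the same information, so keeping one of them is enough.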

Apply business knowledge

Is historical data indicative of today's operating environment?

Have systems or data collection practices significantly changed during the collection window?

How does your domain experience explain and validate the data?

If the data doesn't align with your assumptions, it could mean there are data issues or the assumptions are off

What additional features need to be collected or engineered?

Clean and generalize - aim for a model that generalizes whenever possible

Remove outliers, which could impede an algorithm's ability to discern general patterns in the data
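
As an illustration of outlier removal, the sketch below drops rows outside IQR-based fences. The rule and the multiplier k are assumptions made for this example; the training does not name a specific method.

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose value lies within k * IQR of the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    within = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[within]

# Hypothetical readings with one implausible value.
readings = pd.DataFrame({"weight_g": [100, 102, 98, 101, 99, 1000]})
cleaned = remove_outliers_iqr(readings, "weight_g")
print(cleaned)
```

It is worth confirming with domain knowledge that a dropped row is an error and not a rare but real event.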

Address distribution oddities like skews, tails, multi-modal shapes in your data

May require additional data transformation or future feature design
One hint: group low-volume categories, and round or remove tails in numeric features

Replace null or missing values

Use "Other" or "Unknown" when appropriate, in order to gain extra value from a sparse column
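
A minimal sketch of both ideas for a categorical column: missing values become "Unknown" and low-volume categories are grouped into "Other". The min_count cutoff and the sample values are hypothetical.

```python
import pandas as pd

def tidy_category(values: pd.Series, min_count: int = 2) -> pd.Series:
    """Fill missing values with 'Unknown' and group rare categories as 'Other'."""
    filled = values.fillna("Unknown")
    counts = filled.value_counts()
    rare = counts[counts < min_count].index  # categories seen fewer than min_count times
    return filled.where(~filled.isin(rare), "Other")

# Hypothetical sparse loyalty-tier column.
tier = pd.Series(["gold", "gold", "silver", None, "bronze", "gold", None, "platinum"])
tidy = tidy_category(tier)
print(tidy.tolist())
```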

Address correlated features

By removing redundant features
By engineering new features to extract additional information

Feature engineering - the process of creating new features from current ones

To gain additional predictive power from source data collected to address a business question
Date feature engineering

Parsing a date into separate columns (MM, DD, YY)
Creating segments like seasons, quarters, semesters
Calculating the difference between two dates
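
The date ideas above might look like this in pandas. Column names are hypothetical, and the season mapping assumes northern-hemisphere meteorological seasons.

```python
import pandas as pd

# Assumption: northern-hemisphere seasons (Dec-Feb = winter, and so on).
SEASONS = {0: "winter", 1: "spring", 2: "summer", 3: "autumn"}

def add_date_features(df: pd.DataFrame, start_col: str, end_col: str) -> pd.DataFrame:
    """Add parsed date parts, segments, and a date difference as new columns."""
    out = df.copy()
    start = out[start_col]
    out["start_month"] = start.dt.month
    out["start_day"] = start.dt.day
    out["start_year"] = start.dt.year
    out["start_quarter"] = start.dt.quarter
    out["start_season"] = ((start.dt.month % 12) // 3).map(SEASONS)
    out["days_between"] = (out[end_col] - start).dt.days
    return out

# Hypothetical signup and last-order dates.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-15", "2023-07-04"]),
    "last_order_date": pd.to_datetime(["2023-03-01", "2023-07-14"]),
})
dated = add_date_features(df, "signup_date", "last_order_date")
print(dated)
```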

Others

Gender

Assigning gender based on Mr or Mrs

Deriving median income per zip code from zip code and income
Parsing customer address to City, State, Zip
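
A sketch of the three examples above. It assumes names start with a title, addresses follow a "street, city, ST 12345" layout, and the Mr/Mrs mapping is taken at face value; all data and column names are invented.

```python
import pandas as pd

TITLE_TO_GENDER = {"Mr": "M", "Mrs": "F"}
# Assumption: US-style addresses ending in ", City, ST 12345".
ADDRESS_PATTERN = r",\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"

def engineer_customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive gender from title, split the address, and add median income per zip."""
    out = df.copy()
    title = out["name"].str.extract(r"^(Mrs|Mr)\.?\s")[0]
    out["gender"] = title.map(TITLE_TO_GENDER)  # stays missing for other titles
    out = out.join(out["address"].str.extract(ADDRESS_PATTERN))
    out["zip_median_income"] = out.groupby("zip")["income"].transform("median")
    return out

# Hypothetical customer records.
customers = pd.DataFrame({
    "name": ["Mr. John Doe", "Mrs. Jane Roe", "Dr. Sam Poe"],
    "address": ["12 Oak St, Austin, TX 78701",
                "9 Elm Ave, Austin, TX 78701",
                "5 Pine Rd, Denver, CO 80202"],
    "income": [50000.0, 70000.0, 90000.0],
})
enriched = engineer_customer_features(customers)
print(enriched[["gender", "city", "state", "zip", "zip_median_income"]])
```

Titles such as "Dr." carry no gender signal, so the derived column is left missing there instead of guessed.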

Feature design - reviewing the features in your dataset to determine what possible issues may exist or improvements that can be made

Architecting good features includes

Leveraging business acumen - you are the expert. Use it to your advantage
Expressing features in a way that ties to the target
Consider factors like

Should time factor into the feature?
Does rate of change matter?
Should a feature be normalized to account for differences across subsets of data?
Do null values mean something?
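
Three of the factors above (rate of change, normalizing within subsets, and meaningful nulls) are sketched below. The sales table and its column names are made up for illustration.

```python
import pandas as pd

def design_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a rate of change, a within-region normalization, and a null flag."""
    out = df.copy()
    # Rate of change between two periods.
    out["sales_change_rate"] = (out["sales_now"] - out["sales_prev"]) / out["sales_prev"]
    # Normalize against the region mean so large and small regions are comparable.
    region_mean = out.groupby("region")["sales_now"].transform("mean")
    out["sales_vs_region_mean"] = out["sales_now"] / region_mean
    # A missing discount code may itself be informative (no discount used).
    out["discount_code_missing"] = out["discount_code"].isna()
    return out

# Hypothetical store-level data.
stores = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales_prev": [100.0, 200.0, 1000.0, 2000.0],
    "sales_now": [110.0, 180.0, 1500.0, 2000.0],
    "discount_code": ["A", None, None, "B"],
})
designed = design_features(stores)
print(designed)
```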

Recognizing data leakage - when the data that you are using to train an ML algorithm includes the information you are trying to predict

Data leakage can lead to false assumptions
Data leakage can lead to a model performing better in training than in the real world
It can cause false assurance about how well the model actually performs
Prevent data leakage

Pay attention to time constraints included in your identified business questions
All data inside of the training set must be relevant to the time constraints set forth by the business questions
Types of data leakage - both result in the model performing more poorly in the real world than in training

One or more features in the training set include information that wouldn't be known at the time of prediction
One or more features in the training set can be used to derive the target variable you are trying to predict

If the AutoML score is above 85%, do additional analysis to check for data leakage

Ways to identify data leakage

High scores - if scores are really high, there might be leakage
Feature importance - if one feature is a lot more important than everything else, it may be leaking the target
Chronological holdout - if this score is drastically lower than the cross-validation score
Logic - will you have this information for each record at the time you want to make a prediction? Will the records be the same in 30 days?
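
The first three checks above can be turned into a simple screening function. Only the 85% score threshold comes from these notes; the dominance ratio and the allowed score gap are arbitrary assumptions, and the inputs are scores and importances you have already obtained from your own tooling.

```python
def leakage_warnings(cv_score: float, holdout_score: float,
                     importances: dict[str, float],
                     high_score: float = 0.85,       # threshold mentioned in the notes
                     dominance_ratio: float = 3.0,   # assumption
                     max_gap: float = 0.10) -> list[str]:  # assumption
    """Return human-readable warnings for common leakage symptoms."""
    warnings = []
    if cv_score > high_score:
        warnings.append("high score")
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] > dominance_ratio * ranked[1][1]:
        warnings.append(f"dominant feature: {ranked[0][0]}")
    if cv_score - holdout_score > max_gap:
        warnings.append("holdout gap")
    return warnings

# Hypothetical results from a churn model.
suspect = leakage_warnings(
    cv_score=0.97, holdout_score=0.70,
    importances={"days_since_cancel": 0.80, "tenure": 0.12, "plan": 0.08},
)
clean = leakage_warnings(
    cv_score=0.78, holdout_score=0.75,
    importances={"tenure": 0.40, "plan": 0.35, "usage": 0.25},
)
print(suspect, clean)
```

The fourth check, logic, still needs a person: a warning here only says where to look, not whether the feature is truly leaking.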

How to fix data leakage

If you identify a column that should not be used to train the model, exclude it from training but keep it in the dataset
Hold time constant on that feature so that it becomes a good feature (e.g. by fixing the aggregation)
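
A short sketch of the first fix: the leaky column is left out of the training features while the dataset itself is untouched. The column names are hypothetical.

```python
import pandas as pd

def split_features(df: pd.DataFrame, target: str, excluded: list[str]):
    """Return (features, target) without modifying the original dataset."""
    feature_cols = [c for c in df.columns if c != target and c not in excluded]
    return df[feature_cols], df[target]

# Hypothetical churn data; the cancellation reason is only known after churn.
df = pd.DataFrame({
    "tenure_months": [3, 24, 12],
    "cancellation_reason": ["price", None, None],
    "churned": [1, 0, 0],
})
X, y = split_features(df, "churned", excluded=["cancellation_reason"])
print(list(X.columns))
```

Keeping the column in the dataset means it remains available for analysis and reporting.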

Ways to prevent data leakage

Have well-defined business questions, using a learning framework with the following ingredients

Event trigger
Target (value & horizon)
Features
Prediction point
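
One possible way to write these ingredients down so they can be checked is sketched here. The class, its field names, and the idea of recording when each feature becomes known are illustrative assumptions, not a Qlik AutoML structure.

```python
from dataclasses import dataclass, field

@dataclass
class LearningFramework:
    """Hypothetical record of a business question's ingredients."""
    event_trigger: str           # what starts the clock for each record
    target: str                  # the value being predicted
    horizon_days: int            # how far after the trigger the target is measured
    prediction_point_days: int   # days after the trigger when the prediction is made
    # Feature name -> days after the trigger at which its value becomes known.
    features: dict[str, int] = field(default_factory=dict)

    def leaky_features(self) -> list[str]:
        """Features that would not yet be known at the prediction point."""
        return [name for name, known_at in self.features.items()
                if known_at > self.prediction_point_days]

# Hypothetical churn question.
framework = LearningFramework(
    event_trigger="customer signs up",
    target="churned",
    horizon_days=90,
    prediction_point_days=30,
    features={"plan_type": 0, "logins_first_30_days": 30, "cancellation_reason": 90},
)
print(framework.leaky_features())
```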

Amir Vastani https://www.brilliantassociatesinc.com
