Machine Learning: Data Analysis and Discovery

Excerpts from Qlik AutoML training, pulling out the key steps you should complete before running any Machine Learning (ML) analysis.

Data analysis and discovery

  • Explore the data

    • How many observations have you collected

      • Is this enough to predict on

      • Are there gaps or nulls in key data points

      • Do you need to reassess your data collections

    • Do you understand the distribution of your data

      • Do you have a normal distribution

      • Is it skewed?  Are there outliers?

      • What is the range, mean, and median

    • What is time variant vs non-time variant data

      • With time variant, is the data time stamped to be aggregated appropriately

      • Will it be available at the time of the prediction
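
The exploration questions above (how many observations, where the nulls are, what the range, mean, and median look like) can be sketched with the Python standard library. This is a minimal illustration, not part of the Qlik AutoML training: the `values` column is made-up sample data, and comparing mean to median is only a rough hint of skew.

```python
import statistics

# Hypothetical numeric column with one missing value and one high outlier.
values = [12.0, 15.0, None, 14.0, 13.0, 90.0, 16.0]

# Separate observed values from nulls so gaps are counted, not silently dropped.
observed = [v for v in values if v is not None]
null_count = len(values) - len(observed)

mean = statistics.mean(observed)
median = statistics.median(observed)
value_range = max(observed) - min(observed)

# A mean well above the median suggests right skew or high outliers.
skew_hint = "right-skewed" if mean > median else "left-skewed or symmetric"

print(f"rows={len(values)} nulls={null_count}")
print(f"range={value_range} mean={mean:.2f} median={median}")
print(skew_hint)
```

Here the single value of 90 pulls the mean far above the median, which is the kind of gap worth investigating before training.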

  • Review correlations - get early insights into data in order to refine hypothesis

    • Target correlation

      • Check whether stratification and/or correlations of the target exist across some of the features

    • Check for signals

      • Are the directional patterns in the feature-to-target relationships intuitive

    • Correlation matrix

      • Features that are highly correlated to one another may be redundant and a cause for noise, not an additional signal

      • Consider selecting a single feature from groups that appear to capture the same behaviors in the data

      • Or else determine if there is a single feature driving both
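
As a rough sketch of the correlation review above, the snippet below computes Pearson correlations by hand, lists feature pairs that look redundant, and reports each feature's correlation with the target. The column names, the sample values, and the 0.95 cutoff are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical dataset: two tenure columns that capture the same behavior.
columns = {
    "tenure_months": [1, 5, 12, 24, 36, 48],
    "tenure_years": [0.08, 0.42, 1.0, 2.0, 3.0, 4.0],
    "support_calls": [4, 1, 3, 0, 2, 1],
    "target": [1, 0, 1, 0, 0, 0],
}

REDUNDANCY_THRESHOLD = 0.95  # assumed cutoff, tune for your data

names = [name for name in columns if name != "target"]

# Feature pairs that are highly correlated with one another.
redundant_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(columns[a], columns[b])) >= REDUNDANCY_THRESHOLD
]

# Correlation of each feature with the target, to check for signal.
target_corr = {name: pearson(columns[name], columns["target"]) for name in names}

print(redundant_pairs)
print(target_corr)
```

A pair flagged here is a candidate for keeping only one of the two features. Note that Pearson correlation only captures linear relationships.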

  • Apply business knowledge

    • Is historical data indicative of today's operating environment

      • Have systems or data collection practices significantly changed in the collection window

    • How does your domain experience explain and validate data

      • If data doesn't align with your assumptions, it could mean there are data issues or the assumptions are off

    • What additional features need to be collected or engineered

  • Clean and generalize forms - try to build a model that generalizes when possible

    • Remove outliers - which could impede an algorithm's ability to discern general patterns in the data

      • Get rid of them

    • Address distribution oddities like skews, tails, multi-modal shapes in your data

      • May require additional data transformation or future feature design

      • One hint: group low-volume categories, and round or remove tails in numeric features

    • Replace null or missing values

      • With "Other" or "Unknown" when appropriate, in order to gain extra value from a sparse column

    • Address correlated features

      • By removing redundant features

      • By engineering new features to extract additional information
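
The cleaning hints above (filling missing values, grouping low-volume categories, and trimming numeric tails) might look like the following. This is a minimal sketch: the `regions` data, the `MIN_COUNT` threshold, and the clipping bounds are assumptions chosen for the example.

```python
from collections import Counter

MIN_COUNT = 2  # assumed threshold below which a category counts as low volume

def clean_categories(values, min_count=MIN_COUNT):
    """Replace missing values with "Unknown" and rare categories with "Other"."""
    filled = ["Unknown" if v is None or v == "" else v for v in values]
    counts = Counter(filled)
    return [
        v if v == "Unknown" or counts[v] >= min_count else "Other"
        for v in filled
    ]

def clip_tails(values, lower, upper):
    """Cap numeric values to the range [lower, upper] to round off extreme tails."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sparse categorical column and a numeric column with long tails.
regions = ["East", "West", "East", None, "North", "West", "Islands"]
cleaned_regions = clean_categories(regions)
clipped_amounts = clip_tails([3, 8, 250, -40, 12], lower=0, upper=100)

print(cleaned_regions)
print(clipped_amounts)
```

Clipping keeps the row while limiting the influence of the extreme value. Whether to clip or drop an outlier is a judgment call that depends on the business question.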

  • Feature engineering - the process of creating new features from current ones

    • To gain additional predictive power from source data collected to address a business question

    • Date feature engineering

      • Parse date into columns (MM,DD,YY)

      • Creating segments like seasons, quarters, semesters

      • Calculating date difference between 2 dates

    • Others

      • Gender

        • Assigning gender based on Mr or Mrs

      • Creating median income from zipcode and income

      • Parsing customer address to City, State, Zip
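
The date feature engineering ideas above can be sketched with the standard `datetime` module. The `signup` and `last_order` field names are hypothetical, and the season mapping assumes northern-hemisphere meteorological seasons.

```python
from datetime import date

def season(month):
    """Map a month number to a season (northern-hemisphere assumption)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"

def date_features(signup, last_order):
    """Derive several columns from two dates on a hypothetical customer record."""
    return {
        "month": signup.month,
        "day": signup.day,
        "year": signup.year,
        "quarter": (signup.month - 1) // 3 + 1,
        "season": season(signup.month),
        # Difference between the two dates, in whole days.
        "days_to_last_order": (last_order - signup).days,
    }

features = date_features(date(2023, 11, 15), date(2024, 2, 1))
print(features)
```

Each derived column gives the model a simpler signal than the raw date, for example a quarterly pattern it would otherwise have to infer.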

  • Feature design - reviewing the features in your dataset to determine what possible issues may exist or improvements that can be made

    • Architecting good features includes

      • Leveraging business acumen - you are the expert. Use it to your advantage

      • Expressing features in a way that ties to the target

      • Consider factors like

        • Should time factor into the feature

        • Does rate of change matter

        • Should a feature be normalized to account for differences across subsets of data

        • Do null values mean something
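
Three of the design factors above (rate of change, normalizing across subsets, and meaningful nulls) can be expressed as small helper functions. These helpers and their sample inputs are hypothetical, and the normalization assumes every group has a nonzero mean.

```python
def rate_of_change(series):
    """Period-over-period relative change; None where it cannot be computed."""
    changes = [None]  # the first period has no previous value
    for prev, curr in zip(series, series[1:]):
        changes.append(None if not prev else (curr - prev) / prev)
    return changes

def null_flag(values):
    """Indicator column: 1 where the value is missing, so nulls can carry signal."""
    return [1 if v is None else 0 for v in values]

def normalize_within_group(values, groups):
    """Divide each value by its group's mean so subsets become comparable."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0) + v
        counts[g] = counts.get(g, 0) + 1
    # Assumes each group mean is nonzero.
    return [v / (totals[g] / counts[g]) for v, g in zip(values, groups)]

monthly_spend_change = rate_of_change([100, 110, 99])
missing_phone = null_flag(["555-0100", None, "555-0199"])
relative_sales = normalize_within_group(
    [10, 30, 200, 600], ["small", "small", "large", "large"]
)

print(monthly_spend_change)
print(missing_phone)
print(relative_sales)
```

After normalization, a small store and a large store with the same relative performance get the same feature value, even though their raw sales differ by an order of magnitude.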

  • Recognizing data leakage - when the data that you are using to train an ML algorithm includes the information you are trying to predict

    • Data leakage can lead to false assumptions

    • Data leakage can lead to a model performing better in training than in the real world

    • Can cause false assurance of how well the model actually performs

    • Prevent data leakage

      • Pay attention to time constraints included in your identified business questions

      • All data inside of the training set must be relevant to the time constraints set forth by the business questions

      • Types of data leakage - both result in a model performing more poorly in the real world than in training

        • One or more features in the training set include information that wouldn't be known at the time of prediction

        • When one or more features in the training set can be used to derive the target variable you are trying to predict

      • If the AutoML score is > 85%, do additional analysis to check for data leakage

    • Ways to identify data leakage

      • High scores - if scores are really high, there might be leakage

      • Feature importance - if one feature is a lot more important than everything else

      • Chronological holdout - if this score is drastically lower than cross validation

      • Logic - will you have the information for records at the time you want to make a prediction.  Will the records be the same in 30 days

    • How to fix data leakage

      • If you identify a column that should not be used to train a model, then exclude the column from training but keep it in the dataset

      • Hold time constant on that feature so that it becomes a good feature (e.g. fixing aggregation)

    • Ways to prevent data leakage

      • Have well-defined business questions, with a learning framework that includes the following ingredients

        • Event trigger

        • Target (value & horizon)

        • Features

        • Prediction point
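
Two of the leakage checks above can be sketched in plain Python: a chronological holdout that trains only on the oldest records, and a filter that keeps only features available by the prediction point. The record fields, the feature names, and the "days after the event trigger" convention are assumptions for illustration, not Qlik AutoML behavior.

```python
from datetime import date

def chronological_split(rows, holdout_fraction=0.2):
    """Train on the oldest rows and hold out the most recent ones."""
    ordered = sorted(rows, key=lambda r: r["event_date"])
    holdout_size = max(1, round(len(ordered) * holdout_fraction))
    cut = len(ordered) - holdout_size
    return ordered[:cut], ordered[cut:]

def features_known_at_prediction(feature_available_day, prediction_day):
    """Keep features whose values exist on or before the prediction point."""
    return [
        name
        for name, day in feature_available_day.items()
        if day <= prediction_day
    ]

# Hypothetical records, deliberately out of date order.
rows = [
    {"id": 3, "event_date": date(2024, 3, 1)},
    {"id": 1, "event_date": date(2024, 1, 1)},
    {"id": 5, "event_date": date(2024, 5, 1)},
    {"id": 2, "event_date": date(2024, 2, 1)},
    {"id": 4, "event_date": date(2024, 4, 1)},
]
train, holdout = chronological_split(rows)

# Hypothetical features, with the day (after the event trigger) each becomes known.
available_day = {"plan_type": 0, "first_week_logins": 7, "cancellation_reason": 45}
safe_features = features_known_at_prediction(available_day, prediction_day=7)

print([r["id"] for r in train], [r["id"] for r in holdout])
print(safe_features)
```

In this example `cancellation_reason` is excluded because it is only known 45 days after the trigger, well past a day-7 prediction point. A feature like that would let the model see the outcome it is supposed to predict.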
