Why One-Hot Encode Data in Machine Learning?

Updated: 06/10/2021 by Computer Hope

Categorical data is information that is divided into groups. For example, if an organisation or agency collects biodata on its employees, the resulting data is categorical, because it can be grouped according to the variables present in the biodata, such as sex or state of residence.

    Some examples include:
  • A “pet” variable with the values: “dog” and “cat”.
  • A “color” variable with the values: “red”, “green” and “blue”.
  • A “place” variable with the values: “first”, “second” and “third”.

    How to Convert Categorical Data to Numerical Data?

    This involves two steps:
  • Integer Encoding
  • One-Hot Encoding
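
    The first step, integer encoding, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper name integer_encode and the choice to number categories from 0 in order of first appearance are assumptions made here, not something the article prescribes.

```python
def integer_encode(values):
    """Map each distinct category to an integer, in order of first appearance.

    Hypothetical helper for illustration; numbering from 0 is an assumption.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused integer is simply the current mapping size.
            mapping[value] = len(mapping)
    codes = [mapping[value] for value in values]
    return codes, mapping


colors = ["red", "green", "blue", "green"]
codes, mapping = integer_encode(colors)
print(codes)    # [0, 1, 2, 1]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
```

    Note that the integers imply an ordering (blue > green > red) that the colors do not really have, which is one reason the second step, one-hot encoding, is often applied.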


    One-Hot Encoding

    For example, consider data where bikes, their corresponding categorical values, and their prices are given:

    Bike      Categorical value of Bike    Price
    ktm       1                            100
    ninza     3                            200
    suzuki    4                            300

    The output after one-hot encoding the data is as follows:

    ktm    ninza    suzuki    price
    1      0        0         100
    0      1        0         200
    0      0        1         300
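
    The bike table can be reproduced with a short plain-Python sketch. The helper name one_hot_encode is hypothetical, and the column order (order of first appearance) is an assumption chosen to match the table; library encoders may order the columns differently.

```python
def one_hot_encode(values):
    """Return (categories, rows), with one 0/1 column per distinct category.

    Hypothetical helper for illustration; columns follow first-appearance order.
    """
    # dict.fromkeys drops duplicates while keeping first-appearance order.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


bikes = ["ktm", "ninza", "suzuki"]
prices = [100, 200, 300]

categories, rows = one_hot_encode(bikes)
# Append the numeric price column unchanged to each encoded row.
table = [row + [price] for row, price in zip(rows, prices)]

print(categories + ["price"])  # ['ktm', 'ninza', 'suzuki', 'price']
for row in table:
    print(row)
# [1, 0, 0, 100]
# [0, 1, 0, 200]
# [0, 0, 1, 300]
```

    Each row contains exactly one 1 among the bike columns, marking which bike that row describes; the price column is carried through as it is.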

    Finally, you’re playing with ML models and you encounter this “one hot encoding” term all over the place. You see the sklearn documentation for the one hot encoder and it says “Encode categorical integer features using a one-hot aka one-of-K scheme.” It’s not all that clear, right? Or at least it was not for me.