Machine Learning – The Primer – Part 2

 

To recap: in my previous post, I covered the basic terms in machine learning and the process used to implement it. In this post, we continue with the steps in the machine learning process and look at how it can be enhanced and made more efficient.

Staying on top of domain knowledge in your industry or process is hugely significant, because it helps you filter the right data set to consider for machine learning. The core of this discussion is data and how it is structured in your organization.

To understand the data present in an organization and its structure, we pose the three questions below:

1. Is Your Data Tabular?

Traditional machine learning techniques were designed for tabular data, which is organized into independent rows and columns. In tabular data, each row represents a discrete piece of information (e.g., an employee’s address).

There are ways to transform tabular data to work with deep learning models, but this may not be the best option to start off with.

Tabular data can be numeric or categorical (though eventually the categorical data would be converted to numeric).
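To make the categorical-to-numeric conversion concrete, here is a minimal sketch of one common approach, one-hot encoding, in plain Python. The `one_hot_encode` helper and the "department" column are hypothetical and used only for illustration; in practice you would likely rely on a library for this step.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        encoded.append(row)
    return categories, encoded


# Hypothetical "department" column from an employee table.
departments = ["sales", "hr", "sales", "it"]
categories, encoded = one_hot_encode(departments)
print(categories)  # ['hr', 'it', 'sales']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each categorical value becomes a row of numbers that a traditional machine learning algorithm can consume alongside the numeric columns.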

2. If You Have Non-Tabular Data, What Type Is It?

Images and Video: Deep learning is more common for image and video classification problems. Convolutional neural networks are designed to extract features from images that often result in state-of-the-art classification accuracies – making it possible to discern high-level differences such as cat vs. dog.

Sensor and Signal: The traditional approach is to extract features from signals and then use these features with a machine learning algorithm. More recently, signals have been passed directly to LSTM (Long Short-Term Memory) networks, or converted to images (for example by calculating the signal’s spectrogram), and then that image is used with a convolutional neural network. Wavelets provide yet another way to extract features from signals.
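As a rough illustration of the feature-extraction route, the sketch below computes three simple hand-crafted features from a list of samples. The choice of features (mean, RMS, and zero crossings) and the `signal_features` helper are assumptions made for this example; real projects typically use richer features such as spectral or wavelet coefficients.

```python
import math


def signal_features(samples):
    """Summarize a signal as a small dictionary of numeric features."""
    n = len(samples)
    mean = sum(samples) / n
    # Root mean square: a simple measure of signal energy.
    rms = math.sqrt(sum(x * x for x in samples) / n)
    # Count sign changes between consecutive samples.
    zero_crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return {"mean": mean, "rms": rms, "zero_crossings": zero_crossings}


# Toy signal that alternates between +1 and -1.
signal = [1.0, -1.0, 1.0, -1.0]
features = signal_features(signal)
print(features)
```

The resulting feature values form one row of tabular data, which is what allows a traditional machine learning algorithm to work with signals.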

Text:  Text can be converted to a numerical representation via bag-of-words models and normalization techniques and then used with traditional machine learning techniques such as support vector machines or naïve Bayes. Newer techniques use text with recurrent or convolutional neural network architectures. In these cases, text is often transformed into a numeric representation using a word-embedding model such as word2vec.
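The bag-of-words idea can be sketched in a few lines of standard-library Python. The tokenizer below (lowercasing and splitting on non-alphanumeric characters) is a simplifying assumption, and `tokenize` and `bag_of_words` are illustrative names rather than functions from any particular library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # A Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["The cat sat.", "The dog sat on the cat."]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Word order is discarded; each document becomes a numeric vector that a support vector machine or naïve Bayes classifier can use directly.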

3. Is Your Data Labeled?

To train a supervised model, whether for machine learning or deep learning, you need labeled data.

If You Have No Labeled Data

Focus on machine learning techniques (in particular, unsupervised learning techniques). Labeling for deep learning can mean annotating objects in an image, or each pixel of an image or video, for semantic segmentation. The process of creating these labels, often referred to as “ground-truth labeling,” can be prohibitively time-consuming.
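As one example of an unsupervised technique, here is a minimal sketch of k-means clustering on one-dimensional data. k-means is my choice of illustration, not something the discussion above prescribes, and the deterministic initialization in `kmeans_1d` is a simplification; library implementations use more careful starting points.

```python
def kmeans_1d(values, k, iterations=10):
    """Group unlabeled numbers into k clusters around learned centers."""
    ordered = sorted(values)
    # Simplified start: spread the initial centers evenly over the sorted values.
    centers = [ordered[(i * (len(ordered) - 1)) // max(1, k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda c: abs(value - centers[c]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled readings that visibly fall into two groups.
readings = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(readings, k=2)
print(centers)
print(clusters)
```

No labels are supplied at any point; the grouping comes entirely from the structure of the data.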

If You Have Some Labeled Data

Use transfer learning. Because it focuses on training a smaller number of parameters in the deep neural network, it requires a smaller amount of labeled data.

Another approach for dealing with small amounts of labeled data is to augment that data. For example, it is common with image datasets to augment the training data with various transformations on the labeled images (such as reflection, rotation, scaling, and translation).
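A minimal sketch of this kind of augmentation is shown below, treating an image as a list of pixel rows. The helper names are hypothetical, the translation assumes a non-negative shift, and scaling is left out to keep the example short; deep learning frameworks provide their own, far more complete augmentation utilities.

```python
def reflect_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in image]


def augment(image, label):
    """Return the original plus transformed copies, all sharing one label."""
    variants = [
        image,
        reflect_horizontal(image),
        rotate_90_clockwise(image),
        translate_right(image, 1),
    ]
    return [(variant, label) for variant in variants]


# Tiny 2x2 "image" with a hypothetical label.
image = [[1, 2], [3, 4]]
augmented = augment(image, "cat")
for variant, label in augmented:
    print(label, variant)
```

One labeled image becomes four training examples, and the label carries over because these transformations do not change what the image shows.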

If You Have Lots of Labeled Data

With plenty of labeled data, both machine learning and deep learning are available. The more labeled data you have, the more likely it is that deep learning techniques will be more accurate. A typical example below illustrates the approach when you have lots of labeled data.

The first step in initiating any machine learning project is to identify the different steps or tasks that make up a business process. While one task alone might be more suited to machine learning, your full application might involve multiple steps that, when taken together, are better suited to deep learning. If you have a large data set, deep learning techniques can produce more accurate results than machine learning techniques, because deep learning uses more complex models with more parameters that can be more closely “fit” to the data.

Some areas are more suited to machine learning or deep learning techniques. Here we present 6 common tasks:

For each of the above tasks, we look at related examples, applications, the inputs required, the common algorithms applied, and whether it is more often approached through machine learning or deep learning.

So, how much data is a “large” dataset? It depends. Some popular image classification networks available for transfer learning were trained on a dataset consisting of 1.2 million images from 1000 different categories. If you want to use machine learning and are laser-focused on accuracy, be careful not to overfit your data.

Overfitting happens when your algorithm is fit too closely to your training data and then cannot generalize to a wider data set. The model can’t properly handle new data that doesn’t fit its narrow expectations. To avoid this, the training data needs to be representative of your real-world data, and you need to have enough of it. Once your model is trained, use test data to check that your model is performing well; the test data should be completely new data.
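The sketch below illustrates the idea of holding out test data, using only the Python standard library. The "memorizer" model is a deliberately extreme, hypothetical example of overfitting: it stores every training row and falls back to a default label for anything it has not seen. The helper names and the toy data are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as unseen test data."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_memorizer(train_rows, default_label):
    """An overfit 'model': a lookup table of the exact training examples."""
    table = {features: label for features, label in train_rows}
    return lambda features: table.get(features, default_label)


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


# Toy data: the label is "high" when the single feature is 10 or more.
rows = [(x, "high" if x >= 10 else "low") for x in range(20)]
train_rows, test_rows = train_test_split(rows, test_fraction=0.25, seed=0)

model = fit_memorizer(train_rows, default_label="low")
train_accuracy = accuracy(model, train_rows)
test_accuracy = accuracy(model, test_rows)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The memorizer is always perfect on the rows it was trained on, so training accuracy alone says nothing about how it handles new data. Only the held-out test rows reveal whether a model has generalized, and a large gap between the two numbers is a typical warning sign of overfitting.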

We have now seen the amount and format of data that is required, and how it has to be structured, for machine learning or deep learning to be implemented. Stay tuned: in Part 3 of this foray, we will look into the details of the tools, techniques, and hardware required to support the machine learning process as we go deeper into the learning portions of this AI foray we have embarked upon.

Please feel free to review my earlier series of posts on AI-ML Past, Present and Future – distributed across 8 blogs.

Authored by Venugopala Krishna Kotipalli
