Wednesday, December 23, 2020

Supervised Learning Introduction


In the last post, I discussed the distinction between Artificial Intelligence (AI) and Machine Learning (ML). I also wrote about the main categories of ML training: Reinforcement, Unsupervised, and Supervised. In this post, we'll take a deeper dive into Supervised Learning. It is the most commonly used training approach, so there is plenty of documentation and it's well supported in most ML frameworks. Some uses of supervised learning are recommendation engines for streaming services, image classification for a myriad of use cases including medical diagnostics and facial recognition, and dynamic pricing for ride-sharing services. In our deeper dive, let's review what comprises Supervised Learning, the data we'll feed into it, and how we can measure its ability to produce the desired results.

What are we going to build?

 
In the software development that I do daily, I combine data (the input) and instructions (the rules) to produce a result. For example, I might write a method for validating information about a user. The input would be things like the user's name, address, and phone number. The instructions would be the rules involved in validating that data, like names not being null or empty, phone numbers matching some regular expression, and so on.
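As a concrete (and purely hypothetical) illustration, here's roughly what that traditional rules-based approach looks like in Python. The field names and the phone pattern are placeholders of my own, not a real API:

```python
import re

# A loose phone pattern of the form 555-867-5309; purely illustrative.
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def validate_user(name: str, address: str, phone: str) -> bool:
    """Classic programming: input + hand-written rules -> result."""
    if not name or not name.strip():        # rule: name must not be null/empty
        return False
    if not address or not address.strip():  # rule: address must not be null/empty
        return False
    if not PHONE_PATTERN.match(phone):      # rule: phone must match our format
        return False
    return True

print(validate_user("Ada Lovelace", "12 Analytical St", "555-867-5309"))  # True
print(validate_user("", "12 Analytical St", "555-867-5309"))              # False
```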



 


However, Machine Learning takes a different approach. Let's say my boss has a crippling fear of robots and I've been assigned a project to identify robots in photos of everyone entering our building. After all the preliminaries of any good software project, I start by writing a proof of concept. I grab a few images of standard robots:

[Images: Robby the Robot from Forbidden Planet and Gort from The Day the Earth Stood Still (1951)]

 

Looking at these pictures, I might assume robots are grayscale, so I could convert my images to an array of RGB vectors, e.g. [0xF0, 0x00, 0x28]. I could then come up with a rule that rejects any image containing a vector whose values are not all the same: [0xAA, 0xAA, 0xAA] is okay, but if the array contains the vector [0xAA, 0xFA, 0x0A], I know this is a color pixel and I should reject the image as not being one of a robot. My unit test would then fail on any color image of a robot, and I would discard this rule. Next, I might notice that robots seem to follow a human form; perhaps I could write an edge detection method to see if there are pixels that follow a human outline.
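To make that grayscale rule concrete, here's a rough sketch of the "every pixel must have R == G == B" check. The library choice (Pillow and NumPy) is mine, not from the original project:

```python
import numpy as np
from PIL import Image

def looks_grayscale(path: str) -> bool:
    """Hand-written rule: accept only if every pixel has R == G == B."""
    pixels = np.asarray(Image.open(path).convert("RGB"))  # shape (H, W, 3)
    r, g, b = pixels[..., 0], pixels[..., 1], pixels[..., 2]
    return bool(np.all((r == g) & (g == b)))

# A color photo of a robot would fail this rule, which is exactly the problem.
```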



[Image: Gort from The Day the Earth Stood Still compared to a stick-figure outline]

We can see so many cases in which this fails. Even if the robot does have some anatomical similarity to a human, there is a fair chance my edge detection algorithm would fail to recognize it. And of course we know that robots can take forms and shapes that in no way resemble a human. It's pretty obvious that trying to hand-code a set of rules for robot image detection would be futile.


    

[Image: Boston Dynamics' 'Spot' robot, which does not match a stick-figure outline]


This is where we need Machine Learning. What we'll do is feed data and results into our algorithm, or model, and have it output the rules. The data is going to be all the images of robots we can find, as well as images of other things that aren't robots (ideally images that are similar to robots, such as people). We will need to label these images as either robot or non-robot. So the images are our data and the labels are the results. The output is the trained model, i.e. the rules. I think this analogy will become clearer as we explore single-layer networks. The most important part at this level is that we have created a model that will hopefully be able to identify pictures of robots, and we were able to do it without trying to come up with some impossibly complex combination of IF/ELSE statements.
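Here's a minimal sketch of that inversion using scikit-learn. The feature vectors and labels are made up for illustration; a real robot detector would learn from raw pixels with something like a convolutional network, which is well beyond this post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, tiny dataset: each row is a feature vector extracted from an image.
X = np.array([[0.9, 0.1, 0.8],   # robot-ish
              [0.8, 0.2, 0.9],   # robot-ish
              [0.1, 0.9, 0.2],   # person-ish
              [0.2, 0.8, 0.1]])  # person-ish
y = np.array([1, 1, 0, 0])       # labels: 1 = robot, 0 = not a robot

# Data + labels go in; the "rules" come out as learned model parameters.
model = LogisticRegression().fit(X, y)
print(model.predict([[0.85, 0.15, 0.75]]))  # hopefully [1], i.e. "robot"
```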
 

What are we going to feed what we built?

 
The amount of data that exists today is unprecedented, and this has largely contributed to the explosion in machine learning. In Supervised Learning our data can take many different forms. In our example we used still images, but it could be streaming video, text, handwritten text, medical scans, a CSV file, or audio. These are just a few of the many data formats to which SL may be applied. In later posts I'll discuss data in more depth; right now I'd like to focus on the common aspects of our data: it must have Features, it must have Labels, and there should be a lot of it.

Features are what make a single data sample distinctive. The easiest way to think about this is a spreadsheet of real estate data. The features would be the columns, like Total Square Feet, Number of Bedrooms, Lot Size, etc. In other types of data, a feature may be the RGB value of a pixel or a sequence of words. The label is the result that comes from the data. If we were using our real estate data to predict the price for which a property will sell, the label would be the Selling Price column. If we are creating a model that sorts pictures based on whether they are images of cats or dogs, each image we train our model with will have an accompanying label of Cat or Dog.

The other high-level aspect of data in SL is that more is definitely better. One of the leading factors in making models that predict accurately is the amount of data we use in the training process. Of course, this comes at a cost in processing time, read and write latencies, and storage. In spite of the increased training times, most ML engineers would still want as much data as possible. Data for SL is a big topic, and I will have many more posts on subjects such as data augmentation, feature reduction, and much more.
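To make features and labels concrete with the real estate example, here's a tiny sketch in pandas. The column names and values are invented for illustration:

```python
import pandas as pd

# A toy slice of the kind of spreadsheet described above.
homes = pd.DataFrame({
    "total_sq_ft":   [1500, 2200, 1800],
    "bedrooms":      [3, 4, 3],
    "lot_size":      [0.25, 0.40, 0.30],
    "selling_price": [310_000, 455_000, 365_000],
})

X = homes.drop(columns=["selling_price"])  # features: what describes each sample
y = homes["selling_price"]                 # label: the result we want to predict
```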
 

How do we know what we built works?

We need to have a way of knowing that the model we built actually works. As in any good software project, we should have this defined, with stakeholder agreement, prior to any coding activities. The easiest metric that can be applied to an ML project is "does it do the task as well as a human?" That can be an easy metric to measure if our task is identifying unauthorized network intrusions by searching server logs: we could look at how many intrusions the model correctly identified and compare that with how network engineers performed. It gets a bit more complicated when we ask "did our Natural Language Processor write an earnings report?" and then try to objectively compare that to one written by a person.

In most cases, we will be able to measure accuracy. We will reserve a portion of our labeled data to test how accurately our model can predict a result. In the case of the real estate model, we can feed the test data into our trained model and compare the output (the predicted price) with the actual price for which the property sold. It is important to keep in mind that overall accuracy does not give us the entire view of the performance of our model. If we have a model that predicts whether a tumor is malignant based on scan data, not only do we need to know our overall accuracy but, when it is wrong, we need to know how it's wrong. Let's say the model makes accurate predictions 90% of the time. We might be inclined to say that the model is performing well, especially when we find out its human counterparts are only able to make correct diagnoses 87% of the time. But on closer inspection we find that 95% of the wrong predictions are false negatives (the tumor was malignant but the model said it was not), while most of the wrong diagnoses made by doctors come from an overabundance of caution, and 90% of those are false positives. The false positives may have financial costs associated with them, but they aren't typically life threatening.
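Here's a small illustration of why overall accuracy hides that distinction. The twenty "scans" below are invented to mirror the tumor example, not real results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# 1 = malignant, 0 = benign. Twenty hypothetical scans, invented for illustration.
y_true = [1, 1, 1, 1, 1] + [0] * 15
y_pred = [1, 1, 1, 0, 0] + [0] * 15   # two malignant tumors missed

print(accuracy_score(y_true, y_pred))                    # 0.9 -- looks great on its own
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives: {fn}, false positives: {fp}")   # every error here is a miss
```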

In order to see how well our model performs, we separate our dataset into at least two parts: one will be our training set and the other the test set. The training set is the data that we feed to the model during the training phase. At the end of each training run we can see our performance by looking at the accuracy, the confusion matrix (a matrix that shows false negatives, true negatives, false positives, and true positives), and the F score (a measurement that combines precision and recall). I'll be discussing the latter two in a later post, but for now, with this information, I will make changes to, or tune, our model to improve its performance. It is critical that we never use our separate test dataset to train the model. When we are happy with our performance, we can then test the model by feeding it the test set. Fingers crossed, we meet our performance goals against the test set, we're done, and we can prepare to deploy. This happy path is highly unlikely; more probable is that we will see significant differences in performance between the two datasets. This is most likely due to overfitting (another term into which we'll take a deep dive). After seeing these discrepancies, we return to the tuning phase. This is an iterative process, but we will always keep our datasets separate.
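In scikit-learn terms, that train/test separation and evaluation loop looks roughly like the sketch below. I've swapped in a synthetic classification dataset so it runs on its own; the real project would supply its own features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our labeled dataset (features X, labels y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold the test set out; it is never used for training or tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test set only once we're happy with training performance.
preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("confusion matrix:\n", confusion_matrix(y_test, preds))
print("F score:", f1_score(y_test, preds))
```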

Summary

There is a lot to unpack here. This blog is about going into great detail with accompanying code examples. What we looked at today was a general idea of Supervised Learning and some associated terms. We saw that while data can come in many forms, the amount of data is key to our model's ability to draw correct conclusions. And lastly, we examined the subject of performance measurement and saw that measurement is often nuanced and must be tailored to each project. There is still so much to cover, and we're going to have a lot of fun as we dive deeper!

 
 

 

 
 

