Wednesday, December 23, 2020

Supervised Learning Introduction

In the last post, I discussed the distinction between Artificial Intelligence (AI) and Machine Learning (ML). I also wrote about the main categories of ML training: Reinforcement, Unsupervised, and Supervised. In this post, we'll take a deeper dive into Supervised Learning. It is the most commonly used training approach, so there is a wealth of documentation and it is well supported in most ML frameworks. Some uses of supervised learning include recommendation engines for streaming services, image classification for a myriad of use cases including medical diagnostics and facial recognition, and dynamic pricing for ride-sharing services. In our deeper dive, let's review what comprises Supervised Learning, the data we'll feed into it, and how we can measure its ability to produce the desired results.

What are we going to build?

In the software development that I do daily, I combine data (input) and instructions (rules) to produce a result. For example, I might be writing a method for validating information about a user. The input would be things like the user name, address, and phone number. The instructions would be the rules involved in validating that data, like names not being null or empty and phone numbers matching some regular expression, etc.
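As a sketch of that traditional rules-based approach, here is what such a validation method might look like. The field names and phone-number pattern are hypothetical examples, not from any real system:

```python
import re

# Hypothetical phone format for illustration: 555-555-5555.
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def validate_user(name, address, phone):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    if not name or not name.strip():
        errors.append("name must not be null or empty")
    if not address or not address.strip():
        errors.append("address must not be null or empty")
    if not phone or not PHONE_PATTERN.match(phone):
        errors.append("phone must match the pattern 555-555-5555")
    return errors
```

The rules are explicit and hand-written; the program is the rules.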


However, Machine Learning takes a different approach. Let's say my boss has a crippling fear of robots and I've been assigned a project to identify robots using photos of everyone entering our building. After all the preliminaries of any good software project, I start by writing a proof of concept. I grab a few images of standard robots:

[Images: Robby the Robot from Forbidden Planet and Gort from The Day the Earth Stood Still (1951)]


Looking at these pictures, I might assume robots are grayscale, so I could convert my images to an array of RGB vectors, e.g. [0xF0, 0x00, 0x28]. I could then come up with a rule that rejects any image containing a vector whose values are not all the same: [0xAA, 0xAA, 0xAA] is okay, but if the array contained the vector [0xAA, 0xFA, 0x0A], I would know this is a color pixel and should reject the image as not being one of a robot. My unit test would then fail on any color image of a robot, and I would discard this rule. Next, I might notice that robots seem to follow a human form; perhaps I could write an edge detection method to see if there are pixels that follow a human form.
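The grayscale rule above can be sketched in a few lines. This is the naive hand-written rule from the paragraph, shown only to make its brittleness concrete:

```python
def is_grayscale(pixels):
    """Accept an image only if every RGB vector has equal channel values.

    `pixels` is a flat list of (r, g, b) tuples; a grayscale pixel has
    r == g == b. Any color photo of a robot defeats this rule.
    """
    return all(r == g == b for (r, g, b) in pixels)

# [0xAA, 0xAA, 0xAA] passes; [0xAA, 0xFA, 0x0A] is a color pixel.
```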

[Image: Gort's silhouette compared to a stick figure]

We can see so many cases in which this fails. Even if the robot does have some anatomical similarity to a human, there is a fair chance my edge detection algorithm would fail to recognize it. And of course, we know that many robots follow forms and shapes that in no way resemble a human. It's pretty obvious that trying to code rules that would allow for robot image detection would be futile.


[Image: Boston Dynamics' Spot robot, which does not match a stick figure]

This is where we need Machine Learning. What we'll do is feed data and results into our algorithm, or model, to output the rules. The data is going to be all the images of robots that we can find, as well as images of things that aren't robots (perhaps images that are similar to robots, such as people). We will need to label these images as either robot or non-robot. So the images are our data and the labels are the results. The output is the trained model, i.e. the rules. I think this analogy will become clearer as we explore single-layer networks. The most important part at this level is that we have created a model that hopefully will be able to identify pictures of robots, and we were able to do this without trying to come up with some impossibly complex combination of IF ELSE statements.

What are we going to feed what we built?

The amount of data that exists today is unprecedented, and this has largely contributed to the explosion in machine learning. In Supervised Learning our data can take many different forms. In our example we used still images, but it could be streaming video, text, handwritten text, medical scans, a CSV file, or audio. These are just examples of the many data formats to which Supervised Learning may be applied. In later posts I'll discuss data in more depth; right now I'd like to focus on its common aspects. The data must have Features and Labels, and there should be a lot of it. Features are what make each data sample distinctive. The easiest way to think about this is a spreadsheet of real estate data: the features would be columns like Total Square Feet, Number of Bedrooms, Lot Size, etc. In other types of data, a feature may be the RGB value of a pixel or a sequence of words. The label is the result that comes from the data. If we were using our real estate data to predict the price for which a property will sell, the label would be the Selling Price column. If we are creating a model that sorts pictures based on whether they are images of cats or dogs, each image we train our model with will have an accompanying label of Cat or Dog.

The other high-level aspect of data in Supervised Learning is that more is definitely better. One of the leading factors in making models that predict accurately is the amount of data we use in the training process. Of course, this comes at a cost in terms of processing time, read and write latencies, and storage. In spite of the increased training times, most ML engineers would still want as much data as possible. Data for Supervised Learning is a big topic, and I will have many more posts on subjects such as data augmentation, feature reduction, and much more.
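To make the feature/label split concrete, here is a tiny sketch of the real estate spreadsheet as Python data. The column values are made up for illustration:

```python
# Hypothetical real estate rows: features are the descriptive columns,
# the label is the value we want to predict (Selling Price).
rows = [
    # (total_sq_ft, bedrooms, lot_size_acres, selling_price)
    (1500, 3, 0.25, 250_000),
    (2200, 4, 0.40, 340_000),
    (1100, 2, 0.15, 180_000),
]

features = [row[:-1] for row in rows]   # what describes each sample
labels   = [row[-1]  for row in rows]   # the result we try to predict
```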

How do we know what we built works?

We need to have a way of knowing that the model we built actually works. As in any good software project, we should have this defined, with stakeholder agreement, prior to any coding activities. The easiest metric that can be applied to an ML project is "Does it do the task as well as a human?" That can be an easy metric to measure if our task is identifying unauthorized network intrusions by searching server logs: we could look at how many intrusions the model correctly identified and compare that with how network engineers performed. It gets more complicated when we ask "Did our Natural Language Processor write an earnings report?" and then try to objectively compare that report to one written by a person.

In most cases, we will be able to measure accuracy. We will reserve a portion of our labeled data to test how accurately our model can predict a result. In the case of the real estate model, we can feed the test data into our trained model and compare the output (the predicted price) with the actual price for which the property sold. It is important to keep in mind that overall accuracy does not give us the entire view of the performance of our model. If we have a model that predicts whether a tumor is malignant based on scan data, we need to know not only our overall accuracy but also, when the model is wrong, how it is wrong. Say the model makes accurate predictions 90% of the time. We might be inclined to say that the model is performing well, especially when we find out its human counterparts are only able to make correct diagnoses 87% of the time. But on closer inspection we find that 95% of the wrong predictions are false negatives (the tumor was malignant but the model said it was not), while most of the wrong diagnoses made by doctors come from an overabundance of caution and 90% are false positives. The false positives may have financial costs associated with them, but they aren't typically life threatening.
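The distinction above is easy to compute. Here is a minimal sketch that tallies the four confusion-matrix cells for a binary classifier (1 = malignant, 0 = benign in the tumor example); the sample arrays are invented for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)
```

Two models can share the same accuracy while erring in opposite directions, one producing mostly false negatives and the other mostly false positives, which is exactly why accuracy alone is not enough.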

In order to see how well our model performs, we separate our dataset into at least two parts: a training set and a test set. The training set is the data that we feed to the model during the training phase. At the end of each training run we can see our performance by looking at the accuracy, the confusion matrix (a matrix that shows false negatives, true negatives, false positives, and true positives), and the F score (the harmonic mean of precision and recall). I'll discuss the latter two in a later post, but for now, with this information, I will be making changes to, or tuning, our model to improve its performance. It is critical that we never use our separate test dataset to train the model. When we are happy with the model's performance, we can then test it by feeding it the test set. Fingers crossed, we meet our performance goals against the test set, we're done, and we can prepare to deploy. This happy path is highly unlikely; more probably we will see significant differences in performance between the two datasets. This is most likely overfitting (another term into which we'll take a deep dive). After seeing these discrepancies, we return to the tuning phase. This is an iterative process, but we will always keep our datasets separate.
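A hold-out split like the one described above can be sketched in plain Python (libraries such as scikit-learn provide a ready-made version, but the idea is just shuffle-and-cut). The 80/20 fraction and fixed seed here are illustrative choices:

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a dataset and split it; the test portion is held out
    and must never be used during training."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = train_test_split(list(range(100)))
```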


There is a lot to unpack here. This blog is about going into great detail with accompanying code examples, but what we looked at today was a general idea of Supervised Learning and some associated terms. We saw that while data can come in many forms, the amount of data is key to our model's ability to draw correct conclusions. And lastly, we examined the subject of performance measurement and saw that it is often nuanced and must be tailored to each project. There is still so much to cover, and we're going to have a lot of fun as we dive deeper!




Monday, November 16, 2020

Machine Learning Introduction

When I start a new subject, I usually begin at a high level before getting into the nitty-gritty. I know that as Software Engineers we are often eager to dive under the hood and get straight to coding, but we'll start with a bird's eye view. Let's begin by defining what Machine Learning (ML) is. First, let's clear up the confusion surrounding the interchangeable use of the terms Artificial Intelligence and Machine Learning, then look at the drivers behind the rise of ML implementation and the types of Machine Learning.

The terms Artificial Intelligence (AI) and Machine Learning are thrown around as if they were the same thing. There really isn't a formal definition of AI, or at least a consensus on one. Most data scientists, ML engineers, and academics will agree that ML is a subset of AI. Thus, all ML is AI but not all AI is ML. The big thing Machine Learning algorithms lack is Artificial General Intelligence (AGI). AGI is the idea that a computer could read a stack of high school textbooks and, without knowing anything about tests or essay questions, apply that knowledge to pass the SAT. There are several theoretical tests, like the Turing Test (if you're not familiar with Alan Turing, please stop everything you're doing and read about this great mind). Turing posed a thought experiment in which a computer and a person are each behind a curtain. If someone conversing with them is unable to discern the computer from the person, then that computer has attained AGI. Steve Wozniak said that if a machine could go into an average home and figure out how to make a cup of coffee (find the machine and the coffee, add water, locate a cup, and work the coffee machine), it could be defined as AGI. From a human point of view, these tasks aren't particularly difficult; from a software engineering perspective, they are impossibly complex. Currently, there are no machines capable of passing these, or any similar, tests. Now that we know the difference between AI and ML, let's dive deeper into Machine Learning.

Machine Learning is a set of learning algorithms that can be further divided into three subsets: Reinforcement, Supervised, and Unsupervised Learning.

Reinforcement Learning takes signals as input, chooses an action based on those signals, and then tries to improve the choice of action based on the reward or punishment that results from it. Imagine a rat running a maze; a signal is seeing a place to go right or left. Go right, get a treat; go left, get a shock. The next time the rat is presented with a right/left decision, it will use prior experience to go right. OpenAI has some great resources on Reinforcement Learning that even include video games on which your model can be trained. Reinforcement Learning was used by Google's DeepMind to create the AlphaGo model that beat a top Go master, and there is a great article in Nature documenting the development of this game-playing AI.
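The rat-in-a-maze idea can be sketched as a toy value update: the agent keeps a value estimate per action and nudges it toward each reward it receives. The 0.5 learning rate and the +1/-1 rewards are illustrative, not tuned, and real RL algorithms are considerably richer than this:

```python
values = {"left": 0.0, "right": 0.0}  # estimated value of each action

def update(action, reward, lr=0.5):
    """Move the action's value estimate toward the observed reward."""
    values[action] += lr * (reward - values[action])

def best_action():
    return max(values, key=values.get)

# The rat explores: right earns a treat (+1), left earns a shock (-1).
for _ in range(10):
    update("right", 1.0)
    update("left", -1.0)
```

After a few trials, `best_action()` prefers the rewarded turn.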

Supervised Learning is probably the most common of all the ML types. It involves the use of labeled data. For example, suppose we want to predict the price for which a house will sell, and we have a spreadsheet of housing data with columns named "Total Square Feet," "Number of Bedrooms," "Number of Bathrooms," and "Selling Price." The first three columns are referred to as Features and the last as the Label. Provided that the spreadsheet is sufficiently large, we can use it to train one of many ML algorithms to predict the selling price of a house not on the spreadsheet. The "supervised" name comes from the idea that we have a large set of examples for which we already know the outcome we are trying to predict, and we use that labeled data to train an algorithm. There are so many different applications of Supervised Learning that it has become the most used subset of ML. Our labeled data could be a collection of pictures of cats and dogs: the pictures are really m × n matrices of pixel data, which form the feature set, and the label would be the corresponding Cat or Dog identifier. Supervised Learning will be the main focus of upcoming posts.
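As a minimal supervised example, here is an ordinary-least-squares fit of a line predicting selling price from a single feature, total square feet. The four data points are made up (they lie exactly on $150 per square foot, so the fit is perfect), which keeps the sketch easy to check by hand:

```python
sq_ft = [1000, 1500, 2000, 2500]
price = [150_000, 225_000, 300_000, 375_000]

# Ordinary least squares for a single feature: y = slope * x + intercept.
n = len(sq_ft)
mean_x = sum(sq_ft) / n
mean_y = sum(price) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sq_ft, price))
         / sum((x - mean_x) ** 2 for x in sq_ft))
intercept = mean_y - slope * mean_x

def predict(square_feet):
    return slope * square_feet + intercept
```

The labeled rows train the model (slope and intercept), and `predict` then prices a house that isn't on the spreadsheet.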

Unsupervised Learning is used on data that has features but lacks labels. Perhaps we have a lot of data about customers who have purchased something from our website. Normal demographic data (location, age, income bracket) is so varied among our customers that we are having a tough time marketing to them. We could use a clustering algorithm to identify the different groups that comprise our customer base. Unsupervised Learning is a rapidly expanding area not only of academic research, but also of real-world ML implementations.
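A toy one-dimensional k-means sketch shows the clustering idea: group customers by a single numeric feature, here age. Real implementations handle many features at once; the ages and k = 2 are illustrative:

```python
def kmeans_1d(points, k=2, iters=20):
    """Cluster 1-D points into k groups by iteratively reassigning
    each point to its nearest center and recomputing the centers."""
    step = max(1, len(points) // k)
    centers = sorted(points)[::step][:k]  # spread initial centers out
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [21, 23, 25, 58, 60, 62]
centers, clusters = kmeans_1d(ages)
```

No labels were supplied; the algorithm discovers the younger and older customer groups on its own.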

Hopefully, this clears up any confusion surrounding the difference between Artificial Intelligence and Machine Learning, as well as the types of Machine Learning that are currently being researched and implemented.

Saturday, May 23, 2020

Deep Learning Monkey Introduction

Welcome to my blog! I'm a Software Engineer who has developed a passion for Machine Learning. This field is exciting and has been capturing the imagination of developers. I began by completing various courseware classes, and in a later post I will give my thoughts on the specific ones I've tried. In addition to online learning, I continue to spend copious amounts of time on blogs, and there are countless ones out there. Which begs the question: why could we possibly need another one? While the bulk of the online resources and books that I have read on the subject are great and very informative, they seem devoid of the issues that Software Engineers deal with on a daily basis. I'd like this blog to be a study of Machine Learning from the perspective of a Software Engineer. I think this has value to the Machine Learning community as a whole. Software Engineers tend to think about problems before solutions; they think about and write clean code that is well organized and maintainable; and lastly, they think about how the software will be deployed, configured, and scaled.

I think it's fair to say that most people have been deluged by reporting about AI or Machine Learning (I'll discuss the difference in the following post). Paradoxically, the sheer volume of media on the subject has probably generated more confusion than clarity. It reminds me of the time I was working for a DoD contractor making frameworks for courseware. The Courseware Manager called me into his office and asked earnestly, "Do we need the Cloud?" It was obvious that he didn't really know what the "Cloud" was but had heard the term somewhere. I think there are a lot of medium and small businesses in the same place as my old manager. To his credit, he let me explain the term as it applied (or didn't apply) to our projects, and we moved on quickly. I wonder how many medium and small businesses take a similar approach to my manager's, and how many spend a lot of resources on projects that will have little return. This is where a Software Engineer's approach really comes in handy. Typically, SEs are presented with or identify a problem and then look for a technical solution. While I have no illusions that Software Engineers don't make mistakes in offering up solutions, I do think the approach is correct, and I think most instructive material doesn't emphasize this enough. So I'll spend more than a few posts focusing on the why and not so much the how.

Many of the online resources and some books include code examples that aren't very well structured. I think this is mainly due to the context in which the code examples are supplied: they are generated with the aim of teaching some aspect of Machine Learning, not application architecture. But it does lend itself to the mindset that all ML is done in PyCharm with scripts that include few or no classes, methods, or unit tests and are riddled with literals. I'd like to explore how we can implement Machine Learning in well-structured code using industry best practices.

Lastly, I see a shocking lack of material in the Machine Learning community that deals with much of what comprises Software Engineers' daily tasks. I feel like so much of the material is being generated by people who are not currently working as developers. Machine Learning as functional code is pretty new. Ten years ago, most of the work in Machine Learning was being carried out at elite Computer Science departments or companies with very large R&D budgets. The field has come a long way since then, but outside of large companies, implementation can be fairly haphazard compared to other software projects. In addition, roles aren't well defined; I've read descriptions of a Data Scientist job that really amounted to nothing more than a Business Analyst. I think there is a lot of confusion about what is Machine Learning and what is statistical inference. Ultimately, most of this confusion arises from how new the implementation of Machine Learning is. What does seem clear is that there is very little material about ML as it pertains to Software Engineering best practices. I want to explore how to deploy and maintain Machine Learning tools in Continuous Integration/Continuous Deployment environments. I say tools because I also include data pipelines in this. How do we handle scalability for our projects? How can we verify very large datasets without bringing platforms down with them? I suspect there will be a lot more subjects than I currently discuss here, and I'm sure it's going to provide for lively discussion as well.

So, again, welcome to my blog. There are so many great resources for Machine Learning, and my intent is for this blog to join that community. I certainly don't wish to demean other blogs, books, videos, etc. I just think there is a definite need to look at Machine Learning through the lens of a Software Engineer. Your comments and feedback will always be welcome.

Thanks and Happy Learning,
