Learning to Think Like a Data Scientist — Part 2

Esther Richler · Published in Insight · Nov 10, 2020

A field of poppies with a few close up poppies in the foreground and a sliver of blue sky in the background.

Originally published in The ML Rebellion.

In Part 1, I described the planning phase of a data project. The purpose of the planning phase is to identify a business need and develop a roadmap for how to meet it. In this post, I will cover the project execution phase, i.e. putting the plan into action. The goal of this phase is to create a minimum viable product capable of meeting the business need.

Project Execution Phase

The steps in this phase are not as clear-cut as those in the planning phase, because they depend more closely on the needs of the individual project. To keep things generalizable across projects, I have broken the workflow down into three broad and loosely defined steps:

  1. Data Preparation
  2. Machine Learning
  3. Communication

Each of these steps has several possible tasks, some of which I will outline throughout this post. These tasks are project-specific, so they can be moved around, eliminated, or added to as needed.

This is not a deeply technical post, so the ML descriptions are high-level. I’ve included links to deeper write-ups for those wishing to learn more, and I link to code where appropriate. As in Part 1, I highlight how the decision-making process is shaped by the business needs. But unlike in Part 1, here I focus more on how to think like a scientist, and how to be critical and ask the right questions.

1. Data Preparation Step

This is a labor-intensive step that requires a lot of mental flexibility. In this step, you access and explore the data and then come up with an analysis strategy. This is also the time to diagnose any data issues and figure out how to solve them. Because this is your first chance to really get into the details of the data, this is where you might discover that you need to pivot your project. This is my favorite step because I enjoy creative problem solving.

1.1 Access and explore the data

I had pinpointed satellite data as the most appropriate source of poppy information for this project. Fortunately, there are many satellites to choose from, so I had the flexibility to select the best one for the job. The main aspects I took into consideration were the spatial resolution of the images, how often the satellite collects data, and data accessibility. I chose the Sentinel-2 satellite because it was the best fit overall for my needs.

The pixel resolution of images acquired by Sentinel-2 ranges from 10 to 60 meters, which for satellite images is considered medium resolution. This is high enough to detect patches of poppies in a field, but low enough to keep file sizes manageable. Sentinel-2 collects data from each location once or twice per 10-day period. While I would have preferred something a little more frequent, poppy season lasts for several weeks, so this frequency is sufficient for monitoring bloom sizes over that period. Sentinel-2 data is free and accessible in a variety of formats, and I used the Sentinel Hub API to download the data directly to my hard drive.
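For the curious, here is a minimal sketch of what that download step can look like with the sentinelhub Python package (v3). The credentials, bounding box, and dates below are illustrative placeholders rather than the project’s exact values:

```python
# Sketch: download a natural color Sentinel-2 image via Sentinel Hub.
# Assumes a Sentinel Hub account; credentials here are placeholders.
from sentinelhub import (SHConfig, SentinelHubRequest, DataCollection,
                         MimeType, BBox, CRS, bbox_to_dimensions)

config = SHConfig()
config.sh_client_id = "YOUR_CLIENT_ID"          # placeholder credentials
config.sh_client_secret = "YOUR_CLIENT_SECRET"

# Illustrative bounding box near the Antelope Valley Poppy Reserve
bbox = BBox(bbox=(-118.45, 34.70, -118.36, 34.76), crs=CRS.WGS84)
size = bbox_to_dimensions(bbox, resolution=10)  # 10 m per pixel

# Evalscript that returns the three visible bands as an RGB image
# (the 2.5 gain is a common brightness boost for true color display)
evalscript = """
//VERSION=3
function setup() {
  return {input: ["B02", "B03", "B04"], output: {bands: 3}};
}
function evaluatePixel(sample) {
  return [2.5 * sample.B04, 2.5 * sample.B03, 2.5 * sample.B02];
}
"""

request = SentinelHubRequest(
    evalscript=evalscript,
    input_data=[SentinelHubRequest.input_data(
        data_collection=DataCollection.SENTINEL2_L1C,
        time_interval=("2019-03-01", "2019-04-30"),
    )],
    responses=[SentinelHubRequest.output_response("default", MimeType.TIFF)],
    bbox=bbox,
    size=size,
    config=config,
)
image = request.get_data()[0]  # numpy array of shape (height, width, 3)
```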

I started out by looking at data acquired during a poppy bloom at the Antelope Valley Poppy Reserve. This served as my ‘positive control’ to make sure I could see blooms when they were present. Sentinel-2 collects data from 13 different bands of light, and I used the three visible bands (red, green, and blue) to reconstruct natural color images. The orange blooms were clearly visible against the background patchwork of fields, roads, and buildings.

Satellite images of the Antelope Valley Poppy Reserve before, during, and after the 2019 poppy season

The remaining light bands collected by Sentinel-2 are outside of the visible range, but they contain information related to things like moisture, clouds, vegetation, geological makeup, and urban development. This information can be extracted by combining the bands in various ways.
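A classic example of this kind of band math is NDVI, a vegetation index that contrasts the near-infrared band (B08), which healthy vegetation reflects strongly, against the red band (B04). A minimal sketch, assuming `nir` and `red` are reflectance arrays for the same scene:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    # Guard against division by zero on empty pixels
    return np.where(denom == 0, 0.0, (nir - red) / denom)
```

Higher NDVI values indicate greener, more vigorous vegetation.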

1.2 Develop an analysis strategy

With access to the key data, it was time to come up with a strategy for translating that data into business value. I had to first think about what kind of metric I was going to present to the user, and then think about a path I could take to extract that metric from the data. I would also have to diagnose any data issues that might get in the way, and figure out how to fix or get around them.

I had already decided (during the project planning phase) that the output of my minimum viable product would be bloom level (low, medium, or high). To be able to measure the extent of a bloom, I first needed a way to identify which pixels in the image contained poppies and which did not. One way to do this is to use color thresholding to select all the orange pixels. However, this proved problematic because other objects in the image (dead vegetation, buildings, soil, etc.) could have a similar orange color. Also, the intensity and shading of the orange color varied for poppies at different locations and times. I could, however, move beyond the limits of the human eye by taking advantage of information from the invisible satellite bands. More specifically, I could use the data from all 13 satellite bands as features in a model to predict the presence or absence of poppies in each pixel. In short, using supervised machine learning to classify each pixel would let me get a count of how many poppy pixels were in an image.
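Concretely, per-pixel classification means treating every pixel as one row of a feature matrix whose 13 columns are the band values. A small sketch of that reshaping (the array names are illustrative):

```python
import numpy as np

def image_to_features(image: np.ndarray) -> np.ndarray:
    """Flatten an (h, w, 13) multi-band image into an (h * w, 13) feature matrix."""
    h, w, n_bands = image.shape
    return image.reshape(h * w, n_bands)

# A classifier trained on labeled pixels can then predict a class for each
# row, and the predictions can be reshaped back to (h, w) to form a map.
```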

To do this, I would need a source of labeled data to train a classifier on. The US Department of Agriculture has a tool called CropScape, which provides labeled satellite data for various types of crops and agricultural products. But flowers are not included in the CropScape data, and I could not find a comparable dataset for non-agricultural plants. This was a major technical hurdle. If I had no labeled data, how would I classify the poppy pixels and calculate the size of a bloom?

1.3 Pivot if necessary

Solving this problem was crucial to being able to meet the guiding need, and it became the new focus of the project. But now the project was too large to complete in the time I had. I would have to find a way to reduce it while still maintaining its value. My previously defined minimum viable product offered two functionalities: current bloom status and predicted bloom status. But the real value of the product came from its ability to provide these bloom reports for non-park locations. This made me realize that simply providing the current bloom status for a non-park location was sufficient to meet the need at its most minimal level. The prediction functionality was not needed for the minimum viable product and could always be added at a later time. My updated minimum viable product would provide current bloom status for three California locations for the 2020 bloom season. Since this was a proof-of-concept product, I would also include the ability to go back in time and see what the results would have looked like during the 2017, 2018, and 2019 poppy seasons.

This change was easy because I had prepared for it during the planning phase. I had made sure that the guiding need was not too narrow yet still clearly defined, and I had come up with enough specific product details to allow for future flexibility. Admittedly, the change I was making here was more of a detour than a pivot, but these types of preparations are also helpful when a larger change is needed, or when the pivot occurs at a different point in the product life cycle. In short, anticipating and preparing for pivots minimizes the need for additional work later on.

1.4 Update the strategy

The problem at hand was figuring out how to classify data that was unlabeled. Unsupervised learning is commonly used to find patterns or clusters in unlabeled data. But I already had some prior knowledge of what I was looking for, and it seemed inefficient to disregard that information. I knew that each pixel fell into one of two classes: poppy or non-poppy. I further knew that all of the poppy pixels were orange, but that sometimes non-poppy pixels were also orange. Perhaps I could harness some of that information to manually label a small fraction of the data, and then use semi-supervised learning to predict the classes for the rest of the data. However, this would heavily bias the model towards using color as a predictor, which is what I was trying to avoid in the first place. I decided to get around this problem by taking an active learning approach.

Active learning is a form of semi-supervised learning where prediction mistakes are corrected in an iterative process by an outside decision maker (in this case, me). It is used in situations where there is a large amount of unlabeled data and it is too expensive to manually label each data point. On each iteration, a small number of highly informative data points are carefully selected and hand-labeled, and the model is updated and retrained. Because all of the labeled data points are so informative, relatively little training data is needed, and model performance improves quickly. At the end of the process, the model is good enough to generate high quality class predictions for the entire dataset. This was exactly what I needed.

2. Machine Learning Step

In this step you finally get to put all your plans into action and flex your machine learning muscles. As in the data preparation step, the tasks in this step are flexible and can change depending on results. The first task is to test out your analysis strategy on a small subset of the data. If that goes well, it is time to process the rest of your data. If it goes poorly, or produces unexpected results, you will need to troubleshoot or pivot. Pivoting at this late stage is not the end of the world if you are sufficiently prepared. Another task that may be needed in this step is adding more data to your dataset.

2.1 Execute the analysis strategy on a small-scale

Ideally, I wanted to build a model that could recognize poppy pixels in images from lots of different settings and locations, but for my minimum viable product I limited the training data to three locations. I used two images from each location, one taken during poppy season and one taken outside of poppy season. Including off-season images was one way to ensure that the model had sufficient non-poppy pixels to learn from. I also limited the training data to images acquired during the 2019 poppy season. To ensure that the model could generalize to unseen data, I would use the 2019 data for training and testing, and use images of blooms from other years for validation.

I started by hand-labeling a small percentage of pixels as either poppy or non-poppy. For the poppy pixels, I labeled a few of the easily identifiable (i.e. orange to my eye) pixels from the on-season images. For the non-poppy pixels, I labeled an equal number of randomly chosen pixels from the off-season images. These labeled pixels would serve as the seed data in the initial model.

Pixels hand labeled as poppy or non-poppy

To choose a model, I tested four classifiers (logistic regression, naive Bayes, random forest, and support vector machine) and calculated their precision and recall. I went with the random forest model because it had the best out-of-the-box performance metrics. This was admittedly an unsatisfying way of making the choice, but it was sufficient for the minimum viable product. If I ever decided to take this product further, I would want to put more thought into choosing and tuning a model. But now was not the time for optimization, and I had to hold myself back from spending too much time on it.
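Here is a sketch of that comparison using scikit-learn. The synthetic data stands in for the real (pixels × 13 bands) feature matrix, which I can’t reproduce here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the real band data: 13 features, binary labels
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
    "support vector machine": SVC(),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: precision={precision_score(y_test, preds):.2f}, "
          f"recall={recall_score(y_test, preds):.2f}")
```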

I initialized the active learning process by training and testing a random forest model on the hand-labeled seed dataset. I then used that model to predict the rest of the pixels in the images. Next, I looked for areas where the model was unsure of its predictions, i.e., where the prediction probabilities were around 0.5, and I hand-labeled a subset of those pixels and added them to the pool of hand-labeled pixels. Finally, I used the updated pool of labeled pixels to build a new model in the next iteration of the process.
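In code, one possible shape for that loop looks like the sketch below. The names are illustrative, and `hand_label` stands in for the interactive human-in-the-loop step, which in practice was me inspecting pixels by eye:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed inputs: X_all is the (n_pixels, 13) feature matrix; seed_idx and
# seed_labels are the hand-labeled seed pixels (0 = non-poppy, 1 = poppy).
labeled_idx = list(seed_idx)
labels = list(seed_labels)

for iteration in range(3):
    model = RandomForestClassifier(random_state=0)
    model.fit(X_all[labeled_idx], labels)

    # Predicted probability of the poppy class for every pixel
    p_poppy = model.predict_proba(X_all)[:, 1]

    # The least certain pixels are the ones with probability near 0.5
    uncertainty = np.abs(p_poppy - 0.5)
    already = set(labeled_idx)
    candidates = [i for i in np.argsort(uncertainty) if i not in already][:50]

    # Hand label the most informative pixels and grow the training pool
    for idx in candidates:
        label = hand_label(idx)   # hypothetical human-in-the-loop step
        if label is not None:     # skip pixels I couldn't label confidently
            labeled_idx.append(int(idx))
            labels.append(label)
```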

As a quick sanity check of model performance on each iteration, I visualized the locations of the predicted poppy pixels and compared those visualizations side-by-side with the natural color images. The off-season images were particularly useful at this point: large numbers of predicted poppy pixels in these images would have alerted me to a problem with the model. Fortunately, the labeling process went well and that did not happen. After just three iterations, I had a model that could confidently and correctly predict the labels of pixels it hadn’t seen before.
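The sanity check itself can be as simple as plotting the natural color image next to the predicted mask. A sketch, assuming `rgb` is an (h, w, 3) image and `p_poppy` holds the flattened prediction probabilities:

```python
import matplotlib.pyplot as plt

mask = (p_poppy >= 0.5).reshape(rgb.shape[:2])  # binarize the predictions

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.imshow(rgb)
ax1.set_title("Natural color")
ax2.imshow(mask, cmap="gray")
ax2.set_title("Predicted poppy pixels")
for ax in (ax1, ax2):
    ax.axis("off")
plt.show()
```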

Certainty of pixel predictions (p_poppies = predicted probability that class is poppy)

Visualizing the predictions at each step also helped me choose which pixels to hand-label. In some cases, when the model was unsure of its predictions, it was very clear to me what the correct label should be. For example, pixels that encompassed roads were very obviously not poppy pixels, so I selected them and hand-labeled them as non-poppy (see arrows in the figure above). But sometimes I was also unsure of the correct label. An example of this is the large yellowish patch in the upper right quadrant of the natural color image above. The model seemed to think that some of the pixels in that area had poppies, but it was not sure about all of them. In this case, I did not correct the model or update the labels, because it was not clear to me whether there were poppies in that area. This could very well have been a case where the model performed better than my eyes because it had access to more information. If this were a higher-stakes project, and it was important to get all the labels right, I would want to consult with experts in satellite imagery or botany to help me identify those pixels.

2.2 Augment the dataset and process the new data

My next steps were to build up a full dataset for use in the app, and then classify the pixels in each additional image. I used the Sentinel Hub API to download satellite images from March and April of 2017–2020 for three California locations (Lake Elsinore in Riverside County, and Antelope Valley and Grass Mountain in Los Angeles County). I made sure to exclude images that had too much cloud cover. I then used the model I had generated in the previous step to classify all of the pixels in the newly downloaded images.
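The cloud filtering can be done at download time: the sentinelhub package accepts a maximum cloud cover fraction per request. Reusing the imports from the earlier download sketch, and with a threshold that is illustrative rather than the one I actually used:

```python
# Keep only scenes with less than 20% cloud cover (illustrative threshold)
input_data = [SentinelHubRequest.input_data(
    data_collection=DataCollection.SENTINEL2_L1C,
    time_interval=("2019-03-01", "2019-04-30"),
    maxcc=0.2,
)]
```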

I now had all of the information I needed for my minimum viable product. All that remained was to wrap it up in an interactive and user-friendly framework.

3. Communication Step

In this step, you set up the process that turns your results into a bottom-line, actionable insight for the user. Tasks in this step include creating a user-friendly metric, and building and deploying the app or platform.

3.1 Translate results into a metric

Now that I could generate class predictions for all pixels in an image, I had to translate this information into a metric that could easily communicate bloom size to the user. To do this, I first converted the images into binary arrays where the value of each pixel reflected its label. I then used image segmentation techniques to isolate individual bloom patches, and extracted the area of each patch in square pixels. Finally, rather than using the three-level scale I had originally planned, I converted the patch sizes from square pixels into something more relatable: football field units.
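One way to sketch that pipeline is with connected-component labeling from scipy. The conversion factors are assumptions: Sentinel-2 visible bands have 10 m pixels (100 m² each), and an American football field, including end zones, covers roughly 5,350 m²:

```python
import numpy as np
from scipy import ndimage

M2_PER_PIXEL = 100.0    # assuming 10 m x 10 m pixels
M2_PER_FIELD = 5350.0   # approximate area of a football field with end zones

def bloom_patch_sizes(mask: np.ndarray) -> np.ndarray:
    """Label connected poppy patches in a binary mask and return their
    areas in football field units."""
    labeled, n_patches = ndimage.label(mask)
    areas_px = ndimage.sum(mask, labeled, index=range(1, n_patches + 1))
    return np.asarray(areas_px) * M2_PER_PIXEL / M2_PER_FIELD
```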

3.2 Create and deploy the product

The final task was to convert the data and results into a product form that the user could interact with. For a small project like this one, this is relatively straightforward. I used Streamlit to create a user-facing app, deployed the app to the web with AWS, and uploaded the code to a GitHub repository. As a finishing touch, I gave the app a punchy and memorable name: ‘kaBLOOM’.
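For a sense of how little code a Streamlit app needs, here is a bare-bones sketch of the app’s skeleton; `load_bloom_sizes` is a hypothetical helper that would return the precomputed patch sizes for a location and year:

```python
import streamlit as st

st.title("kaBLOOM")

location = st.selectbox(
    "Location", ["Lake Elsinore", "Antelope Valley", "Grass Mountain"])
year = st.slider("Year", 2017, 2020, value=2020)

sizes = load_bloom_sizes(location, year)  # hypothetical data-loading helper
st.write(f"Total bloom area: {sizes.sum():.1f} football fields")
```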

Product logo

My app was now functional and I finally had a minimum viable product!

Working through the steps of this project gave me valuable insights into what it takes to be a good data scientist. It is not enough to simply understand machine learning techniques, nor even to know how to apply them. Real data science is an art. It is an art in balancing rigor and thoroughness with business needs and bottom-line results. It is an art in recognizing when to be critical and when to be tolerant. And it is an art in setting and meeting goals that align with the larger context of the work. I hope these blog posts have helped shed light on some of these less easily defined aspects of learning to think like a data scientist.

Hiring managers — I’m on the job market and actively interviewing for Data Scientist positions. Reach out by email or on LinkedIn if you think I might be a good fit for a role you are looking to fill.

Are you ready to make a change & transition to a career in tech? Sign up to learn more about Insight Fellows programs and start your application today.
