Not-so-social-parks: predicting suitable times to visit parks while observing social distancing

Huayi Wei · Published in Insight · Jun 2, 2020

This article was co-authored by Huayi Wei, Isabel Urrutia, Eric Epstein, and Ed Kramkowski.

Among the many frustrations created by COVID-19, there is one question that bothers all park-goers: what is the best time to visit the park while avoiding crowds? Shortly after the lockdown began in New York City, we (four Insight Data Science Fellows) started working on a tool that Brooklynites can use to identify safe times to exercise in Prospect Park.

In the ensuing month and a half, we worked under intense time pressure to develop this time-sensitive idea into a fully functioning web app: not-so-social-parks.space. In this post, we'll share how teamwork and citizen science made our project possible.

We will walk you through how we explored what data to use for making predictions, and how we gathered data to label our dataset. We will then explain how we stretched our small dataset to generate a workable dataset, and how we performed feature selection and feature engineering. Lastly, we will talk about the design of our front end and back end.

More technical details and our code are available on GitHub.

Has Google already solved the problem?

The first solution we thought of was the “popular times” feature on Google Maps. Can these live reports on popularity tell us if the park is too crowded?

Not quite. In a small “popular times” dataset that we labeled with images taken from the park, we found a large overlap between “safe” and “not safe” data points. One possible explanation is that “popular times” captures motor traffic from the nearby streets, not just foot traffic inside the park. Given that, “popular times” alone is too ambiguous a signal. To build a more accurate model, we decided to incorporate a diversity of data types.

Building our dataset

After some exploration, we discovered that weather, day of the week, and time of day also have significant effects on park activities. To develop our final model, we supplemented the “popular times” data with data from additional sources.

Our team member Ed set up automatic web scrapers for weather data and popular times data. He housed the data in a SQL database and started assembling the back-end systems that would be the backbone of our model.
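To give a flavor of these jobs, here is a minimal sketch of an hourly weather scraper feeding a Postgres table. The endpoint URL, credentials, and table schema are placeholders for this illustration, not our production setup; the actual scrapers are in the GitHub repo.

```python
# A minimal sketch of an hourly weather scraper; the endpoint, credentials,
# and table schema below are placeholders, not our production setup.
import requests
import psycopg2

# Hypothetical JSON weather endpoint near Prospect Park's coordinates.
WEATHER_URL = "https://api.example.com/weather?lat=40.66&lon=-73.97"

def scrape_weather_once():
    payload = requests.get(WEATHER_URL, timeout=10).json()
    conn = psycopg2.connect(dbname="parks", user="scraper", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO weather (observed_at, temperature, wind_speed, conditions) "
            "VALUES (%s, %s, %s, %s)",
            (payload["time"], payload["temp"], payload["wind"], payload["conditions"]),
        )
    conn.close()

if __name__ == "__main__":
    scrape_weather_once()  # scheduled hourly via cron
```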

However, without any established method to label the park as “safe” or “not safe” for social distancing, we had to develop our own. To get close to the ground truth fast, we leveraged the power of teamwork and took a multi-front approach. We collected three types of data to label our dataset:

  • Images of the park. With no access to webcam or satellite data, we literally had our team member Huayi take photos from her roof, which overlooks Prospect Park, every hour from 7 am to 8 pm. We then manually labeled the photos based on how much crowding we observed. Although this was our most reliable data, she only lasted a week. “I got so fit that week climbing up 4 floors of stairs every hour,” she later recalled.
  • Tweets. Meanwhile, Eric came to the rescue by using Tweepy and GetOldTweets to collect tweets related to Prospect Park. After filtering out bot-generated material, he manually sifted through the hundreds that remained to find the relevant ones (a sketch of this step follows below the list). Many were entertaining, a few were useful, and a handful, such as this one, were both: “It’s nice for one day and just look at these a**holes not social distancing! Can we please just close the parks, because obviously Brooklyn isn’t getting the message. #stupidbrooklyn #whatcurve @Prospect Park”
  • Survey responses. What exactly constitutes a “safe” condition for social distancing can be a subjective matter. To account for this subjectivity, we turned to citizen science, broadening our input to include live opinions from park-goers. Experienced in survey design, Isabel quickly designed and deployed a questionnaire. Responses started rolling in as we distributed it on Reddit, Twitter, running clubs, and neighborhood Facebook groups.
An example image of Prospect Park taken from Huayi’s roof
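For the tweet collection, a rough sketch using Tweepy's v3-era search API looks like the following. The credentials are placeholders, and the bot filter shown is far cruder than the manual sifting we actually did.

```python
# A rough sketch of tweet collection with Tweepy (v3-era API); credentials
# are placeholders, and real bot filtering was mostly done by hand.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

candidates = []
for tweet in tweepy.Cursor(
    api.search,
    q='"prospect park" -filter:retweets',
    lang="en",
    tweet_mode="extended",
).items(500):
    # Skip the most obvious bot accounts; manual review catches the rest.
    if "bot" in tweet.user.screen_name.lower():
        continue
    candidates.append((tweet.created_at, tweet.full_text))
```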

Training a model with very limited data

After a month of data-gathering, we obtained a dataset with 157 labeled data points, far from enough. To generate a workable dataset, we assigned each label to a 15-minute time-bin. As crowds in the park take time to build up and dissipate, we assumed that one time-bin before and two time-bins after a labeled time-bin carried the same label. We then filled any missing time-bins with averages of the surrounding labeled time-bins. The final dataset reached 515 labeled time points from 7 am to 8 pm across 31 days.

Filling a missing time-bin with the average of the adjacent labeled time-bins
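In pandas, this label-expansion step might look like the sketch below. We assume the labels arrive as a numeric Series (1 = “safe”, 0 = “not safe”) indexed by observation time; the exact implementation lives in our repo.

```python
# A minimal sketch of the label-expansion step, assuming `labels` is a
# numeric Series (1 = safe, 0 = not safe) indexed by observation time.
import pandas as pd

def expand_labels(labels: pd.Series) -> pd.Series:
    # Snap each observation to its 15-minute bin, averaging duplicates.
    binned = labels.copy()
    binned.index = binned.index.floor("15min")
    binned = binned.groupby(binned.index).mean()

    # Build the full 7 am to 8 pm grid of 15-minute bins for each day seen.
    days = binned.index.normalize().unique()
    grid = pd.DatetimeIndex(
        [t for d in days
           for t in pd.date_range(d + pd.Timedelta(hours=7),
                                  d + pd.Timedelta(hours=20), freq="15min")]
    )
    full = binned.reindex(grid)

    # Crowds build up and dissipate slowly: copy each label to the one bin
    # before it and the two bins after it.
    full = full.bfill(limit=1).ffill(limit=2)

    # Interpolate any remaining interior gaps between surrounding labels.
    return full.interpolate(limit_area="inside")
```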

Feature selection on this small dataset was not a straightforward task. The weather data alone had 30 highly correlated features. The changing seasons also meant that a model trained with absolute weather data would adapt poorly as NYC transitioned from March to June.

Keeping these concerns in mind, we carefully distilled 28 of the 30 weather features into 3 new categorical features: “good”, “bad”, and “maybe”. “Good” weather is a clear sunny day, inviting people to embrace the outdoors. “Bad” weather is an overcast or stormy day, keeping people inside. “Maybe” covers everything in between. These intuitive categories not only capture the essence of our data, but are also robust to seasonal changes. We left 2 weather features untouched: temperature and wind speed.
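A simplified version of this distillation might look like the sketch below; the column names and thresholds are assumptions for illustration, not the exact rules we used.

```python
# A simplified sketch of the weather distillation; column names and
# thresholds are illustrative, not our exact rules.
import pandas as pd

def weather_category(row: pd.Series) -> str:
    """Collapse correlated raw weather columns into one of three labels."""
    if row["precipitation"] > 0 or row["cloud_cover"] > 0.8:
        return "bad"    # overcast or stormy: people stay inside
    if row["cloud_cover"] < 0.3:
        return "good"   # clear and sunny: people head outdoors
    return "maybe"      # everything in between

def add_weather_features(raw: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the category into the 3 categorical features, and
    # carry over the 2 untouched numeric features.
    out = pd.get_dummies(raw.apply(weather_category, axis=1), prefix="weather")
    out[["temperature", "wind_speed"]] = raw[["temperature", "wind_speed"]]
    return out
```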

Our final model was a Random Forest Classifier trained on a dataset with only 8 features: the 5 weather features, the “popular times” data, the day of the week, and the hour of the day. It achieved an AUC of 0.858.
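As a rough illustration, here is what that training step might look like in scikit-learn. Here `df` stands in for the labeled dataset built above, and the column names and hyperparameters are our assumptions for the sketch, not the exact configuration in the repo.

```python
# A minimal sketch of the final training step; column names and
# hyperparameters are illustrative, not our exact configuration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = [
    "weather_good", "weather_bad", "weather_maybe",  # distilled categories
    "temperature", "wind_speed",                     # untouched weather data
    "popular_times", "day_of_week", "hour",          # activity and calendar
]

# df is the labeled dataset: one row per 15-minute time-bin.
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df["safe"], test_size=0.2, stratify=df["safe"], random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score held-out time-bins by the probability of the "safe" class.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```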

A user-friendly front end

To make our web app accessible to the public, we laid out our results sequentially, with the most actionable takeaway at the top of the page. A thumbs-up or thumbs-down shows our current prediction of whether it’s safe to work out in the park. This is followed by a heatmap of the average risk level for each hour of each day, to help those who want to plan ahead. Other data, such as more detailed predictions and tweets, sit under toggle buttons, readily available to the curious.

An efficient pipeline

Under the hood, our project runs on AWS. EC2 instances and cron jobs orchestrate the pipeline. All data streams into a PostgreSQL database. Every 15 minutes, the latest data is pulled into an S3 bucket, where the model picks it up and writes a fresh prediction back to the same bucket. The front end then fetches the new prediction and updates.
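The prediction job itself might look roughly like the sketch below, using boto3; the bucket and key names are placeholders for illustration.

```python
# A minimal sketch of the 15-minute prediction job run by cron on EC2;
# the bucket and key names are placeholders.
import io
import pickle

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "not-so-social-parks"  # placeholder bucket name

def run_prediction():
    # Pull the latest features that the database export dropped in S3.
    obj = s3.get_object(Bucket=BUCKET, Key="latest/features.csv")
    features = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # Load the trained model and score the new data.
    model_obj = s3.get_object(Bucket=BUCKET, Key="model/model.pkl")
    model = pickle.loads(model_obj["Body"].read())
    p_safe = model.predict_proba(features)[:, 1]

    # Write the prediction back to the same bucket for the front end.
    body = pd.DataFrame({"p_safe": p_safe}).to_csv(index=False)
    s3.put_object(Bucket=BUCKET, Key="latest/prediction.csv", Body=body.encode())

if __name__ == "__main__":
    run_prediction()
```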

Continuously monitoring model performance

One factor that our static model couldn’t capture was ever-changing human behavior. The ups and downs of COVID-19 policies can easily disrupt the implicit relationships between the data and the predictions.

To monitor model performance, we once again turned to the public for help. On our web app, we constantly collect user feedback on the accuracy of our model. The feedback is piped to the backend, compared with our standing predictions, and used to retrain our model.
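As a sketch, matching each piece of feedback against the prediction that was live when it arrived might look like this; the table and column names are hypothetical.

```python
# A sketch of the feedback check; table and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/parks")  # placeholder

feedback = pd.read_sql("SELECT submitted_at, user_says_safe FROM feedback", engine)
preds = pd.read_sql("SELECT predicted_at, predicted_safe FROM predictions", engine)

# Align each piece of feedback with the prediction that was live at the time.
merged = pd.merge_asof(
    feedback.sort_values("submitted_at"),
    preds.sort_values("predicted_at"),
    left_on="submitted_at",
    right_on="predicted_at",
)
agreement = (merged["user_says_safe"] == merged["predicted_safe"]).mean()
print(f"Live agreement with users: {agreement:.1%}")
```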

Conclusion

Hard times call for strong teamwork. Collaboration let us collect data from multiple sources simultaneously and parallelize the construction of our front end and back end. By pooling our wisdom and skills to generate creative solutions, we completed our product on time. But in the end, our team was not limited to the four of us. Without civic-minded Brooklynites, we couldn’t have gathered our data so quickly. Without the Insight community, we would have missed out on precious feedback at every step along the way. Hard times demand that everyone work as a team.


Authors

Huayi Wei is a big fan of Prospect Park. She holds a Ph.D. in Neuroscience. At Insight, she built a BERT-based named-entity-recognition model to automate the process of documenting art auction records. She is a brilliant communicator with a background in science writing.

Isabel Urrutia is a Ph.D. candidate in Geography & Planning, and holds a Master of Mathematics, a Master of Environmental Studies, and a Bachelor of Mathematics. At Insight, Isabel consulted for a workplace analytics company, and built out their core product by adding features that incorporate recommendations using linear optimization.

Eric Epstein earned a Ph.D. in Philosophy and, before that, a B.A. in Mathematics and Philosophy, all from Cornell University. As an Insight Fellow, he consulted for a major beer manufacturer, helping them sell more beverages through their smartphone app by using gradient boosted trees to identify better ways to engage app users.

Ed Kramkowski is a Ph.D. candidate in Physics. At Insight, he built a recommendation system that helps podcasters grow their audience by analyzing podcasts’ audience interests and recommending ideal advertising and collaboration strategies.
