The thin line between data science and data engineering


Editor’s note: This is the fourth episode of the Towards Data Science podcast “Climbing the Data Science Ladder” series, hosted by Jeremie Harris, Edouard Harris and Russell Pollari. Together, they run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:


If you’ve been following developments in data science over the last few years, you’ll know that the field has evolved a lot since its Wild West phase in the early/mid 2010s. Back then, a couple of Jupyter notebooks with half-baked modeling projects could land you a job at a respectable company, but things have since changed in a big way.

Today, as companies have finally come to understand the value that data science can bring, more and more emphasis is being placed on the implementation of data science in production systems. And as these implementations have required models that can perform on larger and larger datasets in real-time, an awful lot of data science problems have become engineering problems.

That’s why we sat down with Akshay Singh, who among other things has worked in and managed data science teams at Amazon, League and the Chan Zuckerberg Initiative (formerly Meta.com). Akshay works at the intersection of data science and data engineering, and walked us through the fine line between data analytics and data science, the future of the field, and his thoughts on best practices that aren’t getting enough love. Here are our key take-homes:

  • One of the easiest mistakes to make in data engineering is failing to think through your choice of tools. Why are you using S3 as your data warehouse? Why not Redshift or BigQuery? Forcing yourself to understand the answers to these questions, rather than accepting your tools as given, is a great way to grow, and is mission-critical if you’re going to impress potential employers.
  • Always assume that anything you build now will be replaced in a year or less. Production systems aren’t static, and you or someone else will have to revisit most parts of the codebase sooner or later. That’s why learning how to write docstrings, choose clear function and variable names, and follow best practices around inline comments is so important.
  • Data drifts over time, and a model that works well on today’s data may not work as well next week. That can be due to any number of factors: seasonality is one, but user behaviour can also just change on you. Akshay suggests that collecting user feedback on the fly is key to addressing this problem: build an alarm into your system that lets you know when that feedback turns unexpectedly negative, so you find out that something’s not right.
  • The big picture is the most important thing to keep in mind. It’s easy to get lost in a technical problem, but the mark of a great data scientist is the ability to stop and ask whether that problem is even worth solving. Do you really need a recommender system, or would a simple rule-based system work just as well? If you can’t access the exact training labels you would need for a supervised learning model, can you hack together a decent proxy? The real world is messy, and often calls upon you to treat data science problems with more creativity than a Kaggle competition.
  • The importance of seeing the big picture is only increasing over time, as more and more of the data scientist’s workload is being abstracted away through more and more powerful tools. Slowly but surely, data science is becoming a product role.
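The advice above about writing code that survives its author can be made concrete with a short sketch. Everything here (the function, its arguments, and the data shape) is a hypothetical illustration, not something from the episode:

```python
def filter_active_users(users, min_sessions=3):
    """Return the users with at least `min_sessions` recorded sessions.

    Args:
        users: list of dicts, each with a "name" and a "session_count" key.
        min_sessions: minimum number of sessions required to count as active.

    Returns:
        A list of the user dicts that meet the threshold.
    """
    # A clear name and a docstring make the intent obvious to whoever
    # has to replace or refactor this function a year from now.
    return [user for user in users if user["session_count"] >= min_sessions]
```

The point isn’t the logic, which is trivial; it’s that the docstring and names let a future maintainer understand the contract without reading the body.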
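Akshay’s suggestion about feedback-driven alarms could look something like the following minimal sketch. The class name, window size, and threshold are all illustrative assumptions; in practice `record` would be wired into whatever feedback signal your product actually collects:

```python
from collections import deque


class FeedbackMonitor:
    """Flag when recent user feedback turns unexpectedly negative.

    A minimal sketch: tracks a rolling window of feedback and raises an
    alarm when the share of negative feedback crosses a threshold,
    hinting that the data (or user behaviour) may have drifted.
    """

    def __init__(self, window_size=100, alarm_threshold=0.3):
        # Each entry is 1 for negative feedback, 0 otherwise.
        self.recent = deque(maxlen=window_size)
        self.alarm_threshold = alarm_threshold

    def record(self, is_negative):
        self.recent.append(1 if is_negative else 0)

    def alarm(self):
        # Trigger once the negative share in the window exceeds the
        # threshold; with no feedback yet, stay quiet.
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.alarm_threshold
```

A real system would add notification plumbing and a smarter baseline than a fixed threshold, but the shape of the idea is the same: watch the feedback stream, and alert a human when it moves somewhere unexpected.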

 


If you’re on Twitter, feel free to connect with me anytime @jeremiecharris!

 
Original. Reposted with permission.
