The Data Scientist versus the Data Analyst

A common question I get from my students is: “What is the difference between a Data Analyst and a Data Scientist?”

I would argue that the task at hand differs. It differs because being a data scientist does not make one master of the universe. When we say data science is teamwork, we mean that the team includes a data journalist, who is involved with data collection and data wrangling activities. The data engineer likely works with Python and strives to bring forth meaningful visualizations of the data. The data analyst may perform both of those tasks, is well versed in SQL calls, and understands the DBMS that is humming either on premises or in the cloud (or, ever more prevalent, in hybrid systems). Think Hadoop, big data and data mining skills.

The data scientist is the curious one. They are the ones with a pain point to resolve. The data scientist has a hypothesis to refute or validate (both are helpful). The data scientist ventures out of the office, feels the cold and the rain, and takes measurements from the sensors out there.

Unlike the data analyst, the data scientist (DS) is also keenly involved with unstructured data. This means the DS is extracting insights and sentiment from Twitter feeds or from Facebook images, perhaps to detect a sudden onset of depression as a result of social distancing. Is that depression more prevalent in one cohort than another? How can I help? These insights are not in IBM DB2, MS SQL Server or an Oracle database; these data points are in our mobile devices.

The tools that the DS may use go beyond the brute statistics (regression, random forests, Bayesian inference) of SPSS or SAS; they employ deep learning techniques (CNN, RNN, LSTM, capsule networks, GANs) that take feature vectors as input. After all, every data point must be turned into numbers before a machine can use it, and those numbers are commonly normalized to lie between 0 and 1. The system does not just see a cat; rather, it sees a one-hot encoding, which is a bunch of ones and zeros. The same is true if your input is a CSV file.
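To make the "ones and zeros" point concrete, here is a minimal sketch in plain Python of one-hot encoding a label and min-max scaling a numeric column into the 0 to 1 range. The helper names (one_hot, min_max_scale), the label set and the weights are all made up for illustration; in practice a library such as scikit-learn would usually do this work.

```python
def one_hot(label, classes):
    """Return a list with 1 at the position of `label` and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical label set: the machine never sees "cat", only its position.
classes = ["cat", "dog", "bird"]
cat_vector = one_hot("cat", classes)

# Hypothetical numeric column, as it might arrive from a CSV file.
weights_kg = [2.0, 4.0, 10.0]
scaled = min_max_scale(weights_kg)

print(cat_vector)
print(scaled)
```

Either way, what reaches the model is a feature vector of numbers, not the word "cat" or the raw kilograms.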

The reigning attribute that I would like to see in an aspiring data scientist is a sense of curiosity: someone who goes around and asks ‘why’ all the time. Another key feature is an understanding of inferential statistics (think regression) and Calculus II (think partial derivatives and integrals). They can clearly see in their minds how integration is the inverse of differentiation.

Python, you say? Well, it helps, but it is not at the top of my list. Nowadays, we snag code from existing Jupyter Notebooks and reuse it, perhaps just changing the values plotted on the x and y axes. One thing that is a bit more prevalent with the DS is the use of open source tools for running the math calculations (NumPy, scikit-learn) and data visualizations (my favorite visualization tool is the open source PixieDust, born and raised right here at IBM by an ex-IBM Distinguished Engineer, David Taieb).

The data scientist has a keen understanding of the confusion matrix and can interpret a Receiver Operating Characteristic (ROC) curve, where we plot the True Positive Rate (y-axis) against the False Positive Rate (x-axis).
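As a rough illustration of how the confusion matrix and the ROC curve relate, the sketch below counts the four confusion-matrix cells for a binary classifier and turns them into one (False Positive Rate, True Positive Rate) point per threshold. The labels, scores, thresholds and function names are invented for the example, and the code assumes both classes appear in the data.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive.

    Assumes y_true contains at least one positive and one negative,
    otherwise one of the divisions below would be by zero.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)  # y-axis of the ROC curve
    fpr = fp / (fp + tn)  # x-axis of the ROC curve
    return fpr, tpr


# Hypothetical ground truth and model scores for six cases.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweeping the threshold from strict to lenient traces the curve
# from the bottom-left corner to the top-right corner.
for threshold in (1.0, 0.75, 0.5, 0.35, 0.0):
    fpr, tpr = roc_point(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold gives a different confusion matrix, and therefore a different point on the curve; the ROC curve is simply those points joined together.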

The data scientist is a scientist because they started with a hypothesis and they employed the scientific method.

The data scientist understands the value of Design Thinking. They realize that there is such a thing as boiling the ocean, and they keep a keen focus on exactly for whom we are solving the problem.

There is a shared task among all these roles that have the word ‘data’ in their title: they all start the week as data janitors. The Harvard Business Review article that deemed Data Scientist the sexiest job of the 21st century forgot to mention that all that allure starts by Thursday, not on Monday. Lots of unglamorous data cleansing needs to be done before machine learning comes into play.

The data scientist understands that the winner of the AI race is not the entity or country with epic amounts of data, nor the university or firm with the next big algorithm; it is whoever does maximum AI with minimum data. For example, I have a hunch it is going to be a good day. How much data did I need first thing in the morning to make that prediction? Good luck to the ML system in making that prediction using a hunch... for now!
