What Data Science Really Is
- Data science is not about building complicated models. It’s not about making awesome visualizations.
- It’s not about writing code. Data science is about using data to create as much impact as possible for your company.
- Impact can take multiple forms: insights, data products, or product recommendations for the company.
- To do these things, you need tools such as complicated models, data visualizations, or code.
- But essentially, as a data scientist, your job is to solve real company problems using data, and nobody cares which tools you use.
- Data scientists work at GAFA companies (G = Google, A = Apple, F = Facebook, A = Amazon), which place real emphasis on using data to improve their products.
History of Data Science
Before data science, the term data mining was popularized by a 1996 article called "From Data Mining to Knowledge Discovery in Databases," in which it referred to the overall process of discovering useful information from data.
In 2001, William S. Cleveland wanted to take data mining to another level. He did that by combining computer science with data mining.
COMPUTER SCIENCE + DATA MINING
Basically, he made statistics a lot more technical, which he believed would expand the possibilities of data mining and produce a powerful force for innovation.
What is DATA SCIENCE?
Now you could take advantage of computing power for statistics, and he called this combination data science: COMPUTER SCIENCE + DATA MINING = DATA SCIENCE.
Around this time, Web 2.0 also emerged: websites were no longer just digital pamphlets but a medium for a shared experience among millions and millions of users.
With websites like MySpace in 2003, Facebook in 2004, and YouTube in 2005, we could now interact with websites: we could post, comment, like, upload, and share, leaving our footprint in the digital landscape we call the internet and helping create and shape the ecosystem we know and love today.
That is a lot of data, so much that it became too much to handle with traditional technologies, so we call it big data.
That opened a world of possibilities in finding insights using data.
But it also meant that even the simplest questions required sophisticated data infrastructure just to support the handling of the data.
We needed parallel computing technologies like MapReduce, Hadoop, and Spark.
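To make the parallel computing idea a bit more concrete, here is a minimal sketch of the map-and-reduce pattern in plain Python. It is not how Hadoop or Spark are implemented; it only illustrates the idea of splitting data into chunks, processing each chunk independently, and merging the partial results. The function names, the sample posts, and the chunking scheme are assumptions made up for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words inside one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(lines, n_chunks=2):
    # Hypothetical partitioning: deal the lines out round-robin into chunks.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    # Each chunk is processed independently, so the work can run in parallel.
    with ThreadPoolExecutor(max_workers=n_chunks) as executor:
        partials = list(executor.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Made-up user activity standing in for a very large data set.
posts = ["like share comment", "share upload", "like like"]
totals = word_count(posts)
```

Because each chunk is counted without looking at the others, the same pattern can, in principle, be spread across many machines instead of a few threads, which is the kind of scaling that tools like Hadoop and Spark are built to provide.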
So the rise of big data around 2010 sparked the rise of data science to support businesses’ need to draw insights from their massive unstructured data sets.
The Journal of Data Science then described data science as almost everything that has something to do with data, yet the most important part is its applications: all sorts of applications, such as machine learning.
In 2010, the new abundance of data made it possible to train machines with a data-driven approach rather than a knowledge-driven approach.
All the theoretical papers about recurrent neural networks and support vector machines became feasible.
This was something that could change the way we live and how we experience things in the world.
Deep learning was no longer an academic concept confined to theses; it became a tangible, useful class of machine learning that would affect our everyday lives.
So machine learning and artificial intelligence dominated the media, overshadowing every other aspect of data science, like exploratory analysis, experimentation, A/B testing, analytics, metrics, ETL, and the skills we traditionally called BUSINESS INTELLIGENCE.
The general public thinks of data scientists as researchers focused on ML and AI, but the industry is hiring data scientists as analysts.
There is a misalignment there. The reason for the misalignment is that most of these data scientists could probably work on more technical problems.
But big companies like Google, Facebook, and Netflix have so much low-hanging fruit for improving their products that they don’t need any advanced machine learning or statistical knowledge to find these impacts in their analysis.
Being a good data scientist is not about how advanced your models are; it’s about how much impact you can have with your work.
You are not a data cruncher; you are a problem solver and a strategist. Companies will give you their most ambiguous and hardest problems and expect you to guide the company in the right direction.
Real-Life Examples of Data Science Jobs in Silicon Valley
At the bottom of the pyramid, we have to collect some sort of data so that we are able to use it.
So collecting, storing, and transforming are all data engineering efforts, which are pretty important.
This is actually captured pretty well in the media because of big data; we talked about how difficult it is to manage all this data.
We talked about parallel computing, which means tools like Hadoop and Spark; we know about this.
The thing that is less known is the stuff in between, which is AGGREGATE/LABEL.
Surprisingly, this is actually one of the most important things for companies, because you are trying to tell the company what to do with its product.
ANALYTICS – This is about using the data: what kind of insights can tell me what is happening with my users?
METRICS – This is important because it answers the question: what’s going on with my product? These metrics will tell you whether you are successful or not.
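As a rough illustration of what building metrics can look like, the sketch below aggregates a tiny event log into two common product metrics: daily active users and daily conversion rate. The event format, the user IDs, and the choice of metrics are all assumptions made up for this example; a real company would define its own.

```python
from collections import defaultdict

# Hypothetical event log: (day, user_id, action).
events = [
    ("2024-01-01", "u1", "visit"),
    ("2024-01-01", "u1", "purchase"),
    ("2024-01-01", "u2", "visit"),
    ("2024-01-02", "u1", "visit"),
    ("2024-01-02", "u3", "visit"),
    ("2024-01-02", "u3", "purchase"),
    ("2024-01-02", "u4", "visit"),
]


def daily_metrics(events):
    """Return {day: {"active_users": int, "conversion_rate": float}}."""
    active = defaultdict(set)  # users seen on each day
    buyers = defaultdict(set)  # users who purchased on each day
    for day, user, action in events:
        active[day].add(user)
        if action == "purchase":
            buyers[day].add(user)
    return {
        day: {
            "active_users": len(users),
            # Share of that day's active users who made a purchase.
            "conversion_rate": len(buyers[day]) / len(users),
        }
        for day, users in active.items()
    }


metrics = daily_metrics(events)
```

Tracking numbers like these over time is one simple way to tell whether a product change helped or hurt.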
And then, you know, A/B testing, of course.
Experimentation – It lets you know which product versions are the best.
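One common way to compare two product versions is a two-proportion z-test on their conversion rates. The sketch below uses only the Python standard library; the function name and the sample numbers are made up for illustration, and a real experiment would also need care around sample size, randomization, and how long the test runs.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a and conv_b are the number of converting users in versions A and B;
    n_a and n_b are the number of users who saw each version.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up experiment: version B converts 260 of 2000 users, version A 200 of 2000.
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
```

A small p-value suggests the difference between the versions is unlikely to be due to chance alone; what counts as "small" is a decision the team has to make before running the test.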
So these things are actually really important, but they are not covered much in the media.
What’s covered in the media is the AI and deep learning part. We have heard about it on and on, but when you think about it, for a company, for the industry, it’s actually not the highest priority, or at least it’s not the thing that yields the most results for the lowest amount of effort.
That’s why AI and deep learning sit at the top of the hierarchy of needs, and things like A/B testing and analytics are actually way more important for the industry.
So that’s why we are hiring a lot of data scientists who do that.
So what do data scientists actually do?
Well, that depends on the company, largely because of its size.
- So for a start-up, you kind of lack resources.
- So you can really only have one data scientist.
- That one data scientist has to do everything.
- So you might see the data scientist doing all of this.
- Maybe you won’t be doing AI or deep learning, because that’s not a priority right now, but you might be doing all of the rest.
- You have to set up the whole data infrastructure.
- You might even have to write some software code to add logging, then do the analytics yourself, build the metrics yourself, and do the A/B testing yourself.
- That’s why, for start-ups that need a data scientist, this whole thing is data science, which means you have to do everything.
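To give a feel for the "write some software code to add logging" part, here is a minimal sketch of event logging as JSON lines, plus a reader that loads the events back for analysis. The field names, the `log_event` and `read_events` helpers, and the in-memory buffer are assumptions for illustration only; a real start-up would pick its own schema and storage.

```python
import io
import json
import time


def log_event(stream, user_id, action, **properties):
    """Append one event to `stream` as a single JSON line."""
    record = {
        "ts": time.time(),  # when the event was logged
        "user_id": user_id,
        "action": action,
        "properties": properties,
    }
    stream.write(json.dumps(record) + "\n")
    return record


def read_events(stream):
    """Parse JSON lines back into dictionaries for analysis."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a log file or a message queue.
buffer = io.StringIO()
log_event(buffer, "u1", "signup", plan="free")
log_event(buffer, "u1", "click", button="upgrade")
buffer.seek(0)
events = read_events(buffer)
```

Once events are logged in a consistent shape like this, the analytics, metrics, and A/B testing work higher up the pyramid has something to build on.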
Let us look at medium-sized companies
- Now they finally have a lot more resources, so they can separate the data engineers from the data scientists.
- So usually, collection is probably handled by software engineering.
- On the second-lowest layer, we have data engineers doing the work. Then, if your medium-sized company does a lot of recommendation models or other stuff that requires AI, the data scientists will do all of the rest.
- So as a data scientist, you have to be a lot more technical. That’s why they only hire people with PhDs or master’s degrees: they want you to be able to do the more complicated things.
So let us talk about large companies
- Because the company is getting a lot bigger, it probably has a lot more money and can spend more of it on employees.
- So you can have a lot of different employees working on different things.
- That way, employees do not need to think about the stuff they don’t want to do and can focus on the things they are best at.
For example, at my unnamed large company, I would be in analytics, so I could focus my work on analytics, metrics, and stuff like that.
So I don’t need to worry about data engineering or the AI and deep learning stuff.
So here is how it looks for a large company
Instrumentation, logging, sensors -> This is all handled by software engineers.
Cleaning and building data pipelines -> This is for data engineers.
Then we have data science analytics.
But once we get to AI and deep learning, this is where we have research scientists, or what we call data science core, and they are backed by ML engineers, that is, machine learning engineers.
Data science can be all of this; the definition depends on which company you are in and will vary accordingly.