What is data science and how does it work?

There are different career tracks in the IT world. Some people work in administration, others in development or testing, and courses exist to train system administrators, programmers, and testers. This article looks at a separate track, the Data Scientist, aimed specifically at developers, analysts and product managers.

Who is a Data Scientist or Data Specialist?

There are many myths surrounding the Data Scientist profession, and many people don’t really understand what it is. Some think that a data scientist or data analyst is something like a programmer (on the principle that if you know how to program, you know how to work with data), some consider the profession similar to that of a database administrator, and some don’t know what it is at all.

Looking ahead, it should immediately be noted that a data analyst is not a programmer and certainly not a database administrator, although he is required to have programming skills.

A data scientist is a specialist who has three groups of skills:

  • mathematics and statistics;
  • IT skills, including programming;
  • understanding of business processes in a particular area.

Job openings are not always titled Data Scientist. Common alternatives include programmer-analyst, Big Data analyst, systems analysis manager, Big Data architect, business analyst and others.

Some of the responsibilities of a data scientist include:

  • collecting large volumes of data and bringing them into a convenient format;
  • programming in Python, R or SAS;
  • solving business problems using data processing methods;
  • searching for hidden connections and patterns in data;
  • carrying out statistical tests.

A data specialist must understand the business needs of the organization and master analytical tools such as machine learning and text analytics.

According to the consulting firm McKinsey Global Institute, as early as next year the USA alone (not the whole world!) will need a whole army of data specialists - from 140 to 190 thousand.

How much does a data scientist earn?

In the USA, a data scientist earns over $138K per year on average. In Russia, you can expect a salary of 120 thousand rubles per month (more than 26 thousand dollars per year).

Compare this with an ordinary programmer: in the USA the average programmer's salary is 65–80 thousand dollars per year, and in Russia it is 60 thousand rubles per month, or 13 thousand dollars per year.

In any case, having become a data scientist, you can earn more than a programmer.

As you can see, data science is a very promising profession. First, the salary is higher than that of an ordinary programmer. Second, data specialists are scarce, and the market is short of them not only in Russia but throughout the world.

The profession is new, highly paid and much talked about. But what skills should such a specialist have? Let's take a look.

Let's talk about skills

A Data Scientist is a generalist covering analytics and information processing who understands both statistics and programming. Useful, isn't it? The skill set of each individual Data Scientist lies on a spectrum and can lean toward coding or toward pure statistics.

  • A Data Analyst based in San Francisco. Some companies really do equate Data Scientists with analysts. The work of such a specialist comes down to pulling information from a database, working in Excel and basic visualization.
  • Huge traffic and large volumes of data force some companies to look urgently for the right specialist. They often post ads for engineers, analysts, programmers or scientists, all referring to the same position.
  • There are companies for which data is the product. In this case, intensive analysis and machine learning will be required.
  • For other companies, data is not the product, but their management or workflow is built on it. Such companies also look for Data Scientists, in order to structure their data.

Headlines are full of phrases like “The sexiest profession of the 21st century.” We don’t know if this is true, but we do know that a data scientist must understand:

  1. Mathematics and statistics.
  2. Subject area and software.
  3. Programming and databases.
  4. Data exchange and visualization.

Let's look at each point in more detail.

Data Scientist and Mathematical Statistics

Developing mathematical methods based on statistical data is a fundamental part of the work. Mathematical statistics rests on probability theory, which makes it possible to draw accurate conclusions and assess their reliability.

1. Machine learning, a subfield of AI. There is a learning algorithm and example data containing patterns. We build a model of those patterns, implement it, and can then use it to look for the same patterns in new data.

2. Statistical modeling. A Data Scientist must know it in order to test a model on random inputs drawn from a given probability density. The goal is to assess the results obtained statistically.

3. Experimental design. During an experiment, one or more variables are changed to see what difference this makes. There is an intervention group and a control group, and the test consists of comparing them.

4. Bayesian inference, which updates the probability of a hypothesis as new evidence arrives.

5. Supervised learning:

  • decision trees;
  • random forests;
  • logistic regression.

6. Unsupervised learning:

  • clustering;
  • dimensionality reduction.

7. Optimization: gradient descent and variations.
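
To make the last item concrete, here is a minimal sketch of plain gradient descent in Python. It is an illustration only: the function name `gradient_descent`, the learning rate, the step count and the toy objective f(x) = (x - 3)^2 are all assumptions chosen for the example, not something prescribed by the list above.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Minimize a one-dimensional function given its derivative `grad`."""
    x = start
    for _ in range(steps):
        # Move a small step in the direction opposite to the gradient.
        x -= learning_rate * grad(x)
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
# Its minimum is at x = 3, so the result should end up very close to 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))
```

The "variations" mentioned in the list (stochastic, mini-batch, momentum and so on) change how the gradient is estimated or how the step is taken, but the core loop stays the same.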

Domain and software skills

Study and practice! This is the foundation of the specialty. A Data Scientist must have a good understanding of the subject area the work touches, and must also be familiar with the relevant software.

The list of required skills is strange, but no less useful:

Programming and Databases

From the basics to knowledge of Python, XaaS, relational algebra and SQL. In short, everything without which any attempt at serious data processing is futile.

1. Fundamentals of computer science, the starting point for anyone who ties their career to programming and process automation.

I work in natural language processing, one of the applications of data science, and I often see people using these terms incorrectly, so I wanted to clarify things a little. This article is for those who have little idea what data science is and want to understand the concepts.

Let's define the terminology

Let's start with the fact that no one really knows exactly what data science is, and there is no strict definition - it is a very broad and interdisciplinary concept. Therefore, here I will share my vision, which does not necessarily coincide with the opinions of others.

In Russian, the term data science is either translated literally as “the science of data” or, in professional circles, simply transliterated. Formally, it is a set of interrelated disciplines and methods from computer science and mathematics. Sounds too abstract, right? Let's figure it out.

First part: data

The first component of data science, without which the rest of the process is impossible, is the data itself: how to collect, store and process it, and how to extract useful information from the overall body of data. Specialists devote up to 80% of their working time to cleaning data and bringing it into the desired form.

An important part of this is how to handle data for which standard storage and processing methods are unsuitable because of its huge volume and/or diversity - the so-called big data. By the way, don’t let yourself be confused: big data and data science are not synonyms; rather, the first is a subsection of the second. At the same time, data analysts in practice do not always have to work with big data - small data can be useful too.

Imagine that we are interested in whether there is any relationship between how much coffee your work colleagues drink during the day and how much sleep they had the night before. Let's write down the information available to us: say your colleague Gregory slept for 4 hours last night, so he had to drink 3 cups of coffee; Ellina slept for 9 hours and did not drink coffee at all; and Polina slept for a full 10 hours, but drank 2.5 cups of coffee - and so on.

Let's depict the data on a graph (visualization is also an important element of any data science project). We plot sleep time in hours on the X axis and coffee in milliliters on the Y axis, which gives a scatter plot with one point per colleague.

Second part: science

We have the data, what can we do with it now? That's right, analyze, extract useful patterns and somehow use them. Disciplines such as statistics, machine learning, and optimization will help us here.

They form the next and perhaps most important component of data science - data analysis. Machine learning allows you to find patterns in existing data so you can then predict relevant information for new objects.

Let's analyze the data

Let's return to our example. By eye, the two parameters appear to be related: the less a person slept, the more coffee they drink the next day. At the same time, one example stands out from this trend - Polina, who loves both to sleep and to drink coffee. Nevertheless, we can try to approximate the pattern with a single straight line that passes as close as possible to all the points.

The green line in the plot is our machine learning model: it generalizes the data and can be described mathematically. With its help we can now determine values for new objects. When we want to predict how much coffee Nikita, who has just walked into the office, will drink today, we ask how long he slept. Having received the answer of 7.5 hours, we substitute it into the model, which gives slightly less than 300 ml of coffee. The red dot represents our prediction.
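
The straight-line fit described above can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the article's plot: it uses only the three colleagues named in the text, and it assumes one cup is 200 ml (inferred from Polina's 2.5 cups corresponding to 500 ml). The names `fit_line` and `CUP_ML` are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


CUP_ML = 200  # assumption: 2.5 cups = 500 ml, so one cup = 200 ml

# Only the three data points spelled out in the text.
sleep_hours = [4, 9, 10]                 # Gregory, Ellina, Polina
coffee_ml = [3 * CUP_ML, 0, 2.5 * CUP_ML]

slope, intercept = fit_line(sleep_hours, coffee_ml)

# Prediction for Nikita, who slept 7.5 hours.
predicted_nikita = slope * 7.5 + intercept
print(round(slope, 1), round(intercept, 1), round(predicted_nikita, 1))
```

With only these three points the line comes out as roughly 750 - 50 × hours, so the prediction for 7.5 hours of sleep is about 375 ml. The value of just under 300 ml quoted in the text presumably reflects the fuller data set behind the plot.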

This is roughly how machine learning works. The idea is very simple: find a pattern and extend it to new data. Machine learning also has another class of tasks, in which you do not predict values, as in our example, but divide the data into groups. We will talk about this in more detail another time.

Let's apply the result

However, in my opinion, data science does not end with identifying patterns in data. Any data science project is applied research, where it is important not to forget about such things as formulating a hypothesis, planning an experiment and, of course, assessing the result and its suitability for solving a specific case.

The latter is very important in real business problems, when you need to understand whether the solution found by data science will benefit your project or not. What would be the usefulness of the constructed model in our example? Perhaps with its help we could optimize the delivery of coffee to the office. At the same time, we need to assess the risks and determine whether our model would cope with this better than the existing solution - office manager Mikhail, responsible for purchasing the product.

Let's find exceptions

Of course, our example is as simplified as possible. In reality, it would be possible to build a more complex model that would take into account some other factors, for example, whether a person likes coffee in principle. Or the model could find relationships that are more complex than those represented by a straight line.

We could first look for outliers in our data - objects that, like Polina, are very different from most others. The fact is that in real work, such examples can have a bad impact on the process of building a model and its quality, and it makes sense to process them in some other way. And sometimes such objects are of primary interest, for example, in the task of detecting anomalous banking transactions in order to prevent fraud.
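
One common way to flag such objects is a robust score based on the median rather than the mean. The sketch below uses the modified z-score (median absolute deviation with the conventional 0.6745 scaling and 3.5 cut-off); the technique is swapped in here for illustration and is not named in the article. The function name `find_outliers` and the sample values are hypothetical.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Hypothetical coffee volumes (ml) for people who slept about 10 hours;
# the last value plays the role of Polina.
coffee_after_long_sleep = [100, 120, 80, 90, 110, 500]
outliers = find_outliers(coffee_after_long_sleep)
print(outliers)
```

Whether a flagged point is then dropped, down-weighted or studied more closely depends on the task: for model building it is usually noise, while for fraud detection it is the signal.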

In addition, Polina illustrates another important idea - the imperfection of machine learning algorithms. Our model predicts only 100 ml of coffee for a person who slept for 10 hours, while in fact Polina drank as much as 500 ml. Customers of data science solutions will never believe this, but it is still impossible to teach a machine to predict everything in the world perfectly: no matter how good we are at identifying patterns in data, there will always be unpredictable elements.

Let's continue the story

So, data science is a set of methods for processing and analyzing data and applying them to practical problems. At the same time, you need to understand that each specialist has his own view on this area and opinions may differ.

Data science is based on fairly simple ideas, but in practice many non-obvious subtleties are often discovered. How data science surrounds us in everyday life, what methods of data analysis exist, who the data science team consists of, and what difficulties may arise during the research process - we will talk about this in the following articles.

The ability to work with Big Data technologies is a rare and valuable skill that opens up the prospect of becoming a super-in-demand and highly paid specialist.

Alexander Petrov, CTO of E-Contenta and head of the GoTo Course mathematics course, talks about how to enter this profession.

"The sexiest profession"

As Harvard Business Review wrote a few years ago: “Data Scientist is the sexiest job of the 21st century.”

The article told the story of Jonathan Goldman, a Stanford physicist who took a job at the social network LinkedIn and did something that seemed strange and incomprehensible. While the development team was puzzling over how to modernize the site and cope with the influx of visitors, Goldman built a predictive model that suggested to each account owner which other users of the site they might know.

By convincing LinkedIn executives to try his new model, Goldman brought millions of new views to the social network and significantly accelerated its growth.

Since then, the Data Scientist profession has not become any less sexy, quite the opposite. In 2016, it topped Glassdoor's list of the top 25 jobs in the United States. We will not dwell on why this profession is considered one of the most highly paid, attractive and promising in the world. Let us only note that the number of vacancies in this field continues to grow exponentially. According to forecasts by McKinsey Global Institute, by 2018 the United States alone will need an additional 140-190 thousand data specialists.

In Russia, the need for data specialists is also growing, although there are still few of them on the market.

It is not surprising that today there are so many people who want to master this profession. Let's figure out who a Data Scientist is and what skills and knowledge he should have.

Who is he, Data Scientist?

In fact, Data Scientist is a profession surrounded by various myths. In the eyes of some, Data Scientists are a kind of shaman capable of “extracting oil” from data, with no need for any knowledge of the business. Others count almost any programmer as a member of this profession: if you know how to program, you know how to work with data.

I prefer the definition given by biostatistician Jeffrey Leek of Johns Hopkins University. A Data Scientist is a specialist with three groups of skills:

  1. IT literacy - programming, inventing and solving algorithmic problems, proficiency in software;
  2. Mathematical and statistical knowledge;
  3. Substantive experience in a certain field - understanding the business needs of your organization or the tasks of your branch of science.

Moreover, vacancies that imply this specialization may have different names. Among the most popular titles are Big Data analyst, mathematician or mathematician-programmer, systems analysis manager, Big Data architect, business analyst, BI analyst, information analyst, Data Mining specialist, machine learning engineer and many others.

How much does a data scientist cost?

Today, only a third of the demand for Data Science specialists can be satisfied. An undersaturated market cannot provide companies with qualified personnel in the field of Data Mining or predictive analytics, which leads to an increase in demand and wages.

In the US, according to O’Reilly Media, Data Scientists’ salaries can reach up to $138 thousand per year and higher, depending on their skill level. For comparison, the average salary of a programmer, according to their estimates, is $65-80 thousand per year.

According to the research center of the recruiting portal Superjob, salary offers for specialists without relevant work experience in Moscow start from 70 thousand rubles, in St. Petersburg - from 57 thousand rubles.

For the next salary level, applicants are required to have in-depth knowledge of statistical data analysis methods, skills in constructing mathematical models (neural networks, clustering, regression, factor, variance and correlation analyses, etc.), as well as experience in working with large amounts of data and the ability to identify patterns. For such specialists, the salary can reach 110 thousand rubles in Moscow and 90 thousand rubles in St. Petersburg.

Specialists with experience in building commercially successful, complex models of target-audience behavior using deep data mining tools can count on the highest income. For them, salary offers reach 220 thousand rubles in Moscow and 180 thousand rubles in St. Petersburg.

Data Science Education: Nothing Is Impossible

Today there are many opportunities for those who want to develop in the field of big data analysis: with all the educational courses, specializations and programs in data science available, it will not be difficult to find a suitable option. You can find my course recommendations.

In my opinion, the best knowledge and skills for working in this area can be obtained at higher education institutions in programs such as “Applied Mathematics”, “Informatics” and “Mathematical Statistics”.

The reason is that a Data Scientist is, above all, a person who knows mathematics. Data analysis, its technologies and Big Data are all areas of knowledge that use basic mathematics as their foundation.

Many people believe that mathematical disciplines are not really needed in practice. In reality this is not the case.

Let me give you an example from our experience. At E-Contenta, we focus on recommender systems. A programmer may know that matrix decompositions can be used to solve the problem of video recommendations, know the library for his favorite programming language where this matrix decomposition is implemented, but have absolutely no understanding of how it works and what the limitations are. This results in the method being applied in a suboptimal manner or in places where it should not be applied, reducing the overall quality of the system.

A good understanding of the mathematical foundations of these methods and knowledge of their relationship to real-life concrete algorithms would avoid such problems.
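
To show what is hidden inside such a library call, here is a minimal sketch of matrix factorization trained by stochastic gradient descent. It is not E-Contenta's method: the function names, the hyperparameters and the tiny ratings set are all hypothetical and chosen only for illustration.

```python
import random


def predict(users, items, u, i):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(p * q for p, q in zip(users[u], items[i]))


def mse(ratings, users, items):
    """Mean squared error over the observed (user, item, rating) triples."""
    return sum((r - predict(users, items, u, i)) ** 2
               for u, i, r in ratings) / len(ratings)


def factorize(ratings, n_users, n_items, rank=2, steps=1000,
              learning_rate=0.01, reg=0.02, seed=0):
    """Learn low-rank user and item factors from observed ratings only."""
    rng = random.Random(seed)
    users = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - predict(users, items, u, i)
            for k in range(rank):
                pu, qi = users[u][k], items[i][k]
                # Gradient step on squared error with L2 regularization.
                users[u][k] += learning_rate * (err * qi - reg * pu)
                items[i][k] += learning_rate * (err * pu - reg * qi)
    return users, items


# Hypothetical ratings: (user index, video index, rating from 1 to 5).
RATINGS = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 5), (2, 3, 2), (0, 3, 1), (1, 1, 5)]

users, items = factorize(RATINGS, n_users=3, n_items=4)
print(round(mse(RATINGS, users, items), 3))
# Score for a pair that was never observed: user 2, video 0.
print(round(predict(users, items, 2, 0), 2))
```

Even this toy version exposes the limitations the paragraph above alludes to: the rank, the regularization strength and the learning rate all have to be chosen, and a user or video with no ratings at all gets a prediction that is driven purely by the random initialization.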

By the way, admission to many professional courses and Big Data programs requires solid mathematical preparation.

“What if I didn’t study mathematics or studied it so long ago that I don’t remember anything?” - you ask. “That is no reason to give up on a career as a Data Scientist,” I will answer.

There are many introductory courses and tools for beginners that allow you to refresh or improve your knowledge in any of the above disciplines. For example, for those who would like to acquire or refresh their knowledge of mathematics and algorithms, my colleagues and I have developed a special course, GoTo Course. The program includes a basic course in higher mathematics, probability theory, algorithms and data structures, delivered as lectures and seminars by experienced practitioners. Special attention is devoted to how the theory applies to practical, real-life problems. The course will help you prepare to study data analysis and machine learning at an advanced level and to solve problems in interviews.

We continue our series of analytical studies of the demand for skills in the labor market. This time, thanks to Pavel Surmenok (sharky), we will look at a new profession - Data Scientist.

In recent years, the term Data Science has been gaining popularity. A lot is written about it, and it is discussed at conferences. Some companies even hire people for positions with the resounding title of Data Scientist. What is Data Science? And who are Data Scientists?

Who are Data Scientists?

If you ask a San Francisco resident this question, you might get the answer that a Data Scientist is a statistician living in San Francisco. Funny, although not very reassuring for those who do not live in San Francisco, right? Okay, then another definition: a Data Scientist is someone who understands statistics better than any programmer and programming better than any statistician. This version is much closer to the truth. A Data Scientist is a kind of hybrid of a statistician and a programmer. Moreover, both statisticians and programmers vary a great deal, so it is better to think of this profession as a wide spectrum running from pure statisticians to pure programmers.

Robert Chang, a Data Scientist at Twitter, divides representatives of his profession into 2 groups: Type A Data Scientist vs. Type B Data Scientist.

Type A, where A is Analysis. These people are mostly in the business of extracting meaning from static data. They are very similar to statisticians, they can even be statisticians and simply change their job title to Data Scientist, and, as we know, just changing the job title can give a significant increase in salary, plus honor and respect. But in addition to statistics, they also know practical aspects: how to clean data, how to work with large data sets, how to visualize data and describe the results of their work.

Type B, where B is Building. These people also know statistics, but they are strong and experienced programmers as well. They are more interested in applying data in real systems. They often build models that interact with users, for example systems that recommend products, films or advertising.

Data Science also overlaps somewhat with fields such as Machine Learning and Artificial Intelligence; people working in those fields are close to Type B Data Scientists.

What should those who want to become a Data Scientist study, what skills are needed? Let's take a look at what requirements American employers had for candidates for positions in the fields of Data Science and Machine Learning.

Data Scientist Hard Skills

Let's start with an analysis of the requirements for possessing professional skills (hard skills).

As you can see from the ranking, the most in-demand skills are fundamental knowledge of mathematics, statistics, Computer Science and machine learning. In addition to theoretical knowledge, a Data Scientist must be able to mine, clean, model and visualize data. Experience in software development and quality management is also important.

Data Science Tools and Technologies

The main tools of a Data Scientist are the Python and R programming languages.

R is a specialized programming language for statistical computing, which is why it is so beloved by statisticians and data scientists. It allows you to quickly load a data set, calculate basic statistical characteristics, visualize data, and build data models.

Python, although a general-purpose programming language, has a huge number of quality libraries and frameworks for Data Science and Machine Learning.

What’s noteworthy is that 39% of vacancies require knowledge of both R and Python at the same time, so it’s better to learn both languages at once rather than try to choose one of them.

To work with big data, employers prefer to use Hadoop and Spark. Popular databases include MySQL and MongoDB.

Data Scientist Soft Skills

General competencies (soft skills) are in less demand compared to professional skills, as they are mentioned in vacancies less than half as often. Average salaries for vacancies that require soft skills are also significantly, approximately 20%, lower than those that require hard skills and knowledge of technology.

However, among the soft skills encountered, the most important are the following: the ability to communicate, visualize data, make presentations, write and speak effectively. Teamwork, management and problem solving skills are also useful.

Data Scientist Domain Knowledge

Some jobs require subject matter knowledge ranging from physics and biology to real estate and hospitality. Here the leaders are economics, marketing and medicine.

Data Scientists Specializations

Before starting the study, we intended to identify the subspecializations of the Data Scientist profession. For example, separate those who primarily engage in data analysis and visualization from those who build predictive analytics models or machine learning algorithms. But, as it turned out during the data analysis, the requirements for most vacancies are quite homogeneous, and there is no clear division into specialties.

Some patterns are interesting, though. For example, if a vacancy requires knowledge of Python or C++, it is unlikely to require communication and management skills, and vice versa.

The impact of technology on wages

The O'Reilly 2015 Data Science Salary Survey gives us a different perspective on the job market. The study is based on a survey of 600 Data Scientists, and the data collected includes salary levels, demographic information and the amount of time data scientists spend on different types of tasks. The key findings of this study are:

  • SQL, Excel, R and Python are the key tools, and this list has not changed for 3 years.
  • Spark and Scala are growing in popularity.
  • Those who previously used specialized commercial tools are shifting to R.
  • Those who previously used R, in turn, are switching to Python, which is now in the lead.
  • Among all industries, salaries are highest in Software Development.
  • Cloud Computing continues to be in demand.

We recommend reading the report in its entirety. Among other things, it describes a mathematical model of how a Data Scientist’s salary depends on where they live, what education they have and what tasks they work on. For example, Data Scientists who spend more time in meetings earn more, and those who spend more than 4 hours a day studying data earn less.

How to study Data Science?

In recent years, many online courses on this topic have appeared, and they are a very good way to begin!

If you are leaning more towards data analysis, then a good option is the Data Science specialization courses on Coursera: Launch Your Career in Data Science. The specialization is not free, but if you don't need a certificate, you can take all of these courses for free: just look at the course name and use search to find the course.

For those interested in Machine Learning, we recommend the course by Andrew Ng, Chief Scientist at Baidu Research, who is also a part-time lecturer at Stanford and a co-founder of Coursera: Machine Learning.

What is Data Science?

Data Science is a new field of activity, so the requirements for Data Scientists have not yet been fully formed. Given the dynamism of our time, it is possible that Data Science will never become an independent profession that will be taught in universities, but will remain a set of practices and skills. But these are exactly the practices and skills that will be in great demand in the coming years.