Variation series: grouping data and constructing a distribution series

As a result of mastering this chapter, the student must:

know

  • indicators of variation and their relationship;
  • basic laws of distribution of characteristics;
  • the essence of goodness-of-fit criteria;

be able to

  • calculate indicators of variation and goodness-of-fit criteria;
  • determine distribution characteristics;
  • evaluate the basic numerical characteristics of statistical distribution series;

have command of

  • methods of statistical analysis of distribution series;
  • the basics of analysis of variance;
  • techniques for checking whether statistical distribution series conform to the basic laws of distribution.

Variation indicators

In the statistical study of populations, it is of great interest to examine both the variation of a characteristic across individual units and the nature of the distribution of units by that characteristic. Variation is the difference in individual values of a characteristic among the units of the population being studied. The study of variation is of great practical importance. The degree of variation makes it possible to judge the limits within which a characteristic varies, the homogeneity of the population with respect to that characteristic, how typical the average is, and how the factors that determine the variation are related. Variation indicators are used to characterize and organize statistical populations.

The results of summarizing and grouping statistical observation data, presented as statistical distribution series, are an ordered distribution of the units of the population into groups according to a grouping (varying) characteristic. If a qualitative characteristic is taken as the basis for grouping, the distribution series is called attributive (distribution by profession, gender, color, etc.). If the series is built on a quantitative characteristic, it is called a variation series (distribution by height, weight, wage level, etc.). To construct a variation series means to order the quantitative distribution of population units by the values of the characteristic, count the number of units with each value (the frequency), and arrange the results in a table.

Instead of the frequency of a variant, its ratio to the total number of observations can be used; this ratio is called the relative frequency.

There are two types of variation series: discrete and interval. A discrete series is a variation series built on a characteristic that changes discontinuously (a discrete characteristic), such as the number of employees at an enterprise, the wage grade, or the number of children in a family. A discrete variation series is a table with two columns: the first gives the specific values of the characteristic, and the second gives the number of population units with each value. If a characteristic changes continuously (income, length of service, cost of an enterprise's fixed assets, and so on, which can take any value within certain limits), an interval variation series can be constructed for it. The table for an interval series also has two columns: the first gives the values of the characteristic as intervals "from ... to" (the variants), and the second gives the number of units falling into each interval (the frequency). The frequency is the number of times a particular variant of the characteristic occurs. Intervals can be closed or open. Closed intervals are bounded on both sides, i.e. they have both a lower ("from") and an upper ("to") boundary. Open intervals have only one boundary, either upper or lower. If the variants are arranged in ascending or descending order, the series is called ranked.

Two cumulative characteristics are used for variation series: the cumulative frequency and the cumulative relative frequency. The cumulative frequency shows how many observations have a value of the characteristic no greater than a given one; it is obtained by adding the frequency of a given group to the frequencies of all preceding groups. The cumulative relative frequency is the proportion of observation units whose values do not exceed the upper boundary of the given group, that is, the proportion of variants in the population with a value no greater than the given one. Frequency, relative frequency, absolute and relative density, cumulative frequency, and cumulative relative frequency are all characteristics of the magnitude of a variant.

The variation of a characteristic across the units of a population and the nature of its distribution are studied using the indicators and characteristics of the variation series: the mean level of the series, the mean linear deviation, the standard deviation, the variance, and the coefficients of oscillation, variation, asymmetry (skewness), and kurtosis, among others.

Averages are used to characterize the center of the distribution. An average is a generalizing statistical characteristic that quantifies the typical level of a characteristic among the members of the population being studied. However, arithmetic means can coincide even when the distributions differ in shape. For this reason, the so-called structural averages are also calculated as characteristics of a variation series: the mode, the median, and the quantiles that divide the distribution series into equal parts (quartiles, deciles, percentiles, etc.).

The mode is the value of a characteristic that occurs in the distribution series more often than any other value. For a discrete series, it is the variant with the highest frequency. In an interval variation series, the interval containing the mode, the so-called modal interval, must be found first. In a series with equal intervals, the modal interval is the one with the highest frequency; in a series with unequal intervals, it is the one with the highest distribution density. The mode of a series with equal intervals is then found by the formula

$$Mo = x_{Mo} + h \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $Mo$ is the mode; $x_{Mo}$ is the lower boundary of the modal interval; $h$ is the width of the modal interval; $f_{Mo}$ is the frequency of the modal interval; $f_{Mo-1}$ is the frequency of the pre-modal interval; and $f_{Mo+1}$ is the frequency of the post-modal interval. For a series with unequal intervals, the distribution densities $m_{Mo}$, $m_{Mo-1}$, $m_{Mo+1}$ are used in this formula in place of the frequencies $f_{Mo}$, $f_{Mo-1}$, $f_{Mo+1}$.
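The interpolation step for the mode can be illustrated with a minimal Python sketch. It assumes the standard form Mo = x_Mo + h · (f_Mo − f_Mo−1) / ((f_Mo − f_Mo−1) + (f_Mo − f_Mo+1)) and equal interval widths; the function name and the sample frequencies are hypothetical, and a missing neighbouring interval is treated as having zero frequency.

```python
def interval_mode(lower_bounds, width, freqs):
    """Mode of an equal-width interval series by linear interpolation."""
    # Index of the modal interval (the first one with the highest frequency).
    m = max(range(len(freqs)), key=lambda i: freqs[i])
    f_mo = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0               # pre-modal frequency
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0  # post-modal frequency
    denom = (f_mo - f_prev) + (f_mo - f_next)
    if denom == 0:
        # Assumption: for a flat top, fall back to the midpoint of the interval.
        return lower_bounds[m] + width / 2
    return lower_bounds[m] + width * (f_mo - f_prev) / denom


# Hypothetical series: five intervals of width 10 starting at 0.
lower = [0, 10, 20, 30, 40]
freqs = [4, 10, 20, 12, 4]
mode = interval_mode(lower, 10, freqs)
print(round(mode, 2))
```

The modal interval here is 20 to 30, and the mode is pulled toward the neighbour with the higher frequency.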

If a distribution has a single mode, the probability distribution of the random variable is called unimodal; if it has more than one mode, it is called multimodal (polymodal), and in the case of two modes, bimodal. As a rule, multimodality indicates that the distribution does not follow the normal law. Homogeneous populations are usually characterized by single-peaked distributions, so several peaks also point to heterogeneity of the population being studied. When two or more peaks appear, the data should be regrouped to identify more homogeneous groups.

In an interval variation series, the mode can be determined graphically using a histogram. To do this, draw two intersecting lines from the top points of the highest column of the histogram to the top points of two adjacent columns. Then, from the point of their intersection, a perpendicular is lowered onto the abscissa axis. The value of the feature on the x-axis corresponding to the perpendicular is the mode. In many cases, when characterizing a population as a generalized indicator, preference is given to the mode rather than the arithmetic mean.

The median is the central value of the characteristic: the value held by the central member of the ranked distribution series. In a discrete series, the position (ordinal number) of the median is determined first. If the number of units is odd, one is added to the sum of all frequencies and the result is divided by two. If the number of units is even, there are two central units, and the median is defined as the average of their values. Thus, the median of a discrete variation series is the value that divides the series into two parts containing the same number of variants.

In an interval series, once the position of the median has been determined, the median interval is found from the cumulative frequencies (or cumulative relative frequencies), and the median itself is then calculated by the formula

$$Me = x_{Me} + h \cdot \frac{\frac{1}{2}\sum f_i - S_{Me-1}}{f_{Me}},$$

where $Me$ is the median; $x_{Me}$ is the lower boundary of the median interval; $h$ is the width of the median interval; $\sum f_i$ is the sum of the frequencies of the distribution series; $S_{Me-1}$ is the cumulative frequency of the pre-median interval; and $f_{Me}$ is the frequency of the median interval.
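The search for the median interval and the interpolation inside it can be sketched in Python as follows. The sketch assumes the standard form Me = x_Me + h · (Σf/2 − S_Me−1) / f_Me and equal interval widths; the function name and the frequencies are hypothetical.

```python
def interval_median(lower_bounds, width, freqs):
    """Median of an equal-width interval series from cumulative frequencies."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("the series has no observations")
    half = total / 2
    cum = 0  # cumulative frequency of the intervals already passed
    for lo, f in zip(lower_bounds, freqs):
        # The median interval is the first one whose cumulative
        # frequency reaches half of the total.
        if f > 0 and cum + f >= half:
            return lo + width * (half - cum) / f
        cum += f
    raise ValueError("median interval not found")


# Hypothetical series: five intervals of width 10 starting at 0.
lower = [0, 10, 20, 30, 40]
freqs = [4, 10, 20, 12, 4]
median = interval_median(lower, 10, freqs)
print(median)
```

With these frequencies the cumulative totals are 4, 14, 34, so the third interval (20 to 30) is the median interval.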

The median can also be found graphically using the cumulative frequency curve (ogive). On the scale of cumulative frequencies (or cumulative relative frequencies), a straight line parallel to the abscissa axis is drawn from the point corresponding to the position of the median until it intersects the curve. From the point of intersection, a perpendicular is dropped to the abscissa axis. The value of the characteristic on the x-axis at the foot of this perpendicular is the median.

The median is characterized by the following properties.

  • 1. It does not depend on the values of the characteristic located on either side of it.
  • 2. It has the minimality property: the sum of the absolute deviations of the values of the characteristic from the median is smaller than the sum of their absolute deviations from any other value.
  • 3. When two distributions with known medians are combined, the median of the new distribution cannot be predicted in advance.

These properties of the median are widely used when designing the location of public service points - schools, clinics, gas stations, water pumps, etc. For example, if it is planned to build a clinic in a certain block of the city, then it would be more expedient to locate it at a point in the block that halves not the length of the block, but the number of residents.

The relationship between the mode, the median, and the arithmetic mean indicates the nature of the distribution of the characteristic in the population and makes it possible to assess the symmetry of the distribution. If $Mo < Me < \bar{x}$, the series has right-sided asymmetry; if $\bar{x} < Me < Mo$, it has left-sided asymmetry. For a normal distribution, $\bar{x} = Me = Mo$.

K. Pearson, by fitting various types of curves, established that for moderately asymmetric distributions the following approximate relationship between the arithmetic mean, the median, and the mode holds:

$$\bar{x} - Mo \approx 3(\bar{x} - Me),$$

where $Me$ is the median; $Mo$ is the mode; and $\bar{x}$ is the arithmetic mean.

If the structure of the variation series needs to be studied in more detail, values of the characteristic analogous to the median are calculated. These values divide all units of the distribution into groups of equal size and are called quantiles (or gradients). Quantiles include quartiles, deciles, percentiles, and so on.

Quartiles divide the population into four equal parts. The first quartile is calculated in the same way as the median, after the first quartile interval has been determined:

$$Q_1 = x_{Q_1} + h \cdot \frac{\frac{1}{4}\sum f_i - S_{Q_1-1}}{f_{Q_1}},$$

where $Q_1$ is the first quartile; $x_{Q_1}$ is the lower boundary of the first quartile interval; $h$ is the width of the first quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_1-1}$ is the cumulative frequency of the interval preceding the first quartile interval; and $f_{Q_1}$ is the frequency of the first quartile interval.

The first quartile shows that 25% of the population units are less than its value and 75% are greater. The second quartile is equal to the median, i.e. $Q_2 = Me$.

The third quartile is calculated by analogy, after the third quartile interval has been found:

$$Q_3 = x_{Q_3} + h \cdot \frac{\frac{3}{4}\sum f_i - S_{Q_3-1}}{f_{Q_3}},$$

where $x_{Q_3}$ is the lower boundary of the third quartile interval; $h$ is the width of the third quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_3-1}$ is the cumulative frequency of the interval preceding the third quartile interval; and $f_{Q_3}$ is the frequency of the third quartile interval.

The third quartile shows that 75% of the population units are less than its value, and 25% are more.

The difference between the third and first quartiles is the interquartile range:

$$\Delta Q = Q_3 - Q_1,$$

where $\Delta Q$ is the interquartile range; $Q_3$ is the third quartile; and $Q_1$ is the first quartile.
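Since the quartiles are computed like the median, with 1/4 or 3/4 of the total frequency in place of 1/2, one general routine covers all of them. The following Python sketch assumes equal interval widths and the interpolation form x_Q + h · (p·Σf − S_prev) / f_Q; the function name and the data are hypothetical.

```python
def interval_quantile(lower_bounds, width, freqs, p):
    """Quantile of order p (0 < p < 1) for an equal-width interval series."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("the series has no observations")
    target = p * total  # position of the quantile on the cumulative scale
    cum = 0             # cumulative frequency of the preceding intervals
    for lo, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= target:
            return lo + width * (target - cum) / f
        cum += f
    raise ValueError("p must lie between 0 and 1")


# Hypothetical series: five intervals of width 10 starting at 0.
lower = [0, 10, 20, 30, 40]
freqs = [4, 10, 20, 12, 4]
q1 = interval_quantile(lower, 10, freqs, 0.25)
q2 = interval_quantile(lower, 10, freqs, 0.50)  # coincides with the median
q3 = interval_quantile(lower, 10, freqs, 0.75)
iqr = q3 - q1                                   # interquartile range
print(q1, q2, round(q3, 2), round(iqr, 2))
```

The same function gives deciles (p = 0.1, ..., 0.9) and percentiles without further changes.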

Deciles divide the population into 10 equal parts. A decile is a value of the characteristic in the distribution series that corresponds to a tenth of the population size. By analogy with quartiles, 10% of the population units lie below the first decile and 90% above it, while 90% of the units lie below the ninth decile and 10% above it. The ratio of the ninth decile to the first, known as the decile coefficient, is widely used in studies of income differentiation to compare the income levels of the most affluent 10% and the least affluent 10% of the population. Percentiles divide the ranked population into 100 equal parts. Their calculation, meaning, and use are analogous to those of deciles.

Quartiles, deciles, and other structural characteristics can be determined graphically from the cumulative frequency curve, in the same way as the median.

To measure the size of variation, the following indicators are used: the range of variation, the mean linear deviation, the standard deviation, and the variance. The range of variation depends entirely on the chance position of the extreme members of the series. This indicator is of interest when it is important to know the amplitude of fluctuations in the values of a characteristic:

$$R = x_{\max} - x_{\min},$$

where $R$ is the range of variation; $x_{\max}$ is the maximum value of the characteristic; and $x_{\min}$ is the minimum value of the characteristic.

The range of variation ignores the values of the vast majority of the members of the series, whereas variation is associated with every member. Indicators obtained by averaging the deviations of individual values from their mean, namely the mean linear deviation and the standard deviation, do not have this drawback. There is a direct relationship between the individual deviations from the mean and the variability of a characteristic: the stronger the variability, the greater the absolute size of the deviations from the mean.

The mean linear deviation is the arithmetic mean of the absolute values of the deviations of individual variants from their mean.

Mean linear deviation for ungrouped data:

$$\bar{d} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n},$$

where $\bar{d}$ is the mean linear deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic in the population being studied; and $n$ is the number of units in the population.

Mean linear deviation for a grouped series:

$$\bar{d} = \frac{\sum |x_i - \bar{x}| \, f_i}{\sum f_i},$$

where $\bar{d}$ is the mean linear deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic in the population being studied; and $f_i$ is the number of population units in a given group.
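Both variants of the mean linear deviation differ only in whether weights are applied, which a short Python sketch makes explicit. The function name and the sample values are hypothetical; with no weights given, every value is assumed to have weight 1.

```python
def mean_linear_deviation(values, weights=None):
    """Mean of absolute deviations from the (weighted) arithmetic mean."""
    if weights is None:
        weights = [1] * len(values)  # ungrouped data: every value counts once
    n = sum(weights)
    mean = sum(x * f for x, f in zip(values, weights)) / n
    return sum(abs(x - mean) * f for x, f in zip(values, weights)) / n


# Ungrouped data (hypothetical values).
d_simple = mean_linear_deviation([2, 4, 6, 8])

# Grouped data: variants with their frequencies (hypothetical values).
d_grouped = mean_linear_deviation([3, 4, 6, 9], [1, 3, 5, 1])

print(d_simple, round(d_grouped, 2))
```

Taking absolute values is essential: without them the deviations from the mean cancel out and the sum is zero.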

The signs of the deviations are ignored here, because otherwise the sum of all deviations would equal zero. Depending on how the data are organized, the mean linear deviation is calculated by different formulas: one for grouped and one for ungrouped data. Because of its conventional nature, the mean linear deviation is used relatively rarely in practice on its own, apart from other indicators of variation (in particular, to characterize the fulfillment of contractual obligations on uniformity of deliveries, and in the analysis of foreign trade turnover, staff composition, production rhythm, and product quality with regard to the technological features of production).

The standard deviation shows by how much, on average, the individual values of the characteristic deviate from the population mean, and it is expressed in the units of measurement of the characteristic. As one of the main measures of variation, the standard deviation is widely used to assess the limits of variation of a characteristic in a homogeneous population, to determine the ordinates of the normal distribution curve, and in calculations related to organizing sample observation and establishing the accuracy of sample characteristics. For ungrouped data it is calculated as follows: each deviation from the mean is squared, the squares are summed, the sum is divided by the number of members of the series, and the square root of the quotient is taken:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}},$$

where $\sigma$ is the standard deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic in the population being studied; and $n$ is the number of units in the population.

For grouped data, the standard deviation is calculated using the weighted formula

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}},$$

where $\sigma$ is the standard deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic in the population being studied; and $f_i$ is the number of population units in a given group.

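The expression under the root in both cases is called the variance. Thus, the variance is the mean square of the deviations of the values of the characteristic from their mean. For unweighted (simple) values of the characteristic, the variance is

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n};$$

for weighted values,

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}.$$

There is also a simplified method of calculating the variance. In general form,

$$\sigma^2 = \overline{x^2} - (\bar{x})^2;$$

for unweighted (simple) values of the characteristic,

$$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2;$$

for weighted values of the characteristic,

$$\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \left(\frac{\sum x_i f_i}{\sum f_i}\right)^2;$$

and using the method of counting from a conditional zero,

$$\sigma^2 = \frac{\sum \left(\frac{x_i - A}{h}\right)^2 f_i}{\sum f_i} \cdot h^2 - (\bar{x} - A)^2,$$

where $\sigma^2$ is the variance; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic; $h$ is the width of the group interval; $f_i$ is the weight; and $A$ is the conditional zero, a constant chosen for convenience.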
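The agreement between the definition of the variance and the simplified calculation methods can be checked numerically. The Python sketch below assumes the weighted definition σ² = Σ(x − x̄)²f / Σf, the shortcut "mean of squares minus square of the mean", and the conditional-zero form with a constant A and interval width h; the function names and the data (interval midpoints with frequencies) are hypothetical.

```python
import math


def weighted_mean(values, weights):
    return sum(x * f for x, f in zip(values, weights)) / sum(weights)


def variance_by_definition(values, weights):
    mean = weighted_mean(values, weights)
    return sum((x - mean) ** 2 * f for x, f in zip(values, weights)) / sum(weights)


def variance_shortcut(values, weights):
    # Mean of the squares minus the square of the mean.
    mean_of_squares = weighted_mean([x * x for x in values], weights)
    return mean_of_squares - weighted_mean(values, weights) ** 2


def variance_conditional_zero(values, weights, a, h):
    # Deviations are taken from the constant a and scaled by the width h.
    scaled = weighted_mean([((x - a) / h) ** 2 for x in values], weights)
    return scaled * h ** 2 - (weighted_mean(values, weights) - a) ** 2


midpoints = [5, 15, 25, 35, 45]  # hypothetical interval midpoints
freqs = [4, 10, 20, 12, 4]       # hypothetical frequencies

v1 = variance_by_definition(midpoints, freqs)
v2 = variance_shortcut(midpoints, freqs)
v3 = variance_conditional_zero(midpoints, freqs, a=25, h=10)
sigma = math.sqrt(v1)            # standard deviation
print(round(v1, 2), round(v2, 2), round(v3, 2), round(sigma, 2))
```

All three routes give the same variance; the conditional-zero form is convenient by hand because the scaled deviations become small integers.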

The variance is an indicator in its own right in statistics and is one of the most important measures of variation. It is measured in the square of the units of the characteristic being studied.

The variance has the following properties.

  • 1. The variance of a constant is zero.
  • 2. Reducing all values of the characteristic by the same amount A does not change the variance. This means the mean square of the deviations can be calculated not from the given values of the characteristic but from their deviations from some constant.
  • 3. Reducing all values of the characteristic by a factor of k reduces the variance by a factor of k² and the standard deviation by a factor of k. Thus all values can be divided by some constant (say, the width of the series interval), the standard deviation calculated, and the result then multiplied by that constant.
  • 4. The mean square of the deviations calculated from any value A other than the arithmetic mean is always greater than the mean square of the deviations calculated from the arithmetic mean. It is greater by a definite amount: the square of the difference between the mean and this conventionally chosen value.

The variation of an alternative characteristic consists in the presence or absence of the property under study in the units of the population. Quantitatively, it is expressed by two values: the presence of the property in a unit is denoted by one (1) and its absence by zero (0). The proportion of units that have the property is denoted by p, and the proportion that do not by q = 1 − p. The variance of an alternative characteristic equals the product of these two proportions: σ² = pq. The variation is greatest when half of the population has the characteristic and the other half does not; the variance then reaches its maximum value of 0.25, i.e. p = 0.5, q = 1 − p = 1 − 0.5 = 0.5 and σ² = 0.5 · 0.5 = 0.25. The lower bound of this indicator is zero, which corresponds to a population with no variation. In practice, the variance of an alternative characteristic is used to construct confidence intervals in sample observation.
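A short Python sketch shows that the product of the two proportions is the ordinary variance of 0/1 data and that it peaks at 0.25. The function name and the ten-unit sample are hypothetical; the proportion of units without the property is computed as 1 − p.

```python
import statistics


def alternative_variance(p):
    """Variance of a 0/1 characteristic with share p of ones."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return p * (1 - p)


# Hypothetical population: 3 units have the property, 7 do not.
units = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
p = sum(units) / len(units)
by_formula = alternative_variance(p)
by_definition = statistics.pvariance(units)  # population variance of the 0/1 data

# Scan proportions 0.0, 0.1, ..., 1.0 to locate the maximum.
shares = [i / 10 for i in range(11)]
p_max = max(shares, key=alternative_variance)
print(round(by_formula, 2), round(by_definition, 2), p_max, alternative_variance(p_max))
```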

The smaller the variance and the standard deviation, the more homogeneous the population and the more typical the average. In statistical practice, it is often necessary to compare the variation of different characteristics. For example, it is interesting to compare the variation in the age of workers and in their qualifications, in length of service and in wages, in cost and in profit, in length of service and in labor productivity, and so on. Indicators of absolute variability are unsuitable for such comparisons: the variability of work experience, expressed in years, cannot be compared with the variation of wages, expressed in rubles. For such comparisons, and for comparing the variability of the same characteristic in several populations with different arithmetic means, relative indicators of variation are used: the coefficient of oscillation, the linear coefficient of variation, and the coefficient of variation, which measure the fluctuation of values around the mean.

Coefficient of oscillation:

$$V_R = \frac{R}{\bar{x}} \cdot 100\%,$$

where $V_R$ is the coefficient of oscillation; $R$ is the range of variation; and $\bar{x}$ is the mean value of the characteristic in the population being studied.

Linear coefficient of variation:

$$V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%,$$

where $V_{\bar{d}}$ is the linear coefficient of variation; $\bar{d}$ is the mean linear deviation; and $\bar{x}$ is the mean value of the characteristic in the population being studied.

Coefficient of variation:

$$V_\sigma = \frac{\sigma}{\bar{x}} \cdot 100\%,$$

where $V_\sigma$ is the coefficient of variation; $\sigma$ is the standard deviation; and $\bar{x}$ is the mean value of the characteristic in the population being studied.

The coefficient of oscillation is the ratio of the range of variation to the mean value of the characteristic, expressed as a percentage; the linear coefficient of variation is the ratio of the mean linear deviation to the mean, expressed as a percentage; and the coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage. As a relative value, the coefficient of variation is used to compare the degree of variation of different characteristics and to assess the homogeneity of a statistical population. If the coefficient of variation is less than 33%, the population is considered homogeneous and the variation weak. If it is more than 33%, the population is considered heterogeneous and the variation strong; the mean is then atypical and cannot be used as a generalizing indicator of the population. Coefficients of variation are also used to compare the variability of one characteristic in different populations, for example to assess the variation in the length of service of workers at two enterprises. The higher the coefficient, the greater the variation of the characteristic.
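The three relative indicators can be computed together, as in the following Python sketch. It assumes each coefficient is the corresponding absolute measure (range, mean linear deviation, population standard deviation) divided by the mean and multiplied by 100; the function name and the sample are hypothetical.

```python
import statistics


def relative_variation(values):
    """Coefficients of oscillation, linear variation, and variation, in percent."""
    mean = statistics.fmean(values)
    value_range = max(values) - min(values)
    mean_dev = sum(abs(x - mean) for x in values) / len(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return {
        "V_R": 100 * value_range / mean,
        "V_d": 100 * mean_dev / mean,
        "V_sigma": 100 * sigma / mean,
    }


sample = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]  # hypothetical observations
coeffs = relative_variation(sample)
homogeneous = coeffs["V_sigma"] < 33     # the 33% rule of thumb from the text
print({k: round(v, 1) for k, v in coeffs.items()}, homogeneous)
```

Because all three coefficients are dimensionless, they can be compared across characteristics measured in different units.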

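Based on the calculated quartiles, the relative indicator of quartile variation can also be computed by the formula

$$V_Q = \frac{Q_3 - Q_1}{2 Q_2} \cdot 100\%,$$

where $Q_2$ is the second quartile (the median), and $Q_1$ and $Q_3$ are the first and third quartiles.

The interquartile range is determined by the formula

$$\Delta Q = Q_3 - Q_1.$$

The quartile deviation is used instead of the range of variation to avoid the drawbacks of relying on extreme values:

$$Q = \frac{Q_3 - Q_1}{2}.$$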

For variation series with unequal intervals, the distribution density is also calculated. It is the quotient of the frequency or relative frequency divided by the width of the interval. In such series, absolute and relative distribution densities are used: the absolute density is the frequency per unit of interval width, and the relative density is the relative frequency per unit of interval width.

All of the above holds for distribution series whose distribution law is described well by the normal law or is close to it.

Let us call the distinct values in the sample the variants of the series and denote them $x_1, x_2, \ldots$. First the variants are ranked, i.e. arranged in ascending or descending order. Each variant is assigned a weight, i.e. a number that characterizes the contribution of the variant to the whole population. Frequencies or relative frequencies serve as weights.

The frequency $n_i$ of a variant $x_i$ is the number of times the variant occurs in the sample under consideration.

The relative frequency $w_i$ of a variant $x_i$ is the ratio of the frequency of the variant to the sum of the frequencies of all variants. It shows what proportion of the units in the sample have the given variant.

A sequence of variants with their corresponding weights (frequencies or relative frequencies), written in ascending (or descending) order, is called a variation series.

Variation series are either discrete or interval.

For a discrete variation series, point values of the characteristic are specified; for an interval series, the values are specified as intervals. A variation series can show the distribution of frequencies or of relative frequencies, depending on which value is given for each variant.

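A discrete variation series of the frequency distribution has the form:

x_i | x_1 | x_2 | … | x_m
n_i | n_1 | n_2 | … | n_m

The relative frequencies are found by the formula $w_i = n_i / n$, $i = 1, 2, \ldots, m$, where $n = n_1 + n_2 + \ldots + n_m$. The sum of all relative frequencies is equal to one:

$w_1 + w_2 + \ldots + w_m = 1$.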

Example 4.1. For a given set of numbers

4, 6, 6, 3, 4, 9, 6, 4, 6, 6

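construct the discrete variation series of the frequency and relative frequency distributions.

Solution. The size of the population is n = 10. The discrete series of the frequency distribution has the form

x_i | 3 | 4 | 6 | 9
n_i | 1 | 3 | 5 | 1

and the discrete series of the relative frequency distribution has the form

x_i | 3 | 4 | 6 | 9
w_i | 0.1 | 0.3 | 0.5 | 0.1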
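The series for Example 4.1 can be reproduced with a few lines of Python. This is a minimal sketch using the standard library; the variable names are illustrative, and the cumulative frequencies are added only to show how they follow from the same counts.

```python
from collections import Counter
from itertools import accumulate

data = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]      # the set of numbers from Example 4.1
n = len(data)                              # size of the population

counts = Counter(data)
variants = sorted(counts)                  # ranked distinct values
freqs = [counts[x] for x in variants]      # frequencies n_i
rel_freqs = [f / n for f in freqs]         # relative frequencies w_i = n_i / n
cum_freqs = list(accumulate(freqs))        # cumulative frequencies

for x, f, w, c in zip(variants, freqs, rel_freqs, cum_freqs):
    print(x, f, w, c)
```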

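Interval series are written in a similar way.

An interval variation series of the frequency distribution is written as:

Interval | a_1 to a_2 | a_2 to a_3 | … | a_m to a_{m+1}
n_i | n_1 | n_2 | … | n_m

The sum of all frequencies equals the total number of observations, i.e. the size of the population: $n = n_1 + n_2 + \ldots + n_m$.

An interval variation series of the relative frequency distribution has the form:

Interval | a_1 to a_2 | a_2 to a_3 | … | a_m to a_{m+1}
w_i | w_1 | w_2 | … | w_m

The relative frequency is found by the formula $w_i = n_i / n$, $i = 1, 2, \ldots, m$.

The sum of all relative frequencies equals one: $w_1 + w_2 + \ldots + w_m = 1$.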

Interval series are the ones most often used in practice. If the sample is large and its values differ from one another by arbitrarily small amounts, a discrete series for the data will be cumbersome and inconvenient for further study. In this case the data are grouped: the interval containing all values of the characteristic is divided into several partial intervals, and an interval series is obtained by counting the frequency for each interval. Let us set out in more detail the scheme for constructing an interval series, assuming that all partial intervals have the same length.

2.2 Construction of an interval series

To build an interval series you need:

Determine the number of intervals;

Determine the length of the intervals;

Determine the location of the intervals on the axis.

To determine the number of intervals k, Sturges' formula can be used:

$$k = 1 + 3.322 \lg n,$$

where n is the size of the whole population.

For example, if there are 100 values of the characteristic (variants), the formula gives $k = 1 + 3.322 \lg 100 \approx 7.64$, so it is recommended to take about 8 intervals to construct the interval series.
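Sturges' rule is easy to evaluate directly. The sketch below assumes the common form k = 1 + 3.322 · lg n and rounds the result to the nearest whole number; the rounding convention and the function name are assumptions made for illustration.

```python
import math


def sturges_intervals(n):
    """Recommended number of intervals for a sample of size n (Sturges' rule)."""
    if n < 1:
        raise ValueError("sample size must be positive")
    return 1 + 3.322 * math.log10(n)


k_exact = sturges_intervals(100)  # unrounded value for 100 observations
k = round(k_exact)                # assumed convention: round to the nearest integer
print(round(k_exact, 3), k)
```

The rule is only a starting point: as the text notes, the final choice of the number of intervals is usually left to the researcher.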

However, very often in practice the number of intervals is chosen by the researcher himself, taking into account that this number should not be very large so that the series is not cumbersome, but also not very small so as not to lose some properties of the distribution.

The interval length h is determined by the formula

$$h = \frac{x_{\max} - x_{\min}}{k},$$

where $x_{\max}$ and $x_{\min}$ are the largest and smallest values of the variants, respectively.

The quantity $x_{\max} - x_{\min}$ is called the range of the series.

The intervals themselves can be constructed in different ways. One of the simplest is as follows. The beginning of the first interval is taken to be $a_1 = x_{\min} - h/2$. The remaining boundaries are then found by the formula $a_{i+1} = a_i + h$. Obviously, the end of the last interval $a_{m+1}$ must satisfy the condition $a_{m+1} > x_{\max}$.

After all the interval boundaries have been found, the frequencies (or relative frequencies) of the intervals are determined. To do this, all the variants are examined and the number of variants falling into each interval is counted. Let us look at the complete construction of an interval series using an example.

Example 4.2. For the following statistical data, recorded in ascending order, construct an interval series with the number of intervals equal to 5:

11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44, 45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78, 78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96.

Solution. There are n = 50 values in total.

The number of intervals is given in the problem statement: k = 5.

The length of the intervals is

$$h = \frac{96 - 11}{5} = 17.$$

Let us determine the boundaries of the intervals:

a_1 = 11 − 8.5 = 2.5; a_2 = 2.5 + 17 = 19.5; a_3 = 19.5 + 17 = 36.5;

a_4 = 36.5 + 17 = 53.5; a_5 = 53.5 + 17 = 70.5; a_6 = 70.5 + 17 = 87.5;

a_7 = 87.5 + 17 = 104.5.

Because the first interval starts half an interval length below the smallest value, six intervals are needed to cover all the data.

To determine the frequencies of the intervals, we count the number of variants that fall into each interval. For example, the first interval, from 2.5 to 19.5, contains the variants 11, 12, 12, 14, 14, 15. There are 6 of them, so the frequency of the first interval is n_1 = 6 and its relative frequency is w_1 = 6/50 = 0.12. The second interval, from 19.5 to 36.5, contains the variants 21, 21, 22, 23, 25, of which there are 5, so its frequency is n_2 = 5 and its relative frequency is w_2 = 5/50 = 0.1. Finding the frequencies and relative frequencies of all the intervals in the same way, we obtain the following interval series.

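The interval series of the frequency distribution has the form:

Interval | 2.5–19.5 | 19.5–36.5 | 36.5–53.5 | 53.5–70.5 | 70.5–87.5 | 87.5–104.5
n_i | 6 | 5 | 9 | 11 | 8 | 11

The sum of the frequencies is 6 + 5 + 9 + 11 + 8 + 11 = 50.

The interval series of the relative frequency distribution has the form:

Interval | 2.5–19.5 | 19.5–36.5 | 36.5–53.5 | 53.5–70.5 | 70.5–87.5 | 87.5–104.5
w_i | 0.12 | 0.1 | 0.18 | 0.22 | 0.16 | 0.22

The sum of the relative frequencies is 0.12 + 0.1 + 0.18 + 0.22 + 0.16 + 0.22 = 1. ■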
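The whole construction in Example 4.2 can be traced in Python. This is a minimal sketch under the scheme used in the example: interval length h = (x_max − x_min) / k, first boundary x_min − h/2, and further boundaries added until the largest value is covered. Counting with a closed left end and an open right end is an assumption; it does not matter here because no value coincides with a boundary.

```python
data = [11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44,
        45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78,
        78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96]

n = len(data)
k = 5                                    # number of intervals given in the problem
h = (max(data) - min(data)) / k          # interval length

# Boundaries: start half an interval below the minimum, then step by h
# until the last boundary exceeds the maximum.
bounds = [min(data) - h / 2]
while bounds[-1] <= max(data):
    bounds.append(bounds[-1] + h)

# Frequency of each interval: number of values with lo <= x < hi.
freqs = [sum(lo <= x < hi for x in data) for lo, hi in zip(bounds, bounds[1:])]
rel_freqs = [f / n for f in freqs]

for lo, hi, f, w in zip(bounds, bounds[1:], freqs, rel_freqs):
    print(f"{lo} to {hi}: n_i = {f}, w_i = {w}")
```

The shift of the first boundary by half an interval is why the loop produces one more interval than the k used to compute h.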

When constructing interval series, depending on the specific conditions of the problem under consideration, other rules can be applied, namely

1. Interval variation series can consist of partial intervals of different lengths. Unequal lengths of intervals make it possible to highlight the properties of a statistical population with an uneven distribution of the characteristic. For example, if the boundaries of the intervals determine the number of inhabitants in cities, then it is advisable in this problem to use intervals of unequal length. Obviously, for small cities a small difference in the number of inhabitants is important, but for large cities a difference of tens or hundreds of inhabitants is not significant. Interval series with unequal lengths of partial intervals are studied mainly in the general theory of statistics and their consideration is beyond the scope of this manual.

2. In mathematical statistics, interval series are sometimes considered in which the left boundary of the first interval is taken to be –∞ and the right boundary of the last interval is +∞. This is done in order to bring the statistical distribution closer to the theoretical one.

3. When constructing an interval series, the value of some variant may coincide exactly with an interval boundary. The recommended approach is as follows. If there is only one such coincidence, assign the variant, together with its frequency, to the interval lying closer to the middle of the interval series; if there are several, assign all of them either to the interval on their right or all of them to the interval on their left.

4. After the number of intervals and their length h have been determined, the intervals can be positioned in another way. Find the arithmetic mean $\bar{x}$ of all the observed variants and build the first interval so that this sample mean lies at its center, which gives the interval from $\bar{x} - 0.5h$ to $\bar{x} + 0.5h$. Then add intervals of length h to the left and to the right until $x_{\min}$ and $x_{\max}$ fall into the first and last intervals, respectively.

5. Interval series with a large number of intervals are conveniently written vertically, i.e. with the intervals in the first column rather than the first row, and the frequencies (or relative frequencies) in the second column.
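The positioning described in rule 4 can be sketched in Python as follows. This is only an illustration: the function name `intervals_around_mean` and the sample data are hypothetical, and the interval length h is assumed to be known already.

```python
def intervals_around_mean(values, h):
    """Return interval boundaries of length h with one interval centered on the sample mean."""
    mean = sum(values) / len(values)
    # Start with the interval from mean - 0.5h to mean + 0.5h.
    edges = [mean - 0.5 * h, mean + 0.5 * h]
    # Add intervals on the left until the smallest value is covered.
    while edges[0] > min(values):
        edges.insert(0, edges[0] - h)
    # Add intervals on the right until the largest value is covered.
    while edges[-1] < max(values):
        edges.append(edges[-1] + h)
    return edges


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data, mean = 5
edges = intervals_around_mean(sample, h=2)
print(edges)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Here the central interval is (4, 6), and two intervals are added on each side before the smallest and largest values are covered.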

Sample data can be considered as values ​​of some random variable X. A random variable has its own distribution law. From probability theory it is known that the distribution law of a discrete random variable can be specified in the form of a distribution series, and for a continuous one - using the distribution density function. However, there is a universal distribution law that holds for both discrete and continuous random variables. This distribution law is given as a distribution function F(x) = P(X<x). For sample data, you can specify an analogue of the distribution function - the empirical distribution function.
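A minimal sketch of an empirical distribution function, assuming the same convention as F(x) = P(X<x), i.e. the share of sample values strictly less than x. The function name `empirical_cdf` and the sample are hypothetical.

```python
from bisect import bisect_left


def empirical_cdf(sample):
    """Return a function F*(x) = (number of sample values < x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_left gives the count of ordered values strictly less than x.
        return bisect_left(ordered, x) / n

    return F


F = empirical_cdf([5, 3, 2, 3])  # hypothetical sample of size 4
print(F(2), F(3), F(4), F(6))    # 0.0 0.25 0.75 1.0
```

The function is a step function: it is 0 up to the smallest sample value, jumps at each observed variant by its relative frequency, and equals 1 to the right of the largest value.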

An example of solving a test on mathematical statistics

Problem 1

Initial data: the students of a group of 30 people took an exam in the “Informatics” course. The grades they received form the following series of numbers:

I. Let's create a variation series

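Table columns: grade x; frequency $m_x$; relative frequency $w_x$; accumulated frequency $m_x^{\text{acc}}$; accumulated relative frequency $w_x^{\text{acc}}$. Last row: Total.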

II. Graphic representation of statistical information.

III. Numerical characteristics of the sample.

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 | 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Asymmetry

9. Asymmetry coefficient

10. Excess

11. Kurtosis coefficient
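Several of the characteristics listed above can be checked with a short Python sketch. The data are read off the ordered series shown under “Median” (six grades of 2, eleven of 3, nine of 4 and four of 5). It is an assumption here that the variance is taken with divisor n (the population form); the source does not show which formula was used.

```python
import statistics

# Grades reconstructed from the ordered series: 6 twos, 11 threes, 9 fours, 4 fives.
grades = [2] * 6 + [3] * 11 + [4] * 9 + [5] * 4

mean = statistics.mean(grades)           # arithmetic mean, 101/30
median = statistics.median(grades)       # average of the 15th and 16th ordered values
mode = statistics.mode(grades)           # most frequent grade
variance = statistics.pvariance(grades)  # assumption: divisor n, not n - 1
std_dev = statistics.pstdev(grades)
coeff_variation = std_dev / mean * 100   # in percent

print(round(mean, 3), median, mode)
print(round(variance, 3), round(std_dev, 3), round(coeff_variation, 1))
```

The median falls between the 15th and 16th values, both equal to 3, which matches the position of the bar in the ordered series.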

Problem 2

Initial data: the students of a group wrote their final test. The group consists of 30 people. The points scored by the students form the following series of numbers:

Solution

I. Since the characteristic takes many different values, we construct an interval variation series for it. To do this, we first set the interval width h using Sturges' formula:

Let's create an interval scale. In this case, we will take as the upper limit of the first interval the value determined by the formula:

We determine the upper boundaries of subsequent intervals using the following recurrence formula:

We finish constructing the interval scale once the upper boundary of the next interval becomes greater than or equal to the maximum sample value.

II. Graphic display of interval variation series

III. Numerical characteristics of the sample

To determine the numerical characteristics of the sample, we will compose an auxiliary table

Sum:

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

10 11 12 12 13 13 13 13 14 14 14 14 15 15 15 |15 15 15 16 16 16 16 16 17 17 18 19 19 20 20

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Asymmetry

9. Asymmetry coefficient

10. Excess

11. Kurtosis coefficient
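The characteristics listed above can be sketched in Python from the ordered series shown under “Median”. Two assumptions are made here that the source does not confirm: the calculation uses the raw (ungrouped) values rather than the auxiliary table of interval midpoints, and the skewness and kurtosis coefficients are taken as the standardized third and fourth central moments with divisor n. The helper `central_moment` is hypothetical.

```python
import statistics

# Scores copied from the ordered series (30 values).
scores = [10, 11, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 18, 19, 19, 20, 20]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)


def central_moment(data, order):
    """Central moment of the given order with divisor n (assumed convention)."""
    m = sum(data) / len(data)
    return sum((x - m) ** order for x in data) / len(data)


variance = central_moment(scores, 2)
std_dev = variance ** 0.5
coeff_variation = std_dev / mean * 100               # in percent
skewness = central_moment(scores, 3) / std_dev ** 3  # asymmetry coefficient
excess = central_moment(scores, 4) / std_dev ** 4 - 3  # kurtosis (excess) coefficient

print(mean, median, mode)
print(round(variance, 3), round(std_dev, 3), round(coeff_variation, 1))
print(round(skewness, 3), round(excess, 3))
```

Values computed from grouped data will differ slightly from these, because every observation in an interval is then replaced by the interval midpoint.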

Problem 3

Condition: the scale division of an ammeter is 0.1 A. Readings are rounded to the nearest whole division. Find the probability that a reading is made with an error exceeding 0.02 A.

Solution.

The rounding error of a reading can be regarded as a random variable X that is uniformly distributed over the interval between two adjacent whole divisions. The density of the uniform distribution is

$f(x) = \dfrac{1}{b - a}$,

where $b - a$ is the length of the interval containing the possible values of X; outside this interval $f(x) = 0$. In this problem the length of the interval containing the possible values of X is 0.1, so $f(x) = 1/0.1 = 10$.

The reading error exceeds 0.02 if X lies in the interval (0.02; 0.08). Then

$P(0.02 < X < 0.08) = \int_{0.02}^{0.08} 10\,dx = 10 \cdot 0.06 = 0.6$.

Answer: P = 0.6
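The arithmetic of this solution can be checked with a few lines of Python. The variable names are illustrative only. The error exceeds the threshold when the true value lies farther than 0.02 from both neighbouring divisions, i.e. inside (0.02; 0.08).

```python
division = 0.1    # length of the interval between adjacent divisions, A
threshold = 0.02  # admissible error, A

density = 1 / division  # uniform density on an interval of length 0.1
# Favorable values: farther than the threshold from both ends of the interval.
favorable_length = (division - threshold) - threshold
probability = density * favorable_length

print(round(probability, 2))  # 0.6
```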

Problem 4

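Initial data: the mathematical expectation and the standard deviation of a normally distributed characteristic X are 10 and 2, respectively. Find the probability that, as a result of a trial, X takes a value in the interval (12, 14).

Solution.

Let's use the formula

$P(\alpha < X < \beta) = \Phi\left(\dfrac{\beta - a}{\sigma}\right) - \Phi\left(\dfrac{\alpha - a}{\sigma}\right)$,

where a is the mathematical expectation, σ is the standard deviation and Φ is the Laplace function. With a = 10, σ = 2, α = 12 and β = 14 this gives P(12 < X < 14) = Φ(2) – Φ(1) = 0.4772 – 0.3413 = 0.1359.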
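The probability that a normal variable falls into an interval can also be evaluated numerically. The sketch below assumes the usual formula P(α < X < β) = Φ((β – a)/σ) – Φ((α – a)/σ), with the Laplace function expressed through the error function from Python's `math` module; the function names are hypothetical.

```python
import math


def laplace_phi(x):
    """Laplace function: integral of the standard normal density from 0 to x."""
    return 0.5 * math.erf(x / math.sqrt(2))


def normal_interval_probability(mean, sigma, low, high):
    """P(low < X < high) for a normal X with the given mean and standard deviation."""
    return laplace_phi((high - mean) / sigma) - laplace_phi((low - mean) / sigma)


p = normal_interval_probability(mean=10, sigma=2, low=12, high=14)
print(round(p, 4))  # 0.1359
```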


  • A statistical distribution series is an ordered distribution of population units into groups according to a certain varying characteristic.
    Depending on the characteristic underlying the series, a distinction is made between attributive and variational distribution series.

    The presence of a common characteristic is the basis for the formation of a statistical population, which represents the results of describing or measuring the general characteristics of the objects of study.

    The subject of study in statistics is changing (varying) characteristics or statistical characteristics.

    Types of statistical characteristics.

    Distribution series built on qualitative characteristics are called attributive. An attributive characteristic is one that has a name (for example, profession: seamstress, teacher, etc.).
    A distribution series is usually presented in the form of a table. Table 2.8 shows an attributive distribution series.
    Table 2.8 - Distribution of types of legal assistance provided by lawyers to citizens of one of the regions of the Russian Federation.

    A variation series consists of the values of a characteristic (or intervals of values) and their frequencies.
    Variation series are distribution series built on a quantitative characteristic. Any variation series consists of two elements: variants and frequencies.
    Variants are the individual values that the characteristic takes in the variation series.
    Frequencies are the counts of individual variants or of each group of the variation series, i.e. numbers showing how often particular variants occur in the distribution series. The sum of all frequencies gives the size (volume) of the entire population.
    Relative frequencies are frequencies expressed as fractions of one or as percentages of the total. Accordingly, the sum of the relative frequencies is equal to 1 or 100%. A variation series makes it possible to estimate the form of the distribution law from actual data.

    Depending on the nature of the variation of the characteristic, a distinction is made between discrete and interval variation series.
    An example of a discrete variation series is given in Table 2.9.
    Table 2.9 - Distribution of families by the number of occupied rooms in individual apartments in 1989 in the Russian Federation.

    The first column of the table contains the variants of the discrete variation series, the second column contains the frequencies, and the third contains the relative frequencies.

    Variation series

    A certain quantitative characteristic is studied in the general population. A sample of size n is drawn from it at random, that is, the number of sample elements is equal to n. At the first stage of statistical processing the sample is ranked, i.e. the numbers $x_1, x_2, \ldots, x_n$ are arranged in ascending order. Each observed value $x_i$ is called a variant. The frequency $m_i$ is the number of observations of the value $x_i$ in the sample. The relative frequency $w_i$ is the ratio of the frequency $m_i$ to the sample size n: $w_i = m_i / n$.
    When studying variation series, the concepts of accumulated frequency and accumulated relative frequency are also used. Let x be some number. Then the number of variants whose values are less than x is called the accumulated frequency: $m^{\text{acc}}(x) = \sum_{x_i < x} m_i$. The ratio of the accumulated frequency to the sample size n is called the accumulated relative frequency: $w^{\text{acc}}(x) = m^{\text{acc}}(x) / n$.
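As an illustration, the quantities just defined can be computed for a small sample as follows. The sample and the function name `variation_series` are hypothetical; the accumulated frequency is taken, as in the definition above, as the number of observations strictly less than the given variant.

```python
from collections import Counter


def variation_series(sample):
    """Return rows (x_i, m_i, w_i, accumulated m, accumulated w), ordered by x_i."""
    n = len(sample)
    counts = Counter(sample)
    rows = []
    accumulated = 0  # number of observations strictly less than the current variant
    for x in sorted(counts):
        m = counts[x]
        rows.append((x, m, m / n, accumulated, accumulated / n))
        accumulated += m
    return rows


rows = variation_series([3, 1, 2, 3, 3, 2])  # hypothetical sample, n = 6
for row in rows:
    print(row)
```

For this sample the variants are 1, 2, 3 with frequencies 1, 2, 3 and accumulated frequencies 0, 1, 3.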
    A characteristic is called discretely variable if its individual values ​​(variants) differ from each other by a certain finite value (usually an integer). The variation series of such a characteristic is called a discrete variation series.

    Table 1. General view of a discrete variation frequency series

    Characteristic values $x_i$    $x_1$    $x_2$    …    $x_n$
    Frequencies $m_i$              $m_1$    $m_2$    …    $m_n$

    A characteristic is called continuously varying if its values can differ from each other by an arbitrarily small amount, i.e. the characteristic can take any value in a certain interval. The variation series of such a characteristic is called an interval series.

    Table 2. General view of the interval variation series of frequencies

    Table 3. Graphic images of the variation series (columns: type of series; polygon or histogram; empirical distribution function. Rows: discrete; interval)
    By reviewing the results of the observations, one determines how many variant values fall into each interval. Each interval is assumed to include one of its ends: either the left end in all cases (more often) or the right end in all cases, and the frequencies or relative frequencies show the number (or share) of variants contained within the specified boundaries. The intervals from $a_i$ to $a_{i+1}$ are called partial intervals. To simplify subsequent calculations, the interval variation series can be replaced by a conditionally discrete one. In this case the midpoint of the i-th interval is taken as the variant $x_i$, and the corresponding interval frequency $m_i$ as the frequency of this variant.
    For graphical representation of variation series, the most commonly used are polygon, histogram, cumulative curve and empirical distribution function.

    Table 2.3 (Grouping of the Russian population by average per capita income in April 1994) presents an interval variation series.
    It is convenient to analyze distribution series using a graphical image, which allows one to judge the shape of the distribution. A visual representation of the nature of changes in the frequencies of the variation series is given by polygon and histogram.
    The polygon is used when depicting discrete variation series.
    Let us, for example, graphically depict the distribution of housing stock by type of apartment (Table 2.10).
    Table 2.10 - Distribution of the housing stock of the urban area by type of apartment (conditional figures).


    Fig. Housing distribution area


    Not only the frequencies but also the relative frequencies of the variation series can be plotted on the ordinate axis.
    The histogram is used to depict an interval variation series. When constructing a histogram, the values ​​of the intervals are plotted on the abscissa axis, and the frequencies are depicted by rectangles built on the corresponding intervals. The height of the columns in the case of equal intervals should be proportional to the frequencies. A histogram is a graph in which a series is depicted as bars adjacent to each other.
    Let us graphically depict the interval distribution series given in Table 2.11.
    Table 2.11 - Distribution of families by size of living space per person (conditional figures).
    No.     Groups of families by size of living space per person     Number of families     Cumulative number of families
    1       3 – 5                                                     10                     10
    2       5 – 7                                                     20                     30
    3       7 – 9                                                     40                     70
    4       9 – 11                                                    30                     100
    5       11 – 13                                                   15                     115
    TOTAL                                                             115                    ----


    Fig. 2.2. Histogram of the distribution of families by the size of living space per person


    Using the data of the accumulated series (Table 2.11), we construct the cumulative curve (cumulate) of the distribution.


    Fig. 2.3. Cumulative distribution of families by size of living space per person


    Representing a variation series as a cumulative curve is especially effective when its frequencies are expressed as fractions or percentages of the sum of the frequencies of the series.
    If we swap the axes when depicting a variation series as a cumulative curve, we obtain an ogive. Fig. 2.4 shows an ogive constructed from the data in Table 2.11.
    A histogram can be converted into a distribution polygon by finding the midpoints of the upper sides of the rectangles and connecting these points with straight lines. The resulting distribution polygon is shown in Fig. 2.2 with a dotted line.
    When constructing a histogram of the distribution of a variation series with unequal intervals, it is not the frequencies that are plotted along the ordinate axis, but the density of the distribution of the characteristic in the corresponding intervals.
    The distribution density is the frequency calculated per unit of interval width, i.e. how many units of each group fall on one unit of the interval size. An example of calculating the distribution density is presented in Table 2.12.
    Table 2.12 - Distribution of enterprises by number of employees (conditional figures)
    No.     Groups of enterprises by number of employees, persons     Number of enterprises     Interval size, persons     Distribution density
            A                                                         1                         2                          3 = 1 / 2
    1       Up to 20                                                  15                        20                         0.75
    2       20 – 80                                                   27                        60                         0.45
    3       80 – 150                                                  35                        70                         0.5
    4       150 – 300                                                 60                        150                        0.4
    5       300 – 500                                                 10                        200                        0.05
    TOTAL                                                             147                       ----                       ----
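The density column of Table 2.12 is simply column 1 divided by column 2, which a short Python sketch can reproduce (variable names are illustrative). For the 20 – 80 group, for instance, the ratio is 27 / 60 = 0.45.

```python
groups = ["Up to 20", "20 – 80", "80 – 150", "150 – 300", "300 – 500"]
enterprises = [15, 27, 35, 60, 10]      # column 1: number of enterprises
interval_sizes = [20, 60, 70, 150, 200]  # column 2: interval size, persons

# Column 3: distribution density = frequency per unit of interval width.
densities = [m / k for m, k in zip(enterprises, interval_sizes)]

for group, density in zip(groups, densities):
    print(f"{group:>10}: {density:.2f}")
```

Although the 150 – 300 group has the largest frequency, its density is lower than that of the first group, which is why heights in a histogram with unequal intervals are taken from the density and not from the frequency.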

    A cumulative curve can also be used to represent variation series graphically. A cumulate (sum curve) depicts a series of accumulated frequencies. Accumulated frequencies are determined by sequentially summing the frequencies across groups and show how many units in the population have values of the characteristic no greater than the value under consideration.


    Fig. 2.4. Ogive of distribution of families by the size of living space per person

    When constructing the cumulative curve of an interval variation series, the variants of the series are plotted along the abscissa axis, and the accumulated frequencies are plotted along the ordinate axis.

    Continuous variation series

    A continuous variation series is a series constructed on the basis of a continuously varying quantitative characteristic. Example. The average duration of illness of convicts (days per person) in the autumn-winter period this year was:
    7.0 6.0 5.9 9.4 6.5 7.3 7.6 9.3 5.8 7.2
    7.1 8.3 7.5 6.8 7.1 9.2 6.1 8.5 7.4 7.8
    10.2 9.4 8.8 8.3 7.9 9.2 8.9 9.0 8.7 8.5
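One way such data could be turned into an interval series is sketched below. The interval boundaries (5.5, 6.5, …, 10.5) are an assumption chosen for illustration and are not taken from the source; a value that coincides with a boundary is assigned to the interval on its right, i.e. each interval includes its left end.

```python
from bisect import bisect_right

durations = [7.0, 6.0, 5.9, 9.4, 6.5, 7.3, 7.6, 9.3, 5.8, 7.2,
             7.1, 8.3, 7.5, 6.8, 7.1, 9.2, 6.1, 8.5, 7.4, 7.8,
             10.2, 9.4, 8.8, 8.3, 7.9, 9.2, 8.9, 9.0, 8.7, 8.5]

edges = [5.5, 6.5, 7.5, 8.5, 9.5, 10.5]  # assumed boundaries, width 1 day

counts = [0] * (len(edges) - 1)
for x in durations:
    # Index of the interval [edges[i], edges[i + 1]) that contains x.
    i = bisect_right(edges, x) - 1
    counts[i] += 1

n = len(durations)
for left, right, m in zip(edges, edges[1:], counts):
    print(f"{left} – {right}: frequency {m}, relative frequency {m / n:.3f}")
```

With these boundaries the frequencies are 4, 8, 6, 11 and 1. The values 6.5, 7.5 and 8.5 lie exactly on boundaries, so the result depends on the convention chosen for the interval ends.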

    A set of objects or phenomena united by some common feature or property of a qualitative or quantitative nature is called the object of observation.

    Every object of statistical observation consists of individual elements, the units of observation.

    The results of statistical observation are numerical information, i.e. data. Statistical data are information about the values that the characteristic of interest to the researcher took in the statistical population.

    If the values of a characteristic are expressed in numbers, the characteristic is called quantitative.

    If a characteristic describes some property or state of the elements of the population, it is called qualitative.

    If all elements of a population are subject to study (complete observation), the statistical population is called the general population.

    If only part of the elements of the general population is studied, that part is called a sample. A sample is drawn from the general population at random, so that every element has an equal chance of being among the n elements selected.

    The values of a characteristic change (vary) from one element of the population to another, so in statistics the different values of a characteristic are also called variants. Variants are usually denoted by lowercase Latin letters x, y, z.

    The serial number of a variant (value of the characteristic) is called its rank: $x_1$ is the 1st variant (1st value of the characteristic), $x_2$ is the 2nd variant, $x_i$ is the i-th variant.

    A series of values of a characteristic (variants), arranged in ascending or descending order together with their corresponding weights, is called a variation series (distribution series).

    Frequencies or relative frequencies serve as the weights.

    The frequency ($m_i$) shows how many times a particular variant (value of the characteristic) occurs in the statistical population.

    The relative frequency ($w_i$) shows what share of the population units has a particular variant. It is calculated as the ratio of the frequency of the variant to the sum of all frequencies of the series:

    $w_i = \dfrac{m_i}{\sum_{j=1}^{l} m_j}$. (6.1)

    The sum of all relative frequencies is 1:

    $\sum_{i=1}^{l} w_i = 1$. (6.2)

    Variation series are discrete and interval.

    Discrete variation series are usually constructed when the values of the characteristic being studied can differ from each other by no less than a certain finite amount.

    In discrete variation series, point values ​​of the characteristic are specified.

    The general view of the discrete variation series is shown in Table 6.1.

    Table 6.1

    where i = 1, 2, … , l.

    In an interval variation series, each interval has an upper and a lower boundary.

    The difference between the upper and lower boundaries of an interval is called the interval difference, or the length (size) of the interval.

    The size of the first interval $k_1$ is determined by the formula:

    $k_1 = a_2 - a_1$;

    the second: $k_2 = a_3 - a_2$; ...

    the last: $k_{l-1} = a_l - a_{l-1}$.

    In general, the interval difference $k_i$ is calculated by the formula:

    $k_i = x_{i(\max)} - x_{i(\min)}$. (6.3)

    If an interval has both boundaries, then it is called closed .

    The first and last intervals can be open , i.e. have only one border.

    For example, the first interval can be set as “up to 100”, the second - “100-110”, ..., the second to last - “190-200”, the last - “200 and more”. Obviously, the first interval has no lower boundary, and the last one has no upper boundary; both of them are open.

    Often open intervals have to be conditionally closed. To do this, usually the value of the first interval is taken equal to the value of the second, and the value of the last - to the value of the penultimate one. In our example, the value of the second interval is 110-100=10, therefore, the lower limit of the first interval will be conditionally 100-10=90; the value of the penultimate interval is 200-190=10, therefore, the upper limit of the last interval will be conditionally 200+10=210.

    In addition, in an interval variation series there may be intervals of different lengths. If the intervals in a variation series have the same length (interval difference), they are called equal in size , otherwise - unequal in size.

    When constructing an interval variation series, the problem of choosing the size of the intervals (interval difference) often arises.

    To determine the optimal size of the intervals (when the series is constructed with equal intervals), Sturges' formula is used:

    $k = \dfrac{x_{(\max)} - x_{(\min)}}{1 + 3.322 \lg n}$, (6.4)

    where n is the number of units in the population,

    $x_{(\max)}$ and $x_{(\min)}$ are the largest and smallest values of the variants of the series.
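A minimal Python sketch of the interval size given by Sturges' formula, assuming its usual form k = (x(max) – x(min)) / (1 + 3.322 lg n). The function name and the input values are hypothetical.

```python
import math


def sturges_interval_size(x_min, x_max, n):
    """Interval size for a series with equal intervals (Sturges' rule, assumed form)."""
    number_of_intervals = 1 + 3.322 * math.log10(n)
    return (x_max - x_min) / number_of_intervals


# Hypothetical example: 30 observations ranging from 10 to 20.
k = sturges_interval_size(x_min=10, x_max=20, n=30)
print(round(k, 2))  # 1.69
```

In practice the result is usually rounded to a convenient value, so the final boundaries of the intervals remain a judgment call.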

    To characterize a variation series, accumulated frequencies and accumulated relative frequencies are used along with the frequencies and relative frequencies.

    Accumulated frequencies (relative frequencies) show how many units of the population (what share of them) do not exceed a given value (variant) x.

    Accumulated frequencies ($v_i$) for a discrete series can be calculated using the following formula:

    $v_i = m_1 + m_2 + \ldots + m_i$. (6.5)

    For an interval variation series, the accumulated frequency is the sum of the frequencies (relative frequencies) of all intervals up to and including the given one.

    A discrete variation series can be represented graphically using a polygon of frequencies or relative frequencies.

    When constructing a distribution polygon, the values of the characteristic (variants) are plotted along the abscissa axis, and the frequencies or relative frequencies are plotted along the ordinate axis. Points are marked at the intersections of the values of the characteristic and the corresponding frequencies (relative frequencies), and these points are then connected by segments. The resulting broken line is called a polygon of frequencies (relative frequencies).

    Fig. 6.1. Distribution polygon (abscissa: variants $x_1, x_2, \ldots, x_k$)

    Interval variation series can be represented graphically using a histogram, i.e. a bar chart.

    When constructing a histogram, the values of the characteristic being studied (interval boundaries) are plotted along the abscissa axis.

    If the intervals are of equal size, frequencies or relative frequencies can be plotted along the ordinate axis.

    If the intervals have different sizes, the values of the absolute or relative distribution density must be plotted along the ordinate axis.

    Absolute density is the ratio of the frequency of an interval to its size:

    $f(a)_i = \dfrac{m_i}{k_i}$; (6.6)

    where: $f(a)_i$ is the absolute density of the i-th interval;

    $m_i$ is the frequency of the i-th interval;

    $k_i$ is the size of the i-th interval (interval difference).

    Absolute density shows how many population units there are per unit of the interval.

    Relative density is the ratio of the relative frequency of an interval to its size:

    $f(o)_i = \dfrac{w_i}{k_i}$; (6.7)

    where: $f(o)_i$ is the relative density of the i-th interval;

    $w_i$ is the relative frequency of the i-th interval.

    Relative density shows what share of the population units falls on a unit of the interval.
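The relative density can be illustrated with the frequencies and interval sizes of Table 2.12. The sketch below (variable names are illustrative) also checks a useful property: when the relative density is plotted on the ordinate axis, the total area of the histogram bars equals 1.

```python
frequencies = [15, 27, 35, 60, 10]       # m_i from Table 2.12
interval_sizes = [20, 60, 70, 150, 200]  # k_i from Table 2.12

n = sum(frequencies)
relative_frequencies = [m / n for m in frequencies]  # w_i
relative_densities = [w / k for w, k in zip(relative_frequencies, interval_sizes)]

# Area of each bar = relative density * interval size = w_i.
total_area = sum(f * k for f, k in zip(relative_densities, interval_sizes))

print([round(f, 5) for f in relative_densities])
print(round(total_area, 10))  # 1.0
```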


    Both discrete and interval variation series can be represented graphically in the form of cumulates and ogives.

    When constructing a cumulate from the data of a discrete series, the values of the characteristic (variants) are plotted along the abscissa axis, and the accumulated frequencies or accumulated relative frequencies are plotted along the ordinate axis. Points are marked at the intersections of the values of the characteristic (variants) and the corresponding accumulated frequencies (relative frequencies), and these points are then connected by segments or a curve. The resulting broken line (curve) is called a cumulate (cumulative curve).

    When constructing a cumulate from the data of an interval series, the boundaries of the intervals are plotted along the abscissa axis. The abscissas of the points are the upper boundaries of the intervals, and the ordinates are the accumulated frequencies (relative frequencies) of the corresponding intervals. Often one more point is added, whose abscissa is the lower boundary of the first interval and whose ordinate is zero. Connecting the points with segments or a curve gives the cumulate.

    An ogive is constructed in the same way as a cumulate, the only difference being that the accumulated frequencies (relative frequencies) are plotted on the abscissa axis and the values of the characteristic (variants) on the ordinate axis.
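The points of a cumulate and an ogive for an interval series can be obtained as in the following sketch, which uses the interval boundaries and frequencies of Table 2.11. Variable names are illustrative, and drawing the curves themselves is left to any plotting tool.

```python
from itertools import accumulate

boundaries = [3, 5, 7, 9, 11, 13]     # interval boundaries from Table 2.11
frequencies = [10, 20, 40, 30, 15]    # number of families in each interval

accumulated = list(accumulate(frequencies))  # 10, 30, 70, 100, 115

# Cumulate: (upper boundary, accumulated frequency), plus the starting point
# (lower boundary of the first interval, 0).
cumulate_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], accumulated))

# Ogive: the same points with the axes swapped.
ogive_points = [(y, x) for x, y in cumulate_points]

print(cumulate_points)
print(ogive_points)
```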