Big Data: what it is in simple terms - characteristics of big data technology and methods for processing it


If we define big data in simple terms, it is the general name for large flows of information together with the technologies, processing methods and analysis systems built around them. Big data is processed with software tools that have become an alternative to traditional databases and Business Intelligence solutions. All of this work is aimed at structuring the data and drawing new conclusions from it.

What it is

IT steadily fills the space around us. The information it produces cannot simply go "nowhere", and given its colossal volume, the storage for it must be equally vast. Humanity long ago switched to digital media, which come in all sizes.

Working with large amounts of information requires a special set of tools and techniques for solving specific problems. In essence, this combination of diverse data and the tools for working with it is what the term Big Data denotes.

This socio-economic phenomenon is directly related to the emergence of scalable technologies that make it possible to work with huge amounts of information.

Which companies are involved in big data?

Cellular operators and search engines were the first to work with big data. Search engines handle ever more queries, and text is "heavier" than numbers: working through a paragraph of text takes longer than processing a financial transaction. Yet the user expects the search engine to handle a request in a split second; taking even half a minute is unacceptable. That is why search engines were the first to parallelize their data processing.

A little later, financial organizations and retail joined in. Their individual transactions are not especially voluminous; big data arises because there are so many of them.

The amount of data is growing for everyone. Banks, for example, have always had a lot of data, but it did not always call for big data techniques. Then banks began working more actively with customer data: they devised more flexible deposits, loans and tariffs, and started analyzing transactions more closely. That already required fast ways of working with data.

Now banks want to analyze not only internal information but also third-party data. They want to receive big data from retail and learn what a person spends money on, so they can make targeted commercial offers.

All of this information is now interconnected: retailers, banks, telecom operators and even search engines are interested in each other's data.

The difference in the methods used

There are two main approaches to analytics, and their strategies differ radically:

Traditional approach | Modern (big data) approach
Analyzing small blocks of information | Processing the entire array of information at once
Editing and structuring the data | Working with sources as they are
Developing and testing hypotheses | Searching for relationships across the entire flow until a result is achieved
Step by step: collection, storage, analysis | Real-time analytics

VVV - the key features of big data

To make definitions in the Big Data field less blurry, a set of features has been formulated that big data must satisfy. Each begins with the letter V, so the system is called VVV:

• Volume. The amount of information is measurable.

• Velocity. The volume of information is not static: it grows constantly, and processing tools must take this into account.

• Variety. Information does not arrive in a single format: it can be unstructured, partially structured or fully structured.

As the industry developed, further Vs were added to these three, for example veracity (reliability), value, or viability.

But for a basic understanding the first three are enough: big data is measurable, constantly growing, and heterogeneous.


Neural networks and pattern recognition

Visual image recognition is handled by artificial neural networks (ANNs): mathematical models, implemented in hardware and software, that imitate the neural networks of living organisms. All neural networks operate on the same principle: data arrives at the input, passes through the neurons, and a result comes out at the output.

The method is used to solve problems in social and professional spheres: security, forecasting, classification and so on. The technology can replace the work of dozens of people.

One example of image recognition with neural networks is telling photos of men apart from photos of women.

To do this you will need (a minimal code sketch follows the list):

  • Build a neural network, i.e. program artificial neurons to perceive input data and form connections.
  • Feed the network a cleaned sample of the information flow: a database of photographs labeled as female or male faces. This trains the network so that it learns the criteria by which faces differ.
  • Run a test: send the network a new cleaned sample of faces, this time without labels. During testing you can measure the error rate.
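
Below is a minimal sketch of these three steps in Python. It assumes the face photos have already been converted to fixed-length feature vectors; the "photos" and labels here are synthetic stand-ins, purely to illustrate the build/train/test flow, not a production recognizer.

```python
# Hedged sketch: build a small neural network, train it on labeled
# "faces", then test it on an unseen sample and measure the error rate.
# The feature vectors and labels below are synthetic stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))           # 1000 "face" feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels: 0 = female, 1 = male

# Labeled sample for training, held-out sample for the test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 1-2: build the network and train it on the labeled photos
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# Step 3: run the test on new faces and measure the error frequency
print("error rate:", 1 - net.score(X_test, y_test))
```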

History of origin

The phenomenon was first described in 2008 by Clifford Lynch in an article in the journal Nature. By his definition, big data is any heterogeneous data arriving in volumes of more than 150 GB per day.

According to analytical agencies, in 2005 the world operated on some 4-5 exabytes of data (4-5 billion gigabytes). By 2010 the figure had grown to 0.20 zettabytes (1 ZB equals 1024 EB). At that time the "big data" approach was viewed only from a scientific and analytical standpoint and was hardly applied in practice, yet the unstructured array kept growing inexorably. Two years later, in 2012, the figure reached 1.8 ZB; the storage problem became pressing and interest surged. By the beginning of 2015 it had risen to 7 ZB. "Digital giants" such as Microsoft, IBM, Oracle and EMC, as well as universities, became actively involved in developing the field, bringing the applied sciences (engineering, physics, sociology) into practice.

Prospects for using Big Data

Blockchain and Big Data are two developing, complementary technologies. Blockchain, widely discussed in the media since 2016, is a cryptographically secured, distributed database technology for storing and transmitting information. Protecting private and confidential information is a current and future big data problem that blockchain can help solve.

Almost every industry has begun investing in Big Data analytics, but some invest more than others. According to IDC, the biggest spenders are banking, discrete manufacturing, process manufacturing and professional services. According to Wikibon research, global revenue from sales of big data software and services amounted to $42 billion in 2018 and will exceed the $100 billion mark by 2027.

Neimeth estimates that blockchain will account for up to 20% of the overall big data market by 2030, generating up to $100 billion in annual revenue. This exceeds the profits of PayPal, Visa and Mastercard combined.

Big Data analytics will be important for tracking transactions and will allow companies using blockchain to identify hidden patterns and find out who they are interacting with on the blockchain.

Main goals

Function | Task
Big Data: a stream of raw data | Storage and operation
Data Mining: structuring data as a way to identify patterns | Creating a unified structure from the discovered connections to arrive at a common meaning
Machine learning: machines learning from the data that appears in the process | Analysis and forecasting

Later, the concept of deep learning, powered by artificial intelligence, appeared.

Technology used

The information field is processed so that users receive a specific result they can put to effective use. In other words, a person should end up with the most useful information about objects or phenomena and be able to weigh the pros and cons before deciding what to do next. Artificial intelligence builds an approximate model of the future, offers several options, and then tracks the results achieved.

Analytics agencies run simulation programs to test different ideas: the system suggests and delivers a ready-made solution to the problem, with every step fully automated. In this sense, Big Data can fairly be called the modern replacement for traditional analytical methods.

The sources are:

  • Internet (social networks, online stores, articles, forums);
  • corporate resources - business archives and active databases;
  • indicators from devices - sensors, electronic devices, weather data.

At the same time, despite their differences, these sources are unified and integrated, with the further aim of extracting new knowledge.

You should remember the main rule, VVV, which characterizes big data:

  • Volume: the measurable amount of data occupying physical space on a medium. The "big" prefix implies an incoming array of more than 150 GB per day.
  • Velocity: regular, real-time updates made possible by intelligent technologies.
  • Variety: complete or partial lack of structure; heterogeneity.

Over time, these features were supplemented by two more:

  • Variability: the flow changes with external circumstances; uncontrollable surges and dips are often tied to periodicity.
  • Value: data varies in significance and complexity, which can make it harder for artificial intelligence to process. The degree of significance has to be determined first, and only then comes the structuring stage.

For the system to run without interruption, three fundamental conditions must be met at once:

  • horizontal scalability: the number of servers can grow without degrading performance;
  • fault tolerance: there must be enough storage media and machines that the failure of one node does not bring the system down;
  • locality: a dedicated place for storing and processing the data, close to where it resides, saving time and resources.

How Big Data is collected

Sources may be:

  • Internet - from social networks and media to the Internet of Things (IoT);
  • corporate data: logs, transactions, archives;
  • other devices that collect information, such as smart speakers.

Collection. The process of discovering useful patterns in the collected data is called data mining.

Tools used to work with the collected data include, for example, Vertica, Tableau, Power BI and Qlik. The data itself can arrive in different formats: plain text, Excel tables, SAS files.

During collection, the system can turn up petabytes of information, which are then processed with intelligent analysis methods that identify patterns. These include neural networks, clustering algorithms, algorithms that detect associative links between events, decision trees, and other machine learning methods. One of them is sketched below.
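
As an illustration, here is a minimal sketch of one of the methods named above, a decision tree. The tiny customer churn dataset is invented purely for the example.

```python
# Hedged example of one intelligent analysis method: a decision tree
# that finds a pattern in a tiny, invented customer dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [monthly spend, visits per month]; label: 1 = churned, 0 = stayed
X = [[120, 10], [80, 2], [300, 15], [40, 1], [250, 12], [30, 1]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["spend", "visits"]))  # learned rules
print(tree.predict([[60, 2]]))  # prediction for a new customer
```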

Briefly, the collection and processing pipeline looks like this (a code sketch follows the list):

  • the analytical program receives a task;
  • the system gathers the necessary information and prepares it along the way: it removes irrelevant records, clears out garbage and decodes the data;
  • a model or algorithm for the analysis is selected;
  • the program trains the algorithm and analyzes the patterns it finds.
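
A hedged sketch of this four-step pipeline in Python, using pandas for the preparation step and k-means clustering as the chosen algorithm; the column names and sample values are illustrative assumptions, not part of any specific product.

```python
# Sketch of the pipeline above: receive a task, clean the data,
# pick an algorithm, then look at the patterns it finds.
import pandas as pd
from sklearn.cluster import KMeans

# 1. The task: segment customers by purchase amount (invented data)
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, None],
    "amount": [120.0, 80.0, 80.0, None, 430.0, 55.0],
})

# 2. Prepare the data: remove irrelevant rows, garbage and duplicates
clean = df.dropna().drop_duplicates().copy()

# 3. Select a model/algorithm for the analysis - here, k-means
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# 4. Run the algorithm and inspect the patterns (customer segments) found
clean["segment"] = model.fit_predict(clean[["amount"]])
print(clean)
```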

Where to find application

The more information we have about objects and phenomena, the more likely an accurate forecast of the future becomes. It goes without saying that Big Data is in greatest demand in business and marketing, but that is far from its only practical use. Big Data is being actively implemented in the following areas:

  • Medicine and healthcare. The growing amount of information about diseases, treatments and drugs makes it possible to overcome illnesses that often proved fatal in the past.
  • Preventing the severe consequences of man-made and natural disasters. Data is collected from a wide variety of sensors with precise location tracking; forecasts of this kind can save thousands of lives.
  • Law enforcement. Agencies use data to anticipate rises in criminal activity and take preventive measures suited to the situation.

For business automation, we also offer equipment that can greatly facilitate most routine tasks and simplify the work process.

What is it used for?

The more we know about a specific object or phenomenon, the more accurately we grasp its essence and can predict its future. By capturing and processing data streams from sensors, the Internet and transactional operations, companies can predict product demand quite accurately, and emergency services can prevent man-made disasters. Here are a few examples, outside business and marketing, of how big data technologies are used:

  • Healthcare. More knowledge about diseases, more treatment options, more information about medications - all this makes it possible to fight diseases that were considered incurable 40-50 years ago.
  • Prevention of natural and man-made disasters. The most accurate forecast in this area saves thousands of lives. The task of intelligent machines is to collect and process many sensor readings and, based on them, help people determine the date and location of a possible cataclysm.
  • Law enforcement agencies. Big data is used to predict surges in crime in different countries and to take deterrent measures where the situation requires it.

Methods of analysis and processing

Working with big data means handling a huge, constantly updated information field using the following methods:

  • deep analysis, splitting the data into separate small groups using specialized mathematical algorithms;
  • crowdsourcing: receiving and channeling information flows from many different sources, whose number is limited only by processing capacity, not by count;
  • split tests: comparing elements from a starting point to the point of change in order to identify the factors with the greatest impact and obtain the most accurate result (a sketch follows the list);
  • forecasting: introducing new parameters and then verifying behavior once a large array arrives;
  • machine learning: artificial intelligence absorbing and processing knowledge and then using it for independent learning;
  • analysis of online activity to segment the audience by interest, location, gender, age and other parameters.
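
For the split-test item above, here is a minimal sketch of a two-proportion z-test, the usual statistic behind an A/B comparison; the conversion counts are invented for the example.

```python
# Hedged sketch of a split (A/B) test: compare conversion in a control
# group and a variant group with a two-proportion z-test (stdlib only).
from math import sqrt
from statistics import NormalDist

def ab_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return the z-score and two-sided p-value for the difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented numbers: 120/2400 conversions before vs 165/2500 after the change
z, p = ab_test(120, 2400, 165, 2500)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p suggests a real effect
```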

MapReduce

We've already written about MapReduce on Habr (once, twice, three times), but since this series of articles aims to be a systematic treatment of Big Data topics, the first article cannot do without MapReduce.

MapReduce is a distributed data processing model proposed by Google for processing large amounts of data on computer clusters.

MapReduce assumes that the data is organized into records. Processing occurs in three stages:

1. The Map stage. The data is preprocessed using a user-defined map() function. The job of this stage is to preprocess and filter the data; the operation is very similar to map in functional programming languages, where a user-defined function is applied to each input record.

Applied to a single input record, the map() function produces many key-value pairs. "Many" means it may return just one record, nothing at all, or several key-value pairs. What goes into the key and the value is up to the user, but the key is very important: data with the same key will later end up in the same instance of the reduce function.

2. The Shuffle stage. It passes unnoticed by the user. At this stage the output of the map function is "sorted into baskets", each basket corresponding to one output key of the map stage. These baskets then serve as the input for reduce.

3. The Reduce stage. Each "basket" of values produced at the shuffle stage goes to the input of a user-defined reduce() function, which computes the final result for that basket. The set of all values returned by reduce() is the final result of the MapReduce task.
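
The classic word-count example makes the three stages concrete. Below is a single-process sketch of the model; a real cluster (Hadoop, for example) would run many map and reduce instances in parallel on different machines.

```python
# Single-process sketch of MapReduce word counting: map emits
# (key, value) pairs, shuffle groups them into baskets by key,
# reduce computes the final result for each basket.
from collections import defaultdict

def map_fn(record):
    # Map: one input record -> many (word, 1) pairs
    for word in record.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Reduce: the final result for one basket
    return key, sum(values)

records = ["Big Data is big", "data about data"]

# Shuffle: sort the map output into baskets, one basket per key
baskets = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        baskets[key].append(value)

result = dict(reduce_fn(k, v) for k, v in baskets.items())
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```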

A few additional facts about MapReduce:

1) All runs of the map function work independently and can run in parallel, including on different machines in the cluster.

2) All runs of the reduce function work independently and can run in parallel, including on different machines in the cluster.

3) Shuffle is internally a parallel sort, so it can also run on different machines in the cluster. Points 1-3 make it possible to implement the principle of horizontal scalability.

4) The map function is usually run on the same machine where its data is stored, which reduces data transfer over the network (the principle of data locality).

5) MapReduce always performs a full scan of the data; there are no indexes. This means MapReduce is a poor fit when an answer is needed very quickly.

Solutions under development

Big data is about using the obtained information effectively, in a convenient and visual form, to solve applied problems. The main source is people, and the means of collection vary (social networks, media and so on). The data is used above all for analysis and then for creating products: consultations, goods or services, programs for optimizing resource consumption, forecasts. At the same time, it is important to protect servers from fraudulent manipulation and virus threats. Knowing the nature of the incoming information, a programmer can build unique platforms and barriers that protect against leaks.

How did the world develop?

The volume of information received grows exponentially every year: if in 2003 it was only 5 EB, by 2015 the figure had reached 6.5 ZB, and it keeps climbing. This newly acquired knowledge can safely be called a vital asset, and security fundamentals should be its foundation. The growing significance of the phenomenon can radically change the economic situation in the world, while even the most casual user is in constant contact with data-generating electronic devices.


Predictive Analytics

Predictive (also called predicative or forecasting) analytics makes a forecast based on accumulated information, answering the question "What might happen?". The forecast is produced with modeling methods, mathematical statistics, machine learning, data mining and the like.
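
A minimal sketch of the idea: fit a simple model to past observations and answer "what might happen?" for the next period. The monthly sales series is invented, and linear regression stands in for the heavier methods listed above.

```python
# Hedged sketch of predictive analytics: learn a trend from 12 months
# of (invented) sales and forecast month 13.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)   # months 1..12 as features
rng = np.random.default_rng(0)
sales = 100 + 8 * months.ravel() + rng.normal(0, 5, size=12)

model = LinearRegression().fit(months, sales)
forecast = model.predict([[13]])           # "what might happen" next month
print(f"forecast for month 13: {forecast[0]:.1f}")
```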

