What is Big Data Visualization?

It is predicted that by 2025 the value of data will increase tenfold. Virtually every branch of industry and business will generate vast amounts of data. The world will therefore see aggressive growth in data, and that data becomes a missed opportunity if it is not used. To make matters worse, data is being collected and stored faster than it can be turned into tangible decisions. With the help of ever-improving technology, visionaries are creating visualization methods that help turn raw data with no apparent value into informative data.

Big data has helped organizations optimize their businesses. With the abundance of data that organizations generate every day, the ability to turn that data into decisions effectively and efficiently is crucial. Analytics and visualization therefore go hand in hand in tackling the problems of big data. Hence, a new interdisciplinary research field, “Visual Analytics”, is being established, which aims to make the best possible use of information by combining intelligent data analysis with human visual perception. Visual analytics has been particularly useful to the two most common professional streams in the Big Data world: Data Science and Business Analytics.

Business Analytics – Business Analytics (BA) is defined as a data-centric approach that relies heavily on collection, extraction and analysis tools so that data can be used as a source of insight and as a basis for decisions, which in most disciplines are made by top management. Previously, BA was used to report what had happened in the past; nowadays, with the massive volume of data being generated, BA can exploit that data to predict the future and make breakthroughs.
Data Science – With Big Data, the need for reliable sources of information and business support systems has given rise to a new and widespread business application of Data Science. The art of data science is multifaceted: it combines computer science skills, advanced analytical and statistical skills, and knowledge of methods for visualizing data. Although there is no universally accepted definition of Data Science, it can be defined as a set of fundamental principles that support and guide the principled extraction of information and knowledge from data. First, visualization can help present a model that a data scientist has built to the reader. Data scientists often work with data that has hundreds of dimensions and no natural mapping onto a conventional chart, so standard visualizations such as bar charts will not work. Novel visualization techniques, such as Parallel Coordinates, are therefore usually used for this type of data. Second, visualization can help the process of Data Mining, in which the scientist aims to extract valuable information from raw data automatically through analysis algorithms. Visualization has been found to benefit this process and helps the analysis arrive at an optimal result, because it communicates the results of the automatic analysis appropriately, which are otherwise often presented in an abstract form.

Figure 1 below illustrates the role of visualization in data exploration.

In the Visual Analytics Process above, the collected data is transformed according to the stream. In the Business Analytics (BA) stream, the transformed data is mapped into a visualization, which the user processes into knowledge, usually in the form of decisions. That knowledge is then fed back into the data for continuous improvement and to enable the analyst to reach better conclusions in the future. In the Data Science (DS) stream, the transformed data is mined in order to build models that serve certain objectives; the overall approach to the data is problem-agnostic. Once models have been built, they need to be visualized as well, and a visualization can in turn lead to new models. This feedback loop between models and visualization helps reach the right outcome for the objectives. Furthermore, knowledge can come from either the visualization or the models themselves.

In general, visualization is a better and faster way to identify patterns, trends and correlations that would otherwise remain undetected in text or tables of numbers. Visualization also helps to approach a problem in a new and creative way that taps into human cognition to understand the information hidden in a huge amount of data. People can also interact with a visualization, which can be used to find more insights or to find the right questions.

Visualization Methods and Techniques

A poorly chosen visualization technique can completely ruin clear data and thus affect how the information is perceived by the user. A visualization method has to be able to take the challenges of the 3Vs (Volume, Variety and Velocity) and turn them into a fourth V, “Value”. Of these four elements, Volume refers to the immense amount of data generated by different types of devices, and a good visualization method should be able to cater to that volume. Variety refers to the combination of data sources, and the visualization method needs to be able to combine them to create tangible value. Velocity refers to the ability of devices to deliver data in real time as continuously updating streams, so a visualization method that can keep up with them is preferred. Lastly, Value refers to the opportunities that can be realized when the right visualization method is used. The methods and techniques presented here were therefore chosen based on how well they address the 4Vs of Big Data.

1. Bar and Line Graph

Bar and line graphs are among the oldest methods of visualization. Although they are not well suited to big data, they are still commonly used by business analysts to present data to stakeholders. A bar graph is a powerful tool for comparing discrete items across categories. It lets the user focus on individual values and compare them simply by comparing the heights of the bars. Because bar graphs emphasize individual values, they also make it easy to compare adjacent values. Bars are preferable to lines for encoding data on an interval scale, for example when a time series is divided into years and the graph is intended to support comparison between the individual values. A line graph, by contrast, is typically used to emphasize the overall shape of an entire series of values. The connecting line conveys continuity from one value to the next, which makes it very useful for showing trends and changes over time. Unlike a bar graph, the quantitative scale of a line graph does not necessarily have to begin at zero, so the line can fill the data region of the graph, making it more detailed and easier to read. It is easier to spot trends in a line graph than in a bar, pie or radar graph. One downside, however, is that it offers little understanding of the data beyond spotting a trend. Furthermore, due to its simplicity, the line chart is mostly used by business analysts rather than data scientists.
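The difference in how the two graphs treat the quantitative scale can be sketched in a few lines of Python. This is only an illustration: the function name `axis_range`, the `padding` parameter and the sample values are assumptions made for the example and do not come from any particular charting library.

```python
def axis_range(values, kind, padding=0.05):
    """Suggest (low, high) limits for the quantitative axis.

    kind="bar":  the axis includes zero, because bar heights are compared.
    kind="line": the axis hugs the data so the line fills the plot region.
    `padding` is a hypothetical margin, as a fraction of the data range.
    """
    lo, hi = min(values), max(values)
    if kind == "bar":
        return (min(0, lo), max(0, hi))
    margin = (hi - lo) * padding
    return (lo - margin, hi + margin)


monthly_sales = [980, 1000, 1020]  # made-up sample data
bar_limits = axis_range(monthly_sales, "bar")
line_limits = axis_range(monthly_sales, "line")
```

With these sample values the bar axis spans the whole range from zero, so the three bars look almost identical, while the line axis covers only the neighbourhood of the data and makes the upward trend visible.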

2. Scatter Plots

Scatter plots emphasize correlation in the data: the user can see whether, in what direction and to what degree two or more sets of quantitative values are correlated. They are also very useful for identifying trends in the data and for spotting outliers. An outlier is a data value that behaves differently from the other members of the data set. Because many data points can be shown at once, scatter plots can be used on larger and more complex data in data science. Using a scatter plot, a data scientist can plot the actual values in the raw data against the values predicted by a model. In terms of the 4Vs, scatter plots can handle a massive volume of data as well as a variety of data sources.
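As a rough sketch of the actual-versus-predicted comparison, the points that would stand out on such a scatter plot can also be flagged numerically. The rule used here (a residual more than two standard deviations from the mean residual) is one common convention chosen for illustration, not something prescribed by the text, and the function name and data are made up.

```python
from statistics import mean, pstdev


def flag_outliers(actual, predicted, threshold=2.0):
    """Return the indices of points whose residual (actual - predicted)
    lies more than `threshold` standard deviations from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # all residuals are identical, so nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r - mu) > threshold * sigma]


# Made-up values: the fifth observation is far from what the model predicted.
actual = [10.0, 12.0, 11.0, 13.0, 30.0, 12.5, 11.5, 12.0]
predicted = [10.5, 11.5, 11.0, 12.5, 12.0, 12.0, 12.0, 12.5]
outliers = flag_outliers(actual, predicted)
```

On a scatter plot of `predicted` against `actual`, most points would sit close to the diagonal, and the flagged point would appear well away from it.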

3. Tree Maps

Tree maps are used to present hierarchical information and offer many customization options across categories. Their inventor, Ben Shneiderman, emphasizes that the purpose of tree maps is to spot particular conditions within a large set of data rather than to compare quantities or rank items. Tree maps are popular among both business analysts and data scientists because they are very easy to understand and interpret. Their advantages include high space efficiency and high scalability: the method is suitable for large hierarchical structures, and the space efficiency allows additional graphical data to be combined into the tree structure, so the “Volume” and “Variety” of the 4Vs are not an issue. As for Velocity, tree maps can be made real-time and interactive.
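To make the space-filling idea concrete, here is a minimal sketch of the simple "slice-and-dice" layout, in which the split direction alternates at each level of the hierarchy. The dictionary format (`name`, `value`, `children`) and the sample tree are assumptions made for the example, and the sketch assumes every value is positive. Production tools often use other layouts, such as squarified tree maps.

```python
def size(node):
    """Size of a node: its own value for a leaf, the sum of its children otherwise."""
    children = node.get("children")
    if not children:
        return node["value"]
    return sum(size(child) for child in children)


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Give each leaf a rectangle (x, y, width, height) whose area is
    proportional to its value. Assumes all values are positive."""
    if out is None:
        out = {}
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, w, h)
        return out
    total = size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = size(child) / total
        if depth % 2 == 0:  # even depth: split along the width
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:               # odd depth: split along the height
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical hierarchy: one leaf and one group with two leaves.
tree = {"name": "root", "children": [
    {"name": "A", "value": 6},
    {"name": "B", "children": [
        {"name": "B1", "value": 3},
        {"name": "B2", "value": 1},
    ]},
]}
rects = slice_and_dice(tree, 0, 0, 100, 100)
```

Every leaf receives a share of the 100 by 100 canvas in proportion to its value, and no space is left unused, which is where the space efficiency mentioned above comes from.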

4. Parallel Coordinates (PCP)

This method plots individual data elements across many different dimensions and categories. It is very useful for multidimensional data, as it accommodates many variables, each on its own axis, for every data element. PCP is well suited to reading clusters and to outlier and change detection. PCP has been criticized, however, for its tendency to become cluttered and for not being easy to construct. This criticism can be mitigated with interactivity and a technique called “Brushing”, which highlights a selected line or collection of lines while the others fade out, filtering out the noise. Furthermore, PCP ticks all three of the V boxes: it handles high-density volume and variety, and it can also be used in real time.
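The two ideas in this section, one axis per variable and brushing, can be sketched without any plotting library. In this illustration each record becomes a polyline whose points are the record's values rescaled to the range 0 to 1 on each axis, and brushing is reduced to a range filter on one axis. The function names, field names and data are all hypothetical.

```python
def normalize(records, axes):
    """Rescale every axis to [0, 1] so that all axes can share one vertical
    scale. Returns one polyline (a list of heights) per record."""
    bounds = {a: (min(r[a] for r in records), max(r[a] for r in records))
              for a in axes}
    polylines = []
    for r in records:
        line = []
        for a in axes:
            lo, hi = bounds[a]
            # A constant axis has no spread; place its points mid-axis.
            line.append((r[a] - lo) / (hi - lo) if hi > lo else 0.5)
        polylines.append(line)
    return polylines


def brush(records, axis, low, high):
    """Return one flag per record: True if its value on `axis` falls inside
    the brushed range (drawn highlighted), False otherwise (drawn faded)."""
    return [low <= r[axis] <= high for r in records]


# Made-up records with three dimensions.
records = [
    {"price": 10, "weight": 2.0, "rating": 4},
    {"price": 30, "weight": 1.0, "rating": 5},
    {"price": 20, "weight": 3.0, "rating": 2},
]
axes = ["price", "weight", "rating"]
polylines = normalize(records, axes)
highlighted = brush(records, "price", 15, 30)
```

A renderer would draw each polyline across the parallel axes and use the `highlighted` flags to decide which lines stay prominent and which fade into the background.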

5. Tag Cloud

Many visualization tools are used to interpret quantitative data, and few are able to represent qualitative data. The tag cloud or word cloud (TC) is one of the tools that can tackle this problem. It is an effective way to evaluate text data by depicting it as a cloud of words, which is very useful in this day and age, when text data from social media and website searches is dominant. Usually, a tag cloud shows a fixed number of the most used tags in a defined area, and each tag’s popularity is shown by its font size. Other attributes, such as color, intensity or weight, can be used as further visual properties.
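The mapping from popularity to font size can be sketched as follows. The linear scaling between a minimum and a maximum point size is an assumption made for the example (logarithmic scaling is another common choice), and the function name, parameters and sample text are hypothetical.

```python
from collections import Counter


def tag_sizes(text, top_n=3, min_pt=10, max_pt=40):
    """Count the words in `text`, keep the `top_n` most frequent ones and
    map each count linearly onto a font size between min_pt and max_pt."""
    counts = Counter(text.lower().split()).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]   # count of the most frequent kept tag
    lo = counts[-1][1]  # count of the least frequent kept tag
    span = hi - lo
    sizes = {}
    for word, n in counts:
        scale = (n - lo) / span if span else 1.0
        sizes[word] = min_pt + scale * (max_pt - min_pt)
    return sizes


text = "data data data chart chart map"  # made-up input
sizes = tag_sizes(text)
```

Here the most frequent word receives the largest size and the least frequent of the kept words the smallest, with the others placed proportionally in between.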

6. Theme River

This visualization technique is designed to enable the identification of trends and occurrences, including unexpected ones, within themes or topics. It was designed to handle multivariate data over a long period of time, and it was originally based on the histogram. Both the histogram and Theme River (TR) use variations in width to represent variations in strength or degree of representation. However, TR simplifies the task of tracking individual themes by providing a continuous “flow” from one data point to the next.
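The "river" can be thought of as a set of stacked bands, one per theme, whose width at each point in time equals the strength of that theme. The sketch below computes the lower and upper edge of each band, assuming the river is centred on the horizontal axis. It does not smooth the edges between time steps, which a real implementation would do to produce the continuous flow. The function name and the data are made up for the illustration.

```python
def river_layers(series):
    """series maps each theme to a list of strengths, one per time step.
    Returns, for each theme, a list of (lower, upper) band edges per step."""
    themes = list(series)
    steps = len(series[themes[0]])
    layers = {t: [] for t in themes}
    for i in range(steps):
        total = sum(series[t][i] for t in themes)
        lower = -total / 2  # centre the whole river on the horizontal axis
        for t in themes:
            upper = lower + series[t][i]  # band width equals theme strength
            layers[t].append((lower, upper))
            lower = upper  # the next theme is stacked directly on top
    return layers


# Made-up strengths for two themes over three time steps.
series = {"politics": [2, 4, 6], "sports": [2, 2, 2]}
layers = river_layers(series)
```

In this example the "politics" band widens over time while the "sports" band keeps a constant width, so the river as a whole grows wider from left to right.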