The term “Big Data” emerged at Silicon Graphics Inc. around the late 1990s, although it did not become a widespread buzzword until 2011.
Big Data is a term used to describe huge datasets that consist of both structured and unstructured data. These datasets can be very complex, but various tools and techniques make it possible to collect, store, cleanse, and extract the data so that it can be analyzed. The analyzed data can offer great benefits to many industries.
Companies in every industry have a strong interest in knowing what people want. For example, a television company might want to know which types of programs people like to watch, so it could stream data from a live feed such as Facebook or Twitter. As the internet has grown, people are communicating at a fast rate, and large volumes of data are being produced. Big Data has three main attributes, known as the three Vs – Volume, Variety, and Velocity.
These attributes are described in the table below:
|Volume, in relation to Big Data, means the size of the data, which can range from terabytes to petabytes. One terabyte holds roughly as much data as 1,500 CDs. Volume is the main attribute of Big Data because the datasets can be massive.
|Variety is the structural makeup of the dataset. A dataset can be built from various types of data, from structured to unstructured. Structured data follows a predefined format, such as the rows and columns of a database table. Unstructured data, such as free text, images, or video, has no predefined format. A dataset can contain both types, and either type may include inconsistent, incorrect, or missing values. Various software tools are available to cleanse the data and correct such values.
|Velocity is the speed or frequency at which the data is generated. Big Data is not always collected in batches after the fact; it can also be collected in real time via streaming, for example from a live Twitter feed, so that the data is obtained as quickly as possible.
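The CD comparison in the Volume row can be checked with a little arithmetic. The sketch below assumes decimal units (1 terabyte = 10^12 bytes) and a CD capacity of 700 MB; neither figure is stated in the text, and the function name `cds_needed` is purely illustrative.

```python
# Assumptions (not from the article): decimal units and a 700 MB CD.
TERABYTE_BYTES = 10**12
PETABYTE_BYTES = 10**15
CD_BYTES = 700 * 10**6


def cds_needed(total_bytes: int, cd_bytes: int = CD_BYTES) -> int:
    """Return how many whole CDs are needed to hold total_bytes."""
    # Ceiling division: a partly filled CD still counts as one disc.
    return -(-total_bytes // cd_bytes)


print(cds_needed(TERABYTE_BYTES))  # 1429, close to the "1,500 CDs" figure
print(cds_needed(PETABYTE_BYTES))  # about 1.4 million CDs
```

Under these assumptions a terabyte needs about 1,430 CDs, and a petabyte a thousand times as many.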
The term Big Data refers to the data itself: its type, size, and rate. However, the data has little value until it goes through a process called Data Analytics.
Big Data Analytics
Analytics means using tools and techniques to analyze data and extract the relevant parts from datasets and streaming data. More specifically, Data Analytics describes the techniques used to examine and transform that data into relevant information, which can then be used to predict future trends.
The data can be queried to produce reports that offer a prediction or an interpretation of the data. For example, suppose a dataset is located on the most popular cars bought over the last five years. The dataset is first checked for inconsistencies such as missing or incorrect data, and then it is cleaned. The cleaned data can be displayed in a bar chart or graph to visualize the cars bought over the last five years. In short, Data Analytics turns the cleansed data into actionable information. There are various types of analytics: text analytics, audio analytics (LVCSR systems and phonetic-based systems), video analytics, social media analytics, social influence analysis, and predictive analytics.
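The car example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records in `raw_sales` are invented, the model names are placeholders, and the helper functions `clean` and `text_bar_chart` are hypothetical names, not part of any particular analytics tool. A plain-text chart stands in for a real plotting library.

```python
from collections import Counter

# Hypothetical raw records: (year, car model). None or "" marks missing data.
raw_sales = [
    (2019, "Hatchback A"),
    (2019, "hatchback a "),   # inconsistent case and spacing
    (2020, None),             # missing model
    (2020, "Saloon B"),
    (None, "Saloon B"),       # missing year
    (2021, "Hatchback A"),
    (2021, "Estate C"),
    (2021, ""),               # empty model
]


def clean(records):
    """Drop records with missing fields and normalise model names."""
    cleaned = []
    for year, model in records:
        if year is None or not model or not model.strip():
            continue  # skip incomplete entries
        cleaned.append((year, model.strip().title()))
    return cleaned


def text_bar_chart(counts):
    """Render counts as a plain-text bar chart, most popular first."""
    lines = []
    for model, n in counts.most_common():
        lines.append(f"{model:<16} {'#' * n} ({n})")
    return "\n".join(lines)


cleaned_sales = clean(raw_sales)
model_counts = Counter(model for _, model in cleaned_sales)
print(text_bar_chart(model_counts))
```

Three of the eight invented records are dropped as incomplete, and the two spellings of the first model are merged before counting, which is the step that turns raw data into something a chart can show reliably.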
Relationship between Big Data and Data Analytics
Every relationship has a bond, and data is the bond that connects Big Data and Data Analytics. Data Analytics would not be possible without Big Data, because Big Data is the first stage in the Data Science process. Equally, Big Data, or more precisely the datasets, has little relevance until the data is processed or analyzed. The analytics side of the relationship turns the data into useful information that can predict future trends. With the correct techniques and tools, this relationship can produce extremely valuable information.
Machine learning is the element of Data Science that uses algorithms to build data models. It is a branch of artificial intelligence that allows a computer to learn without being explicitly programmed. Machine learning makes it possible to develop programs that can grow and change when new data is added: the program detects patterns in the data and adjusts its behavior accordingly.
Machine learning algorithms fall into three categories: supervised, unsupervised, and semi-supervised. Supervised algorithms are given both input and output variables and learn a mapping from input to output. Supervised learning can be subdivided into classification and regression. Unsupervised algorithms have input variables with no corresponding output variables and can also be subdivided into two categories: clustering and association. Semi-supervised learning sits between the two, using data in which only some of the inputs have corresponding outputs. This project will use supervised machine learning algorithms along with unsupervised ones such as clustering.
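The two kinds of supervised learning can be illustrated with a minimal sketch using only the Python standard library. The techniques chosen here, a one-nearest-neighbour classifier and an ordinary least-squares line fit, are simply convenient examples and are not necessarily the algorithms this project uses. The training data and the names `classify_1nn` and `fit_line` are hypothetical.

```python
import math

# Hypothetical labelled training data: (feature_1, feature_2) -> label.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def classify_1nn(point, labelled):
    """Classification: return the label of the closest training example."""
    nearest = min(labelled, key=lambda pair: math.dist(point, pair[0]))
    return nearest[1]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Classification predicts a category for an unseen input.
print(classify_1nn((152.0, 52.0), training))

# Regression predicts a number; here, an invented value for period 5.
periods = [1.0, 2.0, 3.0, 4.0]
values = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(periods, values)
print(slope * 5 + intercept)
```

In both cases the algorithm is given inputs together with known outputs and learns the mapping between them, which is what distinguishes supervised from unsupervised learning.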
Data Terms used in Big Data
- Data Mining – Data mining, also known as knowledge discovery in databases (KDD), is a more in-depth method of analyzing data from different dimensions.
- Data Cleansing – Data cleansing is the cleaning of data within datasets or other large volumes of data. When data is collected or recorded at scale, there is always some error or inconsistency. To cleanse the data, each entry must be checked for missing or incorrect values. This can take a long time, but software is available to speed up the process.
- Clustering/Cluster Analysis – This method groups similar data items together, so that items in the same cluster are more alike than items in different clusters. Each cluster is then observed and analyzed as a single group.
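As an illustration of cluster analysis, the sketch below implements plain k-means, one common clustering method, on a handful of invented two-dimensional points. The choice of k-means, the starting centroids, and the function name `k_means` are assumptions made for the example; the text above does not prescribe a specific algorithm.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No labels are supplied: the algorithm discovers the two groups from the distances between points alone, and each resulting cluster can then be analyzed as a unit.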