According to Gartner: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Big data describes immense volumes of both structured and unstructured data. Data referred to as big data is so massive that it exceeds the processing power of traditional databases and software solutions. At the enterprise level, data that is too large, changes too rapidly, or is too difficult to process with existing tools is called big data. The literal meaning of the term obviously points to the volume of data, but that is not how everyone who deals with big data uses it. For instance, an organization might use the term for the large-scale data it has to manage, while that organization's service provider might use it for the technology it offers customers to process that data, according to Webopedia.
Normally, data on the order of petabytes or exabytes is considered big data. Data of such volume can come from many sources, chiefly social media and systems that record huge, rapidly changing streams of input; and of course there can be many others. Organizations dealing with such incomplete and hard-to-access data face challenges mainly in its operation and management, because publicly available tools and processes were not designed to handle data at this scale, according to Webopedia.
According to an article by Gartner's Doug Laney, big data can currently be defined by three Vs: volume, velocity and variety.
- Volume: the sheer size of the data.
- Velocity: the speed at which data flows in and changes.
- Variety: the range of data types and formats.
The challenge, however, remains how to extract the useful portion of big data — the part that can contribute to cost reduction and time efficiency for an organization. Amazon, one of the largest online retailers, is a well-known success story: it uses big data to power a recommender system that suggests products to its visitors.
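Amazon's actual system is proprietary, but the core idea of item-to-item recommendation can be sketched as co-occurrence counting over purchase histories. A minimal Python sketch, using entirely hypothetical basket data:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (one set of items per customer order)
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"pen", "book"},
]

# Count how often each pair of items is bought together
co = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co[(a, b)] += 1

# Recommend by listing the pairs involving "book", most frequent first
recs = sorted((pair for pair in co if "book" in pair), key=co.get, reverse=True)
print(recs)
```

A production system would normalize these counts and scale the computation across many machines, but the "people who bought X also bought Y" signal is the same.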
Every organization has its own strategy for managing big data. This article highlights some general approaches to handling it, along with their drawbacks:
- Squeezing the useful portion of big data into too little memory will eventually cause trouble, so it is better to increase the memory of the machines that process it.
- Storing objects on disk and analyzing them chunk by chunk can be an effective solution, but chunking complicates processing and may force the work to be parallelized.
- Data can be sampled if it cannot be processed in one go, but sampling risks affecting the quality of the overall solution's results.
- High-performance programming languages (such as C++ or Java) can be integrated with the existing infrastructure that handles the bulk of the data.
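The chunk-wise approach above can be sketched in Python: the file is read in bounded batches so memory use stays flat regardless of file size. The file name and contents here are illustrative stand-ins for a genuinely large dataset:

```python
import os
import tempfile

def chunked_sum(path, chunk_hint=1 << 16):
    """Sum integers from a file chunk by chunk, never loading it whole."""
    total = 0
    with open(path) as f:
        while True:
            lines = f.readlines(chunk_hint)  # read roughly chunk_hint bytes of lines
            if not lines:
                break
            total += sum(int(line) for line in lines)
    return total

# Demo on a small temporary file standing in for a huge one
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(i) for i in range(1000)))
    path = tmp.name

result = chunked_sum(path)
print(result)  # 499500, the sum of 0..999
os.remove(path)
```

Because each chunk is independent, the same loop body could be handed to multiple workers — which is exactly where the parallelization pressure mentioned above comes from.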
There are also recommendations at the application level for better, more mature and more cost-effective ways to manage big data:
- Column-oriented databases enable massive data compression compared to row-oriented databases, and they also offer much better query times for analytical workloads.
- Schema-less (NoSQL) databases focus on the storage and retrieval of large amounts of unstructured, semi-structured and structured data. They achieve their performance by relaxing some of the guarantees traditionally associated with conventional databases, such as strict read-write consistency, in exchange for scalability and distributed processing.
- MapReduce is a programming paradigm that executes jobs at very large scale across a considerable number of servers and clusters. It has several open-source implementations and related tools: Hadoop; Hive, an SQL-like bridge, originally developed by Facebook, that allows conventional BI applications to run queries against a Hadoop cluster; Pig, which is similar to Hive but uses a Perl-like scripting language; and Platfora, a platform that automatically converts users' queries into Hadoop jobs.
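The compression benefit of column orientation can be seen with a toy comparison. The records below are hypothetical, and JSON stands in for a real storage format; grouping all values of a field together already yields a smaller raw layout, and the resulting runs of similar values are also what real columnar engines compress so well:

```python
import json
import zlib

# Hypothetical repetitive records, as a real event log might produce
rows = [{"city": "NYC", "temp": 20 + i % 3} for i in range(1000)]

# Row-oriented layout: one complete record after another
row_bytes = json.dumps(rows).encode()

# Column-oriented layout: all values of each field stored together
cols = {k: [r[k] for r in rows] for k in ("city", "temp")}
col_bytes = json.dumps(cols).encode()

print(len(row_bytes), len(col_bytes))  # columnar layout is smaller even uncompressed
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```

Real column stores (e.g. in the Hadoop ecosystem, formats like Parquet or ORC) add per-column encodings on top of this basic layout advantage.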
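The schema-less idea can be illustrated with a toy in-memory document store. Unlike a relational table, each document may carry different fields, and no migration is needed when the shape of the data changes; all keys and field names here are hypothetical:

```python
# Toy document store: a key -> document mapping with no fixed schema
store = {}

# Documents with different shapes coexist without any schema change
store["user:1"] = {"name": "Alice", "age": 30}
store["user:2"] = {"name": "Bob", "tags": ["admin", "beta"]}
store["event:7"] = {"kind": "click", "ts": 1700000000}

# Retrieval is by key; fields a document never had are simply absent
doc = store["user:2"]
print(doc.get("age"))  # None, since this document has no "age" field
```

Real schema-less databases layer indexing, replication and distributed partitioning on top of essentially this key-to-document model, which is where the consistency trade-offs mentioned above come in.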
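The map, shuffle and reduce phases of the MapReduce paradigm can be demonstrated with the classic word-count example in plain Python, with no cluster involved; a framework like Hadoop distributes exactly these phases across machines:

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big value", "data velocity data volume"]

# Map phase: each document emits (word, 1) pairs
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle phase: group emitted pairs by key, as the framework does between phases
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: combine each group's values into a final count
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both phases parallelize naturally — which is the whole point of the paradigm.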
Most of the platforms described above are cloud-supported. As storage requirements grow with the volume of data, storage techniques such as data compression and storage virtualization become necessary. SkyTree is one data-analytics platform that focuses specifically on handling big data, as mentioned by TechRepublic. Today, every enterprise needs a command of big data handling. Some basic questions are: what does big data mean to an enterprise? How can handling it efficiently benefit the enterprise? And how can big data be used in marketing to add business value? As discussed above, answering these questions at the enterprise level can contribute substantially to the value of a business.