It’s not uncommon to see data described as the currency of the modern digital economy, suggesting a level of importance and value far beyond what it held in previous eras. Getting to this point did not happen overnight, but rather through a series of developments on many fronts of the data story, such as ubiquitous connectivity and more available compute power for number crunching. Taken together, these factors have ushered in the era of data. The concept of big data has emerged as a new way to handle new forms of data, but it’s best understood as part of a larger data strategy.
Unique Characteristics of Big Data
As big data practices have matured, the definition of big data has also evolved. Originally, big data was marked by “the three V’s”: volume, variety, and velocity. Over time, additional characteristics have been added to the definition in an attempt to describe a complete data management strategy. Variables such as veracity, validity, and value should certainly be considered when trying to extract insights from data, but many of these characteristics could easily be applied to any type of data strategy. In terms of defining a threshold for big data, the original three traits still work well.
However, focusing on a threshold for big data can obscure the fact that these traits apply across the full spectrum of data strategies. At its core, big data refers to the use of new technology tools to handle data that existing tools cannot. Many companies working to improve their data practices can still grow using traditional tools. The three V’s can describe the threshold where new big data tools become necessary, but they can also describe other changes companies may make:
- Volume: For many years, the primary tool companies used for data collection and analysis was a relational database. While disk space was a primary bottleneck, there were also limits on the computations a database could perform. Generally, relational databases can handle datasets in the hundreds of gigabytes; as datasets grow into the terabyte range, companies typically need to shift to non-relational systems. However, consider a company moving from 250GB of data to 500GB. Doubling the data may call for a new storage strategy or better tools, but those tools may still be relational databases.
- Variety: Another characteristic of relational databases is that they operate on highly structured data. The procedures that manipulate and analyze the data expect specific formats, and only a small percentage of overall data can be represented in such a way. Non-relational databases, along with data manipulation algorithms that take advantage of advanced computing power, offer a mechanism for capturing unstructured data and extracting value. Before venturing into unstructured data, companies may have an opportunity to gather more structured data from different parts of their operations, which will help refine data practices before juggling multiple formats.
- Velocity: As data drives more decisions, there is growing pressure to make those decisions quickly. In many cases, this is a matter of enhancing business offerings or customer service. In some cases, though, real-time decisions are far more critical (consider examples in healthcare or public safety). Real-time data collection and analysis remains a significant challenge even for data-savvy companies due to network latency and processing time, so most firms will not need to push the envelope that far. There can still be advantages, though, to building processes that drive faster data cycles.
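The structured-versus-unstructured contrast behind the variety point can be sketched in a few lines. This is an illustrative comparison only (the table, field names, and sample values are hypothetical): a relational table enforces its schema up front, while a document-style record is self-describing and can absorb irregular fields without a schema change.

```python
import json
import sqlite3

# Relational side: the schema is declared up front, and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp', 'Austin')")
row = conn.execute("SELECT name, city FROM customers").fetchone()

# Document side: each record carries its own structure, so irregular
# fields (a support transcript, raw sensor readings) need no schema change.
doc = {
    "id": 2,
    "name": "Globex",
    "support_transcript": "Caller reported intermittent outages...",
    "sensor_readings": [72.1, 71.8, 73.0],
}

print(row)              # fixed-shape result from the relational table
print(json.dumps(doc))  # flexible record serialized as JSON
```

Adding a new column to the relational table requires a schema migration for every existing row; the document record simply gains a key, which is why non-relational stores are often the first tool reached for when unstructured data arrives.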
Moving Toward Big Data
Regardless of where a company sits on the continuum of data usage, there are some steps that should be taken to ensure a solid foundation for an ongoing data strategy. Big data tools and techniques have limited use if a business does not have solid processes in place.
The first step is understanding all the data within the company. Most businesses report some degree of data silos (and some of the businesses that do not report data silos may simply be unaware of their existence). Modern data techniques typically assume that the full set of data is accessible so that connections can be made between different components. In order to get insights that will drive business growth, there must be full knowledge of how current data is handled and a robust plan for gathering any new data in the future.
As part of understanding the corporate data blueprint, a business must understand the way that data is stored. There are a wide variety of storage options available, from local datacenters to devices to a variety of cloud offerings. Again, the storage should ideally be tied together in some way, and different storage options should be used depending on how often the data might be needed.
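As a rough illustration of matching storage options to how often data is needed, here is a minimal sketch; the tier names and thresholds are invented for the example and do not correspond to any vendor’s actual offering or pricing model.

```python
def choose_storage_tier(accesses_per_month: int) -> str:
    """Pick an illustrative storage tier based on how often data is read.

    Thresholds and tier names here are hypothetical, for explanation only.
    """
    if accesses_per_month >= 100:
        return "hot"   # fast local or premium cloud storage for active data
    if accesses_per_month >= 10:
        return "warm"  # standard object storage for occasional access
    return "cold"      # archival storage: cheapest, but slow to retrieve

print(choose_storage_tier(500))  # frequently queried operational data
print(choose_storage_tier(2))    # old records kept mainly for compliance
```

The underlying point is simply that data expected to be touched rarely should not sit on the fastest, most expensive storage, and a deliberate tiering policy makes that trade-off explicit.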
Storage is just the first of many tools in the growing data toolbox. Many types of databases, analytic software, and visualization packages exist, each offering unique functionality for specific types of data. Based on the data a company currently holds, the data it expects to gather, and its goals for that data, it should choose the applications that best fit its data architecture.
Finally, security and privacy are crucial considerations in today’s environment. As data has become currency, a number of ethical questions have been raised regarding its use. Legacy security practices will be insufficient for data in a cloud and mobile world, and transparency about data collection will be a key factor in maintaining trust with customers and third parties.