In recent years many enterprises have begun experimenting with big data and cloud technologies to build data lakes and support a data-driven culture and data-driven decision-making, but these projects often stall or fail because the approaches that worked so well at internet companies cannot simply be adopted by the enterprise, and there is no comprehensive, practical guide to doing so successfully.
Big data, data science, and analytics fuel decision-making by bringing insights to every walk of life, from serving customers the right products to finding a cure for cancer, and all of them depend heavily on historical data. Companies recognize this fact and invest in building data lakes that bring all of their data together in one place and start saving history, so that data scientists and analysts can access the information they need for data-driven decision-making. Enterprise big data lakes bridge the gap between the freewheeling culture of modern internet companies, where data is core to all practices, everyone is an analyst, and most people can code and roll their own data sets, and enterprise data warehouses, where data is a precious commodity, carefully tended by data engineers and provisioned in the form of carefully prepared reports and analytic data sets.
For enterprise data lakes to be successful, they must provide three new capabilities:
- Cost-effective, scalable storage and computing, so large amounts of data can be stored and analyzed without incurring prohibitive computational costs.
- Cost-effective data access, so everyone can find and use the right data themselves, eliminating the labor cost associated with hiring software programmers to write queries for ad hoc data acquisition.
- Robust data governance and user access policies that allow tiered, secure, reliable access for different users based on their needs and expertise.
Hadoop, Spark, NoSQL databases, and elastic cloud-based systems are exciting new technologies that deliver on the first promise of cost-effective, scalable storage and computing. They continue to mature rapidly, stabilizing and becoming mainstream, and they face some growing pains along the way, as any new technology does. However, the other two requirements, cost-effective data access and robust governance, remain largely unaddressed. So, as enterprises race to stand up huge data clusters and load them with large data sets, they often end up with a data swamp rather than a data lake: a large repository of unusable data sets that are impossible to navigate or make sense of, and too unreliable to support quality business decisions.
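To make that first capability concrete, here is a minimal sketch of the kind of ad hoc analysis these platforms make affordable: Spark querying raw Parquet files directly out of a lake's object storage, with no upfront ETL. The bucket path and column names are hypothetical placeholders, not drawn from any particular deployment.

```python
# A minimal sketch of ad hoc, self-service analysis over a data lake.
# The bucket, path, and column names below are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc-lake-query").getOrCreate()

# Raw landing-zone data sits cheaply as files in object storage;
# Spark reads it in place, so no loading or schema work is needed first.
orders = spark.read.parquet("s3a://example-lake/raw/orders/")

# A throwaway analyst question: revenue by region for one year.
(orders
    .where(F.col("order_date").between("2019-01-01", "2019-12-31"))
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
    .orderBy(F.desc("revenue"))
    .show())
```

The point of the sketch is the economics: storage is just files, and compute is rented only for the minutes the query runs, which is what makes exploratory work like this viable at enterprise scale.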
A high-quality, effective data lake supports self-service: business users can find and use the data sets they need without having to rely on help from the IT department. It also aims to contain data that business users might conceivably want, even if no project requires it at the time.
So what does it take to have a successful data lake? As with any data project, aligning it with the company's business strategy and securing executive commitment and broad buy-in are musts.
In addition, based on our experience with companies that have deployed data lakes with varying degrees of success, we can identify three key prerequisites:
- The right platform
- The right data
- The right interfaces