From Data Warehouses and Data Lakes to Data Fabrics for Analytics
The evolution of data architecture for analytics started in earnest with relational data warehouses. While these systems have been effective at generating insight from historical data, and have therefore provided a basis for predictive modeling, they are ultimately not agile or responsive enough for the volume and variety of data that companies face today.
The data lake was the next advance in analytics architecture, primarily because it quickly accommodated the varied schemas and data types that organizations were processing at scale. But the way it accommodated this diversity left a lot to be desired. Because they are fundamentally enterprise file systems, data lakes typically turn into ungoverned data swamps that require extensive engineering before organizations can connect and query the data. As a result, a lot of time is spent wrangling data which, although physically co-located in the data lake, is still not connected according to its business meaning. Productivity suffers and insights are missed.
And while data lakehouses combine some of the best properties of data warehouses and data lakes, it is too early to make a serious judgment on their usefulness. Because they are ultimately indistinguishable from relational systems, they suffer from the same inability to deal with enterprise data diversity. Relational data models are simply not very good at handling data diversity.
Properly implemented data fabrics represent the latest evolution in analytics architecture. They dramatically reduce the effort data engineers, data scientists, and data modelers spend on data preparation compared to the aforementioned approaches, which are all based on the physical consolidation of data. With a clever combination of semantic data models, knowledge graphs, and data virtualization, a data fabric approach allows data to stay where it natively resides while providing uniform access to that data, which is now connected according to its business meaning, for timely query responses across clouds, on-premises environments, business units, and organizations.
This method reduces the complexity of data pipelines, lowers DataOps costs, and significantly shortens the time to analytical insight.
Knowledge graphs
Knowledge graphs play a critical role in the enhanced analytics that comprehensive data fabrics can provide to organizations. Their graph foundations are essential for discerning and representing complex relationships between the most diverse data sets, which greatly improves understanding. In addition, they readily align data of any variety (unstructured, semi-structured, and structured) within a universal graph construct, giving organizations sensible, streamlined access to the mass of data they face.
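As a rough illustration of aligning differently shaped data in one graph, the sketch below turns a CSV row and a nested JSON document into the same (subject, predicate, object) triple form. The sample data, the `customer:` prefix, and the function names are all hypothetical; this is a minimal sketch of the idea, not how any particular knowledge graph platform ingests data.

```python
import csv
import io
import json

# Hypothetical sample inputs: one structured (CSV), one semi-structured (JSON).
CSV_TEXT = "id,name,region\n1,Acme Corp,North\n"
JSON_TEXT = '{"id": 1, "tickets": [{"subject": "Late delivery"}]}'


def triples_from_csv(text, id_column="id", prefix="customer"):
    """Yield one triple per non-id column of each CSV row."""
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"{prefix}:{row[id_column]}"
        for column, value in row.items():
            if column != id_column:
                yield (subject, column, value)


def triples_from_json(text, prefix="customer"):
    """Yield triples for a nested document, giving each ticket its own node."""
    doc = json.loads(text)
    subject = f"{prefix}:{doc['id']}"
    for i, ticket in enumerate(doc.get("tickets", [])):
        ticket_node = f"{subject}/ticket:{i}"
        yield (subject, "has_ticket", ticket_node)
        for key, value in ticket.items():
            yield (ticket_node, key, value)


# Both sources land in one graph and share the same subject identifier.
graph = set(triples_from_csv(CSV_TEXT)) | set(triples_from_json(JSON_TEXT))
```

Because both sources resolve to the same subject, `customer:1`, the tabular attributes and the nested support tickets become connected facts about one entity.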
When querying customer data for training datasets appropriate for machine learning models, for example, knowledge graphs can detect relationships between individual and collective attributes that are beyond the capabilities of conventional relational approaches. They can also make intelligent inferences from semantic facts or business knowledge to create additional knowledge about a specific domain; for example, the dependency relationships between supply chain partners. Together, these capabilities mean companies know more about the significance of their data to specific business processes, outcomes, and analytical concerns, such as why some products sell better in summer in certain regions than in others, which inherently produces more relevant and meaningful results.
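To make the supply chain example concrete, here is a minimal sketch of one kind of inference: deriving indirect dependencies by treating a relationship as transitive. The partner names and the `depends_on` predicate are invented for illustration, and a production knowledge graph would typically rely on a reasoner or rule language rather than hand-written traversal.

```python
from collections import defaultdict

# Hypothetical asserted facts as (subject, predicate, object) triples.
TRIPLES = [
    ("RetailerA", "depends_on", "DistributorB"),
    ("DistributorB", "depends_on", "ManufacturerC"),
    ("ManufacturerC", "depends_on", "RawMaterialsD"),
    ("RetailerA", "located_in", "RegionNorth"),
]


def infer_transitive(triples, predicate):
    """Return all (subject, object) pairs implied if `predicate` is transitive."""
    graph = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            graph[s].add(o)

    closure = set()
    for start in list(graph):
        stack = list(graph[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # guards against cycles in the data
            seen.add(node)
            closure.add((start, node))
            stack.extend(graph.get(node, ()))
    return closure


def inferred_only(triples, predicate):
    """Return pairs that were derived rather than explicitly asserted."""
    asserted = {(s, o) for s, p, o in triples if p == predicate}
    return infer_transitive(triples, predicate) - asserted


new_facts = inferred_only(TRIPLES, "depends_on")
```

Here `new_facts` contains the indirect dependencies, such as RetailerA on RawMaterialsD, which no single source record states explicitly.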
Expressive semantic modeling
Semantic data models are richly detailed with real-world knowledge of concepts, events, and problems, expressed in terms that business users understand. Their ability to determine the relationships between data and to draw intelligent inferences about them makes them the backbone of semantic knowledge graphs. They also simplify the kinds of schema problems that monopolize the time of data modelers and engineers when preparing data for analytics with other approaches.
These data models evolve naturally to accommodate new types of data or business requirements, whereas relational data models, for example, typically require modelers to create new or updated schemas and then physically migrate the data to the new schema, a process that increases rather than decreases the time to insight. This advantage not only resolves data conflict issues, but also improves the real-world knowledge captured in data models, which in turn improves business users’ understanding of the relevance of data to analytics use cases.
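The contrast with schema migration can be sketched with a toy triple store, assuming a graph-shaped model in which a new attribute is simply a new kind of fact. The `TripleStore` class, the identifiers, and the `loyalty_tier` attribute are hypothetical and serve only to show that an existing query keeps working after the model grows.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the given pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def customer_names(store):
    """An 'existing' query written before the model was extended."""
    ids = [s for s, _, _ in store.match(p="type", o="Customer")]
    return sorted(o for cid in ids for _, _, o in store.match(s=cid, p="name"))


store = TripleStore()
store.add("customer:1", "type", "Customer")
store.add("customer:1", "name", "Acme Corp")
before = customer_names(store)

# New business requirement: record a loyalty tier. No schema change or
# data migration is needed; the new attribute is added as more facts.
store.add("customer:1", "loyalty_tier", "gold")
store.add("customer:2", "type", "Customer")
store.add("customer:2", "name", "Globex")
store.add("customer:2", "loyalty_tier", "silver")
after = customer_names(store)
```

The query in `customer_names` is untouched by the new attribute, whereas the equivalent relational change would usually involve altering a table and backfilling rows.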
Data virtualization
Finally, virtualization technologies are at the heart of enterprise data fabrics. They provide consistent representations of data that is accessed uniformly through an abstraction layer, regardless of where the data is physically located. In this approach, the data remains in the source systems but can still be accessed and queried in one place within this virtualized framework, or data fabric. This dramatically reduces the need for data replication, which is expensive and time-consuming with the typical ETL jobs of complex data pipelines and which drives up DataOps costs.
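A minimal sketch of this idea follows, assuming two hypothetical sources: an in-memory SQLite database standing in for a sales system and a list of records standing in for a CRM API. The `VirtualLayer` class and its method names are invented for illustration. The point is that each source is queried in place at request time and nothing is copied into a central store.

```python
import sqlite3


def make_sales_source():
    """Hypothetical sales system: an SQLite database queried in place."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 120.0), (2, 80.0), (1, 30.0)],
    )
    conn.commit()

    def fetch():
        cur = conn.execute(
            "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
        )
        return {cid: total for cid, total in cur}

    return fetch


def make_crm_source():
    """Hypothetical CRM system: records as they might come from an API."""
    records = [{"id": 1, "name": "Acme Corp"}, {"id": 2, "name": "Globex"}]

    def fetch():
        return {r["id"]: r["name"] for r in records}

    return fetch


class VirtualLayer:
    """One access point over several sources; data stays where it lives."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        self._sources[name] = fetch

    def revenue_by_customer(self):
        # Each source is queried at request time and joined on customer id.
        names = self._sources["crm"]()
        totals = self._sources["sales"]()
        return {names[cid]: totals.get(cid, 0.0) for cid in sorted(names)}


layer = VirtualLayer()
layer.register("sales", make_sales_source())
layer.register("crm", make_crm_source())
result = layer.revenue_by_customer()
```

A real virtualization layer would also push filters down to the sources and handle security and caching; this sketch only shows the connect-rather-than-copy pattern.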
Because it is based on the business meaning of the data rather than on where the data sits in the storage layer, the data fabric approach also reduces the spread of, and reliance on, data silo culture. It does this in part by connecting rather than consolidating data across complex enterprises and by supporting multi-tenancy at the schema layer. For example, individual business units such as marketing and sales teams can access their data through the same enterprise fabric with their analytics tool of choice, without creating additional silos and without moving and copying data through expensive data pipelines. Such functionality promotes data sharing, data reuse, and more comprehensive insight by connecting data across the enterprise: in multiple clouds, on-premises, and in edge environments.
The epitome of analytical architecture
The place we all want to be is one where business users can put data together to support better understanding. An analytics architecture that is fundamentally based on where data sits in the storage layer struggles to bring the business to that desired end state, especially given the rapid growth of hybrid and multi-cloud environments. Data fabrics are the cornerstone of modern analytics architecture, especially when implemented with data virtualization, knowledge graphs, and expressive data models. This combination of capabilities eliminates much of the time, cost, and labor spent on setting up data pipelines and DataOps. It does this by connecting data at the compute layer rather than physically consolidating it at the storage layer, which yields better contextualized, better connected meaning and therefore faster time to insight.
About the Author
Kendall Clark is Founder and CEO of Stardog, a leading provider of Enterprise Knowledge Graph (EKG) platforms.