As organizations ramp up their efforts to be truly data-driven, a growing number are investing in new data lakehouse architecture.
As the name implies, a data lakehouse combines the structure and accessibility of a data warehouse with the massive storage of a data lake. The goal of this unified data strategy is to give every employee access to data and artificial intelligence and use it to make better business decisions.
Many organizations clearly see the lakehouse architecture as the key to upgrading their data stacks in a way that provides greater data flexibility and agility.
Indeed, a recent Databricks survey found that nearly two-thirds (66%) of survey respondents use a data lakehouse. And 84% of those who don’t currently use one would like to.
“More companies are implementing data lakehouses as they combine the best features of both warehouses and data lakes, giving data teams more flexible and easier access to the most current and relevant data,” said Hiral Jasani, senior partner marketing manager at Databricks.
There are four main reasons why organizations adopt data lakehouse models, says Jasani:

Improve data quality (cited by 50%)
Increase productivity (cited by 37%)
Enable better collaboration (cited by 36%)
Eliminate data silos (cited by 33%)
How data quality and integration affect a data lakehouse architecture
A modern data stack built on the lakehouse solves data quality and data integration problems. It leverages open-source technologies and data management tools, and includes self-service tools to support business intelligence (BI), streaming, artificial intelligence (AI) and machine learning (ML) initiatives, explains Jasani.
“Delta Lake, an open, reliable, high-performance and secure layer of data storage and management for the data lake, provides the foundation and enables building a cost-effective, highly scalable lakehouse architecture,” said Jasani.
Delta Lake supports both streaming and batch operations, notes Jasani. It eliminates data silos by providing a single home for structured, semi-structured and unstructured data. This should make analytics simple and accessible to the entire organization. It enables data teams to incrementally improve the quality of their lakehouse data until it is ready for downstream use.
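The incremental quality improvement described here is commonly illustrated with Databricks' "medallion" layering: raw bronze data is cleaned into silver tables and then aggregated into gold tables for downstream use. A minimal Python sketch of that idea, with plain dictionaries standing in for Delta tables and all table and column names invented for illustration:

```python
# Sketch of incremental data refinement (bronze -> silver -> gold).
# Plain Python stands in for Delta tables; the data is hypothetical.

# Bronze: raw ingested records, including a duplicate and a bad row.
bronze = [
    {"order_id": 1, "amount": "120.50", "region": "EMEA"},
    {"order_id": 1, "amount": "120.50", "region": "EMEA"},  # duplicate
    {"order_id": 2, "amount": "not-a-number", "region": "AMER"},  # bad value
    {"order_id": 3, "amount": "75.00", "region": "AMER"},
]

# Silver: deduplicated, validated, typed records.
seen = set()
silver = []
for row in bronze:
    if row["order_id"] in seen:
        continue  # drop duplicates
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # a real pipeline would quarantine bad rows
    seen.add(row["order_id"])
    silver.append({"order_id": row["order_id"], "amount": amount,
                   "region": row["region"]})

# Gold: business-level aggregate ready for BI dashboards.
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0.0) + row["amount"]

print(gold)  # per-region revenue totals
```

In a real lakehouse, each layer would be a versioned table rather than an in-memory list, so every stage remains queryable and auditable.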
“Cloud also plays a big role in modernizing the data stack,” continues Jasani. “The majority of respondents (71%) say they have already adopted the cloud for at least half of their data infrastructure. And 36% of respondents cited multi-cloud support as one of the most important capabilities of a modern data technology stack.”
How silos and legacy systems hold back advanced analytics
The many SaaS platforms that organizations rely on today generate large amounts of insightful data. This can provide a huge competitive advantage if managed properly, says Jasani. However, many organizations use siloed, legacy architectures that can prevent them from optimizing their data.
“When business intelligence (BI), streaming data, artificial intelligence and machine learning are managed in separate data stacks, it creates more complexity and problems with data quality, scaling and integration,” emphasizes Jasani.
Legacy tools cannot scale to manage the increasing amount of data, and as a result, teams spend a significant amount of time preparing data for analysis rather than actually extracting insights from their data. The survey found that respondents spent an average of 41% of their total time on data analytics projects focused on data integration and preparation.
In addition, integrating data science and machine learning capabilities into the IT stack can be challenging, says Jasani. The traditional approach of setting up a separate stack for AI workloads no longer works due to the increased complexity of managing data replication across platforms, he explains.
Poor data quality issues affect almost all organizations
Poor data quality and data integration issues can have serious, negative consequences for a business, confirms Jasani.
“Almost all respondents (96%) reported negative business impacts from data integration challenges. These include reduced productivity due to increased manual work, incomplete data for decision making, cost or budget issues, data stuck and inaccessible, a lack of a consistent security or governance model, and a poor customer experience.”
Plus, there are even greater long-term risks of business loss, including disengaged customers, missed opportunities, brand equity erosion, and ultimately poor business decisions, says Jasani.
Related to this, data teams want to implement the modern data stack to improve collaboration (cited by 46%). The goal is a free flow of information that enables data literacy and trust within an organization.
“When teams can collaborate with data, they can share metrics and objectives to impact their departments. Using open source technologies also promotes collaboration as it allows data professionals to leverage the skills they already know and use tools they love,” says Jasani.
“Based on what we see and hear from customers in the marketplace, trust and transparency are cultural challenges that nearly every organization faces when it comes to effectively managing and using data,” continued Jasani. “If there are multiple copies of data in different places in the organization, it is difficult for employees to know which data is the latest or most accurate, resulting in a lack of confidence in the information.”
If teams can’t trust or rely on the data presented to them, they won’t be able to gain meaningful insights that they trust, Jasani emphasizes. Data stored in different business functions creates an environment where different business groups use separate sets of data, when they should all work from a single source of truth.
Data lakehouse models and advanced analytics tools
Organizations that consider lakehouse technology are usually those that want to implement more advanced data analytics tools. These organizations are likely to store raw data in many different formats on cheap storage, which makes it more cost-effective for ML/AI use, explains Jasani.
“A data lakehouse built on open standards offers the best of data warehouses and data lakes. It supports various data types and data workloads for analytics and artificial intelligence. And a common data store provides greater visibility and control over their data environment, enabling them to compete more effectively in a digital world. These AI-driven investments can deliver significant revenue growth and better customer and employee experiences,” said Jasani.
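The "various data types" point above can be illustrated with a short Python sketch that normalizes raw files landed in different formats (CSV and JSON here) into one tabular shape; the file contents, schema, and field names are all hypothetical:

```python
import csv
import io
import json

# Two hypothetical raw sources landed in the lake in different formats.
raw_csv = "user_id,event\n1,click\n2,view\n"
raw_json = '[{"user_id": 3, "event": "click"}]'

records = []

# Normalize CSV rows into a common schema.
for row in csv.DictReader(io.StringIO(raw_csv)):
    records.append({"user_id": int(row["user_id"]), "event": row["event"]})

# Normalize JSON rows into the same schema.
for row in json.loads(raw_json):
    records.append({"user_id": row["user_id"], "event": row["event"]})

print(records)  # one uniform record set from heterogeneous raw files
```

A real lakehouse does this at scale with schema enforcement and evolution built into the storage layer, rather than ad hoc parsing code.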
To achieve these capabilities and address data integration and data quality challenges, survey respondents reported that they plan to modernize their data stacks in several ways. These include implementing data quality tools (cited by 59%), open source technologies (cited by 38%), data management tools (cited by 38%) and self-service tools (cited by 38%).
One of the important first steps in modernizing a data stack is to build or invest in infrastructure that allows data teams to access data from a single system. This way everyone works with the same up-to-date information.
“To avoid data silos, a data lakehouse can be used as a single home for structured, semi-structured, and unstructured data, providing a foundation for a cost-effective and scalable modern data stack,” notes Jasani. “Companies can run AI/ML and BI/analytics workloads directly on their data lakehouse, which will also work with existing storage, data and catalogs, allowing organizations to build on current resources while having a future-proof governance model.”
There are also several factors that IT leaders need to consider in their strategy for modernizing their data stack, explains Jasani. They include whether they want a managed or self-managed service, product reliability to minimize downtime, high-performance connectors to ensure easy access to data and tables, timely customer service and support, and product performance capabilities to handle large amounts of data.
In addition, leaders should consider the importance of open, extensible platforms that provide streamlined integrations with their data tools of choice and enable them to connect to data wherever it resides, recommends Jasani.
Finally, Jasani says, “There is a need for a flexible and powerful system that supports diverse data applications, including SQL analytics, real-time streaming, data science and machine learning. One of the most common missteps is the use of multiple systems – a data lake, separate data warehouse(s) and other specialized systems for streaming, image analysis, etc. Having multiple systems makes it more complex and prevents data teams from accessing the right data for their use cases.”
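The single-system idea Jasani describes, one store serving both SQL analytics and data science workloads, can be sketched minimally in Python. Here an in-memory SQLite table stands in for a lakehouse table, and the table name and data are invented for illustration:

```python
import sqlite3

# One shared store (SQLite stands in for a lakehouse table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 120.5), ("AMER", 75.0), ("EMEA", 30.0)],
)

# BI/analytics workload: a SQL aggregate over the shared table.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
))

# Data science workload: derive a numeric feature from the same table,
# with no copy of the data into a separate system.
amounts = [a for (a,) in conn.execute("SELECT amount FROM sales")]
mean_amount = sum(amounts) / len(amounts)

print(totals)
print(mean_amount)
```

Because both workloads read the same table, there is no question of which copy of the data is current, which is the "single source of truth" benefit discussed earlier.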
This post, “Why data lakehouses are the key to growth and agility,” was originally published at https://venturebeat.com/data-infrastructure/more-organizations-see-data-lakehouses-as-the-key-to-growth-and-agility/