Enjoy the benefits of coordinating the work of data scientists and analysts
Organizations can get more value from their data when data scientists and IT data analysts work together, and that includes sharing the data itself. Here are three ways to do this.
Data scientists come from a world of research and hypotheses. They develop queries in the form of big data algorithms, which can become quite complex and may not yield results until after many iterations. Their natural IT counterparts, data analysts, come from a different world of highly structured data work. Data analysts are used to querying data from structured databases and quickly seeing the results of their queries.
Understandable conflicts arise when data scientists and data analysts try to work together, as their working styles and expectations can be very different. These differences in expectations and methodologies may even extend to the data itself. When this happens, the IT data architecture is called into question.
“There are many historical differences between data scientists and IT data engineers,” said Joel Minnick, vice president of product marketing at Databricks. “The two main differences are that data scientists tend to use files, often containing semi-structured machine-generated data, and often have to respond to changes in data schemas. Data engineers work with structured data with one goal in mind (for example, a data warehouse star schema).”
From an architectural point of view, this means DBAs must place data for data scientists in file-oriented data lakes, while data for IT data analysts must be sorted into data warehouses that use traditional, often proprietary, structured databases.
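The contrast Minnick describes, files whose shape drifts over time versus a fixed target schema, can be illustrated with a small sketch. Everything here is hypothetical and for illustration only: the `TARGET_SCHEMA` columns, the `conform` helper, and the sample events are assumptions, not part of any Databricks product or of Minnick's remarks.

```python
# Hypothetical fixed target schema, like the columns of a warehouse table.
TARGET_SCHEMA = {"device_id": str, "temperature": float}


def conform(record, schema):
    """Split a semi-structured record into schema columns and leftover fields."""
    row = {}
    for column, column_type in schema.items():
        value = record.get(column)
        # Cast present values to the schema type; mark missing ones as None.
        row[column] = column_type(value) if value is not None else None
    # Fields the schema does not know about (a sign of schema drift).
    extras = {key: value for key, value in record.items() if key not in schema}
    return row, extras


# Machine-generated records whose shape changes over time.
old_event = {"device_id": "a1", "temperature": "21.5"}
new_event = {"device_id": "a1", "temperature": 22, "humidity": 0.4}

old_row, old_extras = conform(old_event, TARGET_SCHEMA)
new_row, new_extras = conform(new_event, TARGET_SCHEMA)
print(old_row, old_extras)
print(new_row, new_extras)
```

In this sketch the warehouse-style row always has the same columns, while the extra `humidity` field survives only in the leftovers, which is roughly the kind of change a file-oriented data lake absorbs without a schema migration.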
“Maintaining proprietary data warehouses for the business intelligence (BI) workloads that data analysts use and separate data lakes for data science and machine learning workloads has led to a complicated and expensive architecture that slows down the ability to leverage data and confuses data governance,” Minnick said. “Data analytics, data science and machine learning must continue to converge, and therefore we believe the days of maintaining data warehouses and data lakes are numbered.”
This would certainly be good news for DBAs, who would be happy to manage only one data pool that all parties can use. In addition, converging the separate data silos could go a long way toward eliminating work silos between data science and IT groups, thereby promoting better coordination and collaboration.
As a single data repository that anyone could use, Minnick proposes a “data lakehouse” that combines both data lakes and data warehouses into one data repository.
“The lakehouse is a best-of-both-worlds data architecture that builds on the open data lake, where most organizations already store the majority of their data, and adds the transactional support and performance needed for traditional analysis without giving up flexibility,” said Minnick. “As a result, all major data use cases, from streaming analytics to BI, data science and AI, can be accomplished on a unified data platform.”
What steps can organizations take to migrate to this all-in-one data strategy?
1. Foster a collaborative culture between data scientists and data analysts that addresses both people and tools.
Because data science and IT data analysis groups have developed independently of each other, organizations may need to build team spirit and collaboration between the two.
On the data side, the objective will be to consolidate all the data in a single data repository. As part of the process, data scientists, IT data analysts, and the DBA will need to partner and collaborate on standardizing data definitions and determining which datasets to combine so that this standard platform can be built.
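As a rough illustration of what standardizing data definitions might look like in practice, the sketch below maps each team's field names onto shared standard names and then lists the fields both sources have in common. The data dictionary, the field names, and the sample records are all hypothetical assumptions made for this example.

```python
# Hypothetical shared data dictionary: standard name -> aliases used by each team.
DATA_DICTIONARY = {
    "customer_id": {"cust_id", "customerId", "customer_id"},
    "order_total": {"total", "order_amt", "order_total"},
}


def build_alias_map(dictionary):
    """Invert the dictionary so each alias points to its standard name."""
    alias_map = {}
    for standard, aliases in dictionary.items():
        for alias in aliases:
            alias_map[alias] = standard
    return alias_map


def standardize(record, alias_map):
    """Rename known fields to their standard names; keep unknown fields as-is."""
    return {alias_map.get(key, key): value for key, value in record.items()}


ALIAS_MAP = build_alias_map(DATA_DICTIONARY)

# One record as a data scientist's file might hold it, one as a warehouse row.
lake_record = {"customerId": 42, "total": 19.99, "device": "sensor-7"}
warehouse_row = {"cust_id": 42, "order_amt": 19.99}

standard_lake = standardize(lake_record, ALIAS_MAP)
standard_warehouse = standardize(warehouse_row, ALIAS_MAP)

# Fields both sources share after standardization are candidates for combining.
shared_fields = sorted(set(standard_lake) & set(standard_warehouse))
print(shared_fields)  # ['customer_id', 'order_total']
```

The point of the sketch is that the dictionary itself is the artifact the data scientists, analysts, and DBA would need to agree on; the code that applies it is the easy part.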
2. Consider creating an enterprise data center of excellence (CoE)
“Data science is a rapidly evolving discipline with an ever-growing set of frameworks and algorithms to enable everything from statistical analysis and supervised learning to deep learning using neural networks,” Minnick said. “The CoE will act as a forcing function to ensure communication, the development of best practices and that data teams move towards a common goal.”
From an organizational point of view, Minnick recommends that the CoE be placed under the leadership of a data manager.
3. Link the unification effort of data science and data analysts to the business
A set of shared goals and data can contribute to a stronger and more integrated corporate culture. These synergies can accelerate the achievement of business results, and it’s a win for everyone.
“For organizations to get the most from their data, data teams need to work together rather than data scientists and data engineers each operating in their own silos,” said Minnick. “A unified approach like a data lakehouse is a key factor in enabling better collaboration, as all members of the data team are working on the same data rather than siloed copies.”