Lessons learned from real-world cloud data lake deployments.

Photo by Jong Marshes on Unsplash

This past July, Dremio sponsored Subsurface, the industry’s first cloud data lake conference. The event explored the future of cloud data lakes and included discussions with open source and technology leaders at companies such as Netflix, Expedia, Preset, Exelon, Microsoft and AWS about their real-world experiences spearheading open source projects and building modern data lakes.

During the event, speakers shared details about their journeys of building modern cloud data lakes. They talked about their challenges, what worked, what didn’t, and, most importantly, how they succeeded. The following are the key takeaways.

Fundamental Principles of a Cloud Data Lake

The ability of cloud data lakes to become…


Photo by Donald Giannatti on Unsplash

Introduction

In today’s digital landscape, every company faces challenges around storing, organizing, processing, interpreting, transferring and preserving data. Because the volume and diversity of information keep growing, it is important to stay up to date and use cloud data infrastructure that meets your organization’s needs. The data architect is essential to this mission, as he or she can effectively solve issues related to:

  • cloud data sources
  • cloud storage
  • security
  • data structure
  • data processing
  • scalability and availability
  • speed of working with data
  • and much more

Data architects…


The explosion of data and the need for business agility to leverage that data for competitive advantage are driving a massive surge of data lake innovation. As with pretty much every other technology, that innovation is happening almost exclusively in the cloud, and it is coming from all angles: open source projects, commercial vendors, and of course the cloud providers themselves.

The industry has moved past first-generation Hadoop-based data lakes and is now squarely focused on building next-generation data platforms based on an open cloud data lake architecture. Organizations of all sizes have recognized that cloud data lakes, with separation…


Photo by Stephen Dawson on Unsplash

Explore how to analyze data and build informative graphs in a productive way using Python and Dremio

Intro

Data analysis and data visualization are essential components of data science. In fact, before the machine learning era, data science was largely about interpreting and visualizing data with different tools and drawing conclusions about its nature. These tasks are still present today; they have simply become part of a broader set of data science jobs. Very often, so-called EDA (exploratory data analysis) is a required step in the machine learning pipeline. It allows a better understanding of the data: its distribution, purity, features, etc. Visualization is also recommended for presenting the results of machine learning work to different stakeholders…
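To make the EDA step above concrete, here is a minimal sketch in Python using pandas and matplotlib. The file name "sales.csv" and the column "amount" are assumptions for illustration only; in the article’s workflow the data would instead be pulled from Dremio.

```python
# Minimal EDA sketch (assumes a local CSV "sales.csv" with a numeric "amount" column).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")   # hypothetical dataset

# Inspect structure, missing values, and basic statistics
print(df.info())
print(df.isna().sum())
print(df.describe())

# Visualize the distribution of one numeric column
df["amount"].hist(bins=30)
plt.title("Distribution of amount")
plt.xlabel("amount")
plt.ylabel("frequency")
plt.show()
```

Checking data types, null counts, and summary statistics before modeling is exactly the kind of "understanding of data, its distribution, purity, features" the paragraph describes.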


Photo by RKTKN on Unsplash

A data lake engine is an open source software solution or cloud service that provides critical analytical capabilities across a wide range of data sources through a unified set of APIs and a unified data model. Data lake engines address key needs such as simplifying access, accelerating analytical processing, securing and masking data, curating datasets, and providing a unified catalog of data across all sources.

The tools used by millions of data consumers, such as BI tools, data science platforms, and dashboarding tools, assume all data exists in a single, high-performance relational database. When data is in multiple systems…
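As a hedged illustration of that unified-access idea, the sketch below runs a single SQL query that joins a data lake file with a relational table through one connection. It assumes a data lake engine is already exposed via an ODBC DSN named "DataLakeEngine"; the DSN, credentials, and dataset paths are hypothetical and not taken from the article.

```python
# Minimal sketch: query multiple sources through a data lake engine's unified SQL layer.
import pyodbc  # pip install pyodbc

# Hypothetical DSN and credentials configured for the data lake engine
conn = pyodbc.connect("DSN=DataLakeEngine;UID=user;PWD=password", autocommit=True)
cursor = conn.cursor()

# One statement joins a lake dataset with a relational table,
# as if both lived in a single high-performance database.
cursor.execute("""
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM lake."orders.parquet" AS o
    JOIN postgres.public.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")

for row in cursor.fetchall():
    print(row.region, row.total_sales)
```

The point of the sketch is the experience it gives BI and data science tools: they see one SQL endpoint, even though the data lives in multiple systems.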


Photo by Tom Gainor on Unsplash

Explore how Kafka can be used for the storing of data and as a data loss prevention tool for streaming applications

Introduction

Apache Kafka is a streaming platform that allows for the creation of real-time data processing pipelines and streaming applications. Kafka is an excellent tool for a range of use cases. If you are interested in examples of how Kafka can be used for a web application’s metrics collection, read my previous article.

Kafka is a powerful tool in a data engineer’s toolkit. When you know how and where it can be used, you can improve the quality of your data pipelines and process data more efficiently. In this article, we will look at an example of how Kafka can be applied…
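To illustrate the data-loss-prevention angle mentioned above, here is a minimal producer sketch using the kafka-python client, configured so that records are acknowledged by all in-sync replicas and retried on transient failures. The broker address, topic name, and message payload are assumptions for illustration, not details from the article.

```python
# Minimal sketch of a Kafka producer configured to reduce the risk of data loss.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # hypothetical broker
    acks="all",                           # wait for all in-sync replicas to acknowledge
    retries=5,                            # retry transient failures instead of dropping records
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("events", {"user_id": 42, "action": "click"})  # hypothetical topic and event
producer.flush()  # block until buffered records are delivered
```

Because messages are retained on the brokers for a configurable period, downstream consumers can also replay them later, which is what lets Kafka double as short-term storage for streaming applications.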


Photo by Ilya Pavlov on Unsplash

This article demonstrates how Kafka can be used to collect metrics from a web application into data lake storage such as Amazon S3.

Metrics collection

Metrics are indicators (values) that reflect the state of a process or a system. When we have a sequence of values, we can also draw conclusions about trends or seasonality. In short, metrics show how a process or a system evolves over time. Metrics can be generated by applications, hardware components (CPU, memory, etc.), web servers, search engines, IoT devices, databases and so on. They can reflect the internal state of the system and even some real-world processes. …
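As a hedged sketch of the pipeline described above, the snippet below consumes metric events from a Kafka topic and periodically writes them to S3 in batches. The topic name, bucket name, batch size, and event shape are all assumptions for illustration; they are not taken from the article.

```python
# Minimal sketch: consume application metrics from Kafka and persist them to S3.
import json
import time

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "app-metrics",                        # hypothetical topic the web app publishes to
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)           # e.g. {"metric": "latency_ms", "value": 87}
    if len(batch) >= 100:                 # flush every 100 records
        key = f"metrics/{int(time.time())}.json"
        s3.put_object(
            Bucket="my-metrics-bucket",   # hypothetical bucket name
            Key=key,
            Body=json.dumps(batch).encode("utf-8"),
        )
        batch = []
```

Batching before writing keeps the number of S3 objects manageable and is a common pattern when landing streaming metrics in data lake storage.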

Lucio Daza

Sr. Director of Technical Product Marketing. I love the thrill of the chase when searching for answers in the messiest of data.
