How to develop business insights from big data using Microsoft’s Azure Synapse and Azure Data Lake technologies.
Data lakes are an important part of a modern data analysis environment. Instead of importing all your different data sources into one data warehouse, with the complex task of building import pipelines for relational, non-relational and other data, and of trying to normalise all that data against your choice of keys, you wrap all your data in a single storage environment. On top of that storage pool, you can start to use a new generation of query tools to explore and analyse that data, working with what could be petabytes of data in real time.
Using data this way makes it easier to work with rapidly changing data, getting insights quickly and building reporting environments that can flag up issues as they arise. By wrapping data in one environment, you can take advantage of common access control mechanisms, applying role-based authentication and authorisation, ensuring that the right person gets access to the right data, without leaking it to the outside world.
Working at scale with Azure Data Lake
Using tools like Azure Active Directory and Azure Data Lake together, you can significantly reduce the risk of a breach, because the combination taps into the Microsoft Security Graph to identify common attack patterns quickly.
Once your data is in an Azure Data Lake store, you can start to run your choice of analytics tooling over it, using tools like Azure Databricks, HDInsight (Azure’s managed service for open-source frameworks such as Hadoop and Spark), or Azure Synapse Analytics. Working in the cloud makes sense here, as you can use large-scale Azure VM instances to build in-memory models and scalable storage to build elastic storage pools for your data lake contents.
Microsoft recently released a second generation of Data Lake Storage, building on Azure Blobs to add disaster recovery and tiered storage to help you manage and optimise your storage costs. Azure Data Lake Storage is designed to work with gigabits of data throughput. A hierarchical namespace makes working with data easier, using directories to manage your data. And as you’re still using a data lake with many different types of data, there’s still no need for expensive and slow ETL-based transformations.
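To make the hierarchical namespace concrete, here is a minimal Python sketch that builds the directory-style abfss:// URI that Spark-based tools commonly use to address a file in a Data Lake Storage Gen2 account. The account, container and folder names are hypothetical placeholders, and the helper function is purely illustrative rather than part of any Azure SDK.

```python
from posixpath import join as path_join


def adls_gen2_uri(account: str, container: str, *path_parts: str) -> str:
    """Build an abfss:// URI for a path in an ADLS Gen2 container.

    Illustrative helper only; it is not part of any Azure SDK.
    """
    if not account or not container:
        raise ValueError("account and container are required")
    # Strip stray slashes so the parts join into one clean directory path.
    cleaned = [p.strip("/") for p in path_parts if p.strip("/")]
    path = path_join(*cleaned) if cleaned else ""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# Hypothetical account, container and folder names, for illustration only.
uri = adls_gen2_uri("contosolake", "raw", "sales", "2021", "orders.csv")
print(uri)
```

Because the namespace is hierarchical, a layout such as sales/2021/ behaves like a real directory, so you can organise and address data by folder rather than by long flat blob names.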
Analysing data in Azure Synapse
Normally you need separate analytics tooling for different types of data. If you’re building tooling to work with your own data lake, you’re often bringing together data-warehousing applications alongside big data tools, resulting in complex and often convoluted query pipelines that can be hard to document and debug. Any change in the underlying data model can be catastrophic, thanks to fragile custom analysis environments.
Azure now offers an alternative, hybrid analytical environment in the shape of Azure Synapse Analytics, which brings together big data tooling and relational queries in a single environment by mixing SQL with Apache Spark and providing direct connections to Azure data services and to the Power Platform. It’s a combination that allows you to work at global scale while still supporting end-user visualisations and reports, and at the same time providing a platform that supports machine-learning techniques to add support for predictive analytics.
At its heart, Synapse removes the usual barriers between standard SQL queries and big data platforms, using common metadata to work with both its own SQL dialect and Apache Spark on the same data sets, whether relational tables or other stores, including CSV and JSON. It has its own tooling for moving data into and out of data lakes, with a web-based development environment for building and exploring analytical models that go straight from data to visualisations.
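As a rough illustration of querying a file in place, the Python sketch below assembles the kind of T-SQL statement a Synapse SQL pool can run against a CSV file using the OPENROWSET function. The storage URL is a hypothetical placeholder, the csv_query helper is an invention for this example, and the options shown (a header row, parser version 2.0) are assumptions about the file; check the current OPENROWSET documentation before relying on them.

```python
def csv_query(file_url: str, top: int = 10) -> str:
    """Return T-SQL that reads a CSV file in place via OPENROWSET.

    Hypothetical helper for illustration; it only builds the query text.
    """
    top = int(top)
    if top < 1:
        raise ValueError("top must be a positive integer")
    # Double any single quotes so the URL is a valid T-SQL string literal.
    safe_url = file_url.replace("'", "''")
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'CSV',\n"
        "    PARSER_VERSION = '2.0',\n"
        "    HEADER_ROW = TRUE\n"
        ") AS [result];"
    )


# Hypothetical storage URL, for illustration only.
query = csv_query(
    "https://contosolake.dfs.core.windows.net/raw/sales/2021/orders.csv", top=5
)
print(query)
```

The resulting text can be pasted into a SQL script in the development environment; nothing is imported or transformed first, which is the point of querying the lake directly.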
Synapse creates a data lake as part of its setup, by default using a second-generation BLOB-based instance. This hosts your data containers, in a hierarchical virtual file system. Once the data lake and associated Synapse workspace are in place, you can use the Azure Portal to open the Synapse Studio web-based development environment.
Building analytical queries in Synapse Studio
Synapse Studio is the heart of Azure Synapse Analytics, where data engineers can build and test models before deploying them in production. SQL pools manage connections to your data, using either serverless or dedicated connections. While developing models, it’s best to use the built-in serverless pool; once you’re ready to go live you can provision a dedicated pool of SQL resources that can be scaled up and down as needed. However, it’s important to remember that you’re paying for those resources even if they’re not in use. You can also set up serverless pools for Apache Spark, helping keep costs to a minimum for hybrid queries. There is some overhead when launching serverless instances, but for building reports as a batch process, that shouldn’t be an issue.
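The trade-off between the two pool types comes down to simple arithmetic, sketched below in Python. It assumes the serverless SQL pool is billed by the volume of data processed and the dedicated pool by the hour it is provisioned, whether or not it is in use. All rates and figures are made-up placeholders, not Azure prices, and the function names are hypothetical.

```python
def dedicated_cost(hourly_rate: float, hours_provisioned: float) -> float:
    """Cost of a dedicated pool: billed for every provisioned hour."""
    return hourly_rate * hours_provisioned


def serverless_cost(rate_per_tb: float, tb_processed: float) -> float:
    """Cost of a serverless pool: billed only for data processed."""
    return rate_per_tb * tb_processed


def break_even_tb(hourly_rate: float, hours_provisioned: float,
                  rate_per_tb: float) -> float:
    """Data volume at which both options cost the same."""
    if rate_per_tb <= 0:
        raise ValueError("rate_per_tb must be positive")
    return dedicated_cost(hourly_rate, hours_provisioned) / rate_per_tb


# Placeholder figures only; these are NOT real Azure prices.
HOURLY_RATE = 1.5       # currency units per provisioned hour (assumed)
RATE_PER_TB = 5.0       # currency units per TB processed (assumed)
HOURS_IN_MONTH = 720.0  # a pool left running for a 30-day month

monthly_dedicated = dedicated_cost(HOURLY_RATE, HOURS_IN_MONTH)
threshold = break_even_tb(HOURLY_RATE, HOURS_IN_MONTH, RATE_PER_TB)
print(f"Dedicated pool for the month: {monthly_dedicated}")
print(f"Serverless is cheaper below {threshold} TB processed")
```

With these placeholder numbers, a dedicated pool left running all month only pays off once queries process more than the break-even volume, which is why scaling down or pausing idle dedicated resources matters.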
Azure Synapse is fast: building a two-million row table takes just seconds. You can quickly work with any tabular data using familiar SQL queries, using the Studio UI to display results as charts where necessary. That same data can be loaded from your SQL store into Spark, without writing any ETL code for data conversion. All you need to do is create a new Spark notebook, and then create the database and import it from your SQL pool. Data from Spark can be passed back to the SQL pool, allowing you to use Spark to manipulate data sets for further analysis. You can also use SQL queries on Spark datasets directly, simplifying what could otherwise be complex programming tasks, such as unifying results from different platforms.
One useful feature of Azure Data Lakes using Gen 2 storage is the ability to link to other storage accounts, allowing you to quickly work with other data sources without having to import them into your data lake store. Using Azure Synapse Studio, your queries are stored in notebooks. These notebooks can be added to pipelines to automate analysis. You can set triggers to run an analysis at set intervals, driving Power BI-based dashboards and reports.
There’s a lot to explore with Synapse Studio, and to get the most from it requires plenty of data-engineering experience. It’s not a tool for beginners or for end users: you need to be experienced in both SQL-based data-warehousing techniques and in tools like Apache Spark. However, it’s the combination of those tools and the ability to publish results in desktop analytical tools like Power BI that makes it most useful.
The cost of at-scale data lake analysis will always make it impossible to bring to everyone. But using a single environment to create and share analyses should go a long way towards unlocking the utility of business data.