Data Workloads for Cloud & Microsoft Azure

Data Workloads for Cloud & Microsoft Azure

Types of Data, Database, Transactional and Analytical Workloads, OLTP, ETL, OLAP, ACID properties, Batch Data, Streaming Data, Azure tools.

To get an understanding of the data workflow of Microsoft Azure I divided this article into 3 parts:

Part 1 | Types of Data

Part 2 | Types of Database

Part 3 | Data workloads

Let's understand each part in detail.

Part 1 | Types of Data

  • Structured Data

Data that have a predefined schema or are stored in tabular format. For example - ERP(Enterprise Resource Planning) System, Administrative systems, CRM(Customer Relationship Management), etc.

  • Unstructured Data

Data that has no predefined structure, no labels, and no fields. Data that are harder to process automatically or processed using machine learning techniques like, audio, video, and image.

  • Semi-Structured Data

It is not tabular in nature also does not have a specific structure. The shape of the data will change over time. It is not tabular in nature. XML(Extensible Markup Language) has a structure but no schema. For example - log files, data import/export format.

Part 2 | Database types

  • Relational

Relational databases are for structured data. Structure Query Language is used for the manipulation of the data. SQL is a declarative language, which means you do not write instructions on how to execute a query but you describe the intended results.

Relational databases have a schema.

Examples of relations databases:

  • Microsoft SQL Server: It is a commercial server. With high performance and Active directory integration. Active Directory means if you have an ID running you can use AD accounts to connect to the server.
  • MySQL: It is free to use, open-source SQL DB, and easy to install and learn. Azure MySQL DB
  • PostgreSQL: It is free, open-source, has more features but it is also more complex.
  • Non-Relational

Data stored in collection or containers. In these collections, they store arbitrary snippets of data like an XML(Extensible Markup Language) or JSON (JavaScript Object Notation). Snippets with different shapes or subjects can be in a single container.

They do not follow any predefined schema.

Types of Non-Relational Databases

  1. Document Databases: It is used for storing small documents often formatted as XML or JSON.
  2. Wide column store: It is used for storing data in tables with rows and columns but without any schema, by this way each record can have a different shape.
  3. Key-Value store: It is used to store any type of value under a specific locator key. Key-Value stores can scale very well and perform incredibly well.
  4. Graph Databases: It is used for modeling specific problems. Performing queries of relationships between entities.

Examples:

  • Redis

Redis is a high performant, in-memory database, often used for caching and key-value models.

  • Cassandra

Cassandra is free to use, open-source, wide-column store, and highly distributed.

Highly Distributed makes it resistant against hardware failure and suitable for running low-end cheaper hardware.

  • Azure Cosmos DB

It is a cloud-native database for Azure. It is globally distributed. It is a multi-model all in one database including document store, graph database, or key-value store.

Part 3 | Data workloads

Workloads are divided into 2 parts.

  1. Transactional Workloads
  2. Analytical Workloads

Let's understand each of them in detail.

1. Transactional Workloads

Systems like CRM(Customer Relationship Management) that perform a high volume of reads and writes to support their users in getting information out of the system and storing updates.

It includes a high volume of record-keeping transactional systems like CRMs, Banking, etc.

This means that throughout a single day, many updates to the system come in and many users connect to read their part of the data.

As these systems answer queries with a level of authority, it is important that these systems respond fast, often within milliseconds, and the answers are always correct and complete. A workload of this type is called OLTP(Online Transaction Processing).

Transactional refers to one of the components of such a system, the Transaction.

The transaction is a set of updates of the database that are bound together and should be executed as a single whole. For example Transfer of money from one bank account to another bank account. This transaction consists of two parts, a withdrawal on one account and a deposit on another.

Both will work or neither of them.

To support this for transactional workloads, the database adheres to the ACID properties

In the bank transfer, there can be loss of power, failure, and less catastrophic (that means two transactions that want to mutate the same record).

ACID properties in Short

Atomicity

All transactions should be atomic which means, either a transaction is committed and all changes are saved, OR a transaction is aborted and no changes are saved.

Consistency

The database should be in a consistent state before and after each transaction. All schema and relational requirements are met when a transaction starts or commits.

Isolation

All transactions should be executed in such a way that it appears as if they were not running in parallel. One transaction should not see the intermediate results of another yet uncommitted transaction.

Durability

Once a transaction has been committed, the database system should make sure that the changes are stored, even after power failure.

2. Analytical Workloads

Focused on a situation where you have a lower volume of reads, but often over a larger amount of data.

It can be done with

  1. Batch-based process (Warehousing)
  2. Streaming Data.

Let's understand both in detail.

Batch Based Process (Warehousing)

Querying the data after processing is called OLAP(Online Analytical Processing).

It is a type of analytical processing. It is a process where you want to answer business intelligence questions using data that is being generated by transactional workload.

When you do this by moving a large amount of data from one database to another, and querying for answers in the second database, that is called batch processing.

Batch data is often very very large that is why they are called Data Warehouse.

OLTP ---------------> ETL ---------------> OLAP

Online Transactional Processing

Extract Transform Load

Online Analytical Processing

They are separated because of 2 reasons:

  1. OLTP and OLAP are different schemas for storing the same data. This allows it to optimize the schema for the type of queries.
  2. No shared resources, which means that it ensures that analytical queries cannot negatively impact the transactional workload by consuming too many resources.

Streaming Data

The data under investigation is not stored at all but only analyzed as it comes in and calculating the results needed.

The OLTP workload is set up to constantly export all the updates to a streaming data engine. This engine differs from a database in the sense that it does not store the data but queries instead.

Once the data is updated then all the queries are removed from the data engine.

Storing the queries and not the data, not the other way around is the fundamental difference between the steaming data engine and the database.

Any Website

|

Azure SQL DB

|

Azure Data Factory (ETL Tool) - It can run on schedule --> Pipeline(DataFlow with series of activities)

|

Azure SQL DB or Azure Synapse

|

Power BI

What exactly is a pipeline?

  • Query the information you need from the source databases and tables into Azure Data Factory.
  • Combine data from different sources which is called joining.
  • Select columns and write output to target datasets.