Wortell IQ Data Platform
Modern organizations increasingly require reporting and analytics capabilities to run their business effectively and competitively. Organizations that start with reporting usually build their first reports as PoCs, connecting directly to (on-premises) operational data sources. When these reports and the requirements behind them grow, organizations find this approach becomes unmanageable.
To grow from ad-hoc reports and analytics on operational data to an enterprise that can rely on its data as a basis for critical business processes, the underlying data has to be combined, prepared, enriched and stored in a more useful form.
The Data Platform is a key component of a data-driven organization. It provides a scalable and future-proof foundation for Business Intelligence, Analytics and optional AI workloads. Implementing this platform in a cloud service model ensures it can start small and grow easily as demand rises.
The Wortell IQ data platform gives organizations the possibility to implement an enterprise-quality Data Platform containing all required services. It can start small at pay-as-you-go cost and grow to any size.
Some usage examples of the Data Platform are:
A standardized data warehouse, set up in days, monitored and maintained according to best practices from Wortell and Microsoft.
It helps customers to reliably build reports on business data.
Having a data warehouse enables reports to be implemented faster, more accurately, with less complexity and by developers who need less knowledge of the underlying process and source system.
The data warehouse is used to extract data from one or more sources and to combine it in a structure that is suitable for reporting on processes. The business logic that applies to these processes is applied when loading the data. This enables report builders to build reports on this process data faster, without needing intimate knowledge of the process.
A data warehouse periodically extracts data from source systems, then transforms and stores it. This process can be very compute-intensive and must finish within a limited time window.
For this reason, the servers executing these workloads are traditionally sized to handle this peak load.
However, since the ETL process only runs a few times a day (or every hour), this expensive hardware is idle most of the time.
The modern cloud data warehouse model enables the components to automatically scale up to maximum performance during these peak-load moments and to scale down to a much cheaper size while the system is not in use.
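In Azure SQL Database, for example, this elasticity can be configured with the serverless compute tier, which scales between a minimum and maximum number of vCores and can auto-pause the database after a period of inactivity. An illustrative ARM template fragment for such a database (resource names omitted, values are examples only):

```json
{
  "sku": { "name": "GP_S_Gen5_4", "tier": "GeneralPurpose" },
  "properties": {
    "autoPauseDelay": 60,
    "minCapacity": 0.5
  }
}
```

Here the database scales between 0.5 and 4 vCores and pauses after 60 minutes of inactivity, so compute is only billed while the ETL or reporting workload is actually running.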
For some high-security and privacy-sensitive solutions, the cloud is also the better option, because Azure's security features nowadays surpass those of typical on-premises environments.
Wortell CloudCost Software
CloudCost is a Wortell SaaS product that relies on the standardized IQ Data Warehouse. It enables customers to gain insight into the costs and trends that Azure and Office 365 generate per department, solution and business application. For each customer, a new instance is automatically deployed via Azure DevOps. We are one of our own large customers, which improves the quality of the solution and the operations of the IQ platform for all customers.
Big data processing with Azure Synapse
When the amount of data is very large and the solution pattern fits, Synapse is used. Synapse and Databricks enable customers to process large data volumes and to extend their analytics capabilities. This can be combined with SQL DB, Data Lake storage or Analysis Services to hold the aggregated results and to limit unneeded Synapse processing costs.
Students use the IQ platform to experiment with compute-intensive AI and analytics solutions in a sandbox that is specific to their research assignment.
The challenge is to provide students with AI, ML and analytics capabilities while also keeping control of the potentially large costs these ML experiments can generate. The IQ platform is extended with a budget monitoring and control solution and a Data Catalog to manage and share datasets. The solution grants the researchers insight into their available budget and shuts down their individual sandbox environments when budgets are exceeded.
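The core of the budget guard is a simple decision per sandbox. A minimal, hypothetical sketch of that logic (the names, thresholds and the `Sandbox` type are illustrative; in the real solution the spend figures come from Azure cost data and "shutdown" deallocates the sandbox's resources):

```python
from dataclasses import dataclass


@dataclass
class Sandbox:
    """One researcher's environment with its budget and current spend (illustrative type)."""
    name: str
    budget_eur: float
    spend_eur: float
    warn_ratio: float = 0.8  # warn once 80% of the budget is consumed (example threshold)


def evaluate(sandbox: Sandbox) -> str:
    """Return the action the budget guard takes for one sandbox."""
    if sandbox.spend_eur >= sandbox.budget_eur:
        return "shutdown"  # budget exceeded: stop the sandbox environment
    if sandbox.spend_eur >= sandbox.warn_ratio * sandbox.budget_eur:
        return "warn"      # approaching the limit: notify the researcher
    return "ok"


# Example run over three research sandboxes at different spend levels
for sb in [Sandbox("ml-vision", 500, 120),
           Sandbox("nlp-lab", 500, 430),
           Sandbox("genomics", 500, 510)]:
    print(sb.name, evaluate(sb))
```

The real implementation runs this evaluation on a schedule against aggregated cost data per sandbox, so a runaway ML experiment is stopped within one monitoring interval rather than at the end of the month.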
A code-first solution
Wortell has deployed many data platform infrastructures. At the base of most of those platforms are requirements and components that recur time and time again. We have gathered these and created a product that incorporates these best practices and is deployed and updated in an automated manner.
Some other components are common but not essential for every implementation. We offer these as optional features that can be enabled at any time. In many cases they are added once the platform proves successful in the organization and new workloads arrive.
Examples of these options are:
- DevOps processes to update and test functionality without impacting production versions
- Big Data batch or (IOT) stream processing
- Infrastructure to reliably run and manage Machine Learning workloads
The IQ platform is implemented as a code-first solution and deployment is automated. Practically, this means that we use parameter files and can quickly deploy or update the platform by running the deployment script. Our templates are maintained continuously to ensure the platform is always up to date with the latest standards on security and manageability.
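As a sketch, a customer-specific deployment then boils down to a small parameter file fed to the shared templates. The parameter names below are hypothetical examples, not the actual IQ template interface; only the `$schema` is the standard ARM deployment-parameters schema:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "environmentName": { "value": "contoso-prod" },
    "sqlDatabaseSku": { "value": "S1" },
    "deployDatabricks": { "value": false },
    "deployDataCatalog": { "value": true }
  }
}
```

Because everything is parameterized, deploying a new customer instance or enabling an optional component is a change to such a file plus a run of the deployment script, rather than manual work in the Azure portal.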
For the deployment of the Data Platform, a functional Azure environment is required.
It is assumed that an instance of Mission Critical Azure is deployed and that the environment is secured, monitored and managed. Deployment of this Azure blueprint can also be part of the project as an optional component.
The base implementation consists of the following components that work together:
- Azure SQL DB to store the data warehouse data model and data
- Azure Data Factory to orchestrate data processing and to connect data sources
- Azure KeyVault to securely store secrets like passwords, connection strings, etc.
- Azure Virtual Network to separate the Data Platform from the internet
- Azure Data Lake hierarchical storage to store datasets
Optional components that can be added:
- DTAP environment including DevOps CI/CD release processes
- Azure Synapse SQL DWH
- Azure Databricks
- Azure Data Catalog
- Azure Event Hubs
- Mission Critical Azure (if not already present)
Azure SQL DB
This is the most cost-effective way of storing data in a structured form that is directly usable for e.g. reporting workloads. Implementing a data warehouse in SQL DB enables e.g. Power BI developers to quickly create reports and opens up possibilities for self-service BI.
SQL DB is a much more cost-effective solution than Synapse SQL DWH and the best choice when data volumes are manageable (gigabytes rather than terabytes).
By default it is connected to a VNet to ensure a secured environment.
Azure Data Factory
A new data platform contains no data yet. ADF is used to connect to data sources that are on-premises or cloud-based. The data is ingested and stored for further processing.
The processing of the datasets is orchestrated via Data Factory. Pipelines execute activities that process and load the data via Azure Spark (Data Flow), SQL stored procedures or, optionally, Databricks or Synapse.
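As an illustration, a pipeline that copies source data to a staging table and then runs a transformation stored procedure could look roughly like the fragment below. Activity and procedure names are hypothetical, and the dataset and linked-service references that a complete pipeline requires are omitted for brevity:

```json
{
  "name": "LoadSalesData",
  "properties": {
    "activities": [
      {
        "name": "CopySalesToStaging",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      },
      {
        "name": "TransformStagingToDwh",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [
          { "activity": "CopySalesToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": { "storedProcedureName": "dwh.LoadSales" }
      }
    ]
  }
}
```

The `dependsOn` element is what makes ADF an orchestrator: the transformation only runs after the copy activity has succeeded, and failures are surfaced in ADF's monitoring.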
The ADF can have a small self-hosted integration runtime installed in the on-premises environment to communicate with the on-premises infrastructure securely and encrypted, without the need to set up dedicated VPN tunnels.
Connecting ADF to a VNet is available to us as a preview feature.
Azure KeyVault
All secrets, such as passwords, connection strings and tokens, are stored in the Key Vault as name-value pairs. This vault is only accessible to security personnel who need access to the actual passwords.
The Data Platform Azure services use the secrets in the KeyVault to configure Data Platform services, without the actual values being visible to the users managing those services. (E.g. the Data Factory is configured with the secret named 'SalesForcePassword' without anyone having to know the actual password.)
This also makes password rotation much easier and allows CI/CD release pipelines to deploy solutions with identical configuration across DTAP environments.
The values in the KeyVault can only be accessed from the VNet. When multiple security teams need access to different keys, multiple KeyVaults can be deployed.
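For example, an ADF linked service references the secret by name via ADF's Key Vault integration, so the password itself never appears in the configuration. The service and secret names below are illustrative, matching the 'SalesForcePassword' example above:

```json
{
  "name": "SalesforceLinkedService",
  "properties": {
    "type": "Salesforce",
    "typeProperties": {
      "username": "integration@contoso.example",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "DataPlatformKeyVault",
          "type": "LinkedServiceReference"
        },
        "secretName": "SalesForcePassword"
      }
    }
  }
}
```

Rotating the password now only requires updating the secret in the KeyVault; the linked-service definition, and therefore the DTAP release pipeline, stays unchanged.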
Azure Virtual Network
A datacenter in Azure is still a datacenter and needs network security to prevent direct access to resources from the internet. Some Azure services have a firewall that can be configured per service. For environments containing valuable data, we recommend using a private network to better manage access and network security.
Azure Data Lake hierarchical storage
This is the center of the Data Platform. Hierarchical storage is the location where large datasets are stored between processing steps or for archiving.
The Data Lake is based on Blob storage and can scale practically without limit. It can be structured with directories to manage the locations of the different datasets and their permissions.
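A common convention, shown here purely as an illustration rather than a fixed IQ standard, is to separate zones by directory and assign permissions per zone:

```
/raw/{source}/{yyyy}/{MM}/{dd}/   unmodified extracts exactly as ingested
/curated/{domain}/                cleansed, conformed datasets for reporting
/sandbox/{team}/                  experiments, with limited retention
```

Because permissions are set per directory tree, a research team can be granted write access to its own sandbox while only reading curated data, and raw data stays immutable for reprocessing and audit purposes.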
DTAP environment including DevOps CI/CD release processes
To be able to make changes to a production application, development and testing environments are needed. The CI/CD automated deployment process guarantees that releases are taken into production with minimal downtime and disturbance.
The Wortell CloudCost SaaS solution is built on the IQ Data Platform and makes heavy use of this option to deploy new releases to the dedicated instances that are provisioned for each customer.
Azure Synapse SQL DWH
For Big Data scenarios, Azure Synapse is a massively parallel processing database system that delivers some of the highest performance available on the market. It can process terabyte or even petabyte workloads at speeds few systems can match.
We have several customers that have extended the Data Platform with this feature.
We are evaluating the preview of the new Synapse Analytics workspace, which can provide this functionality at a pay-as-you-go price. The existing version is relatively expensive, but the preview version is billed only for the data that is read rather than for reserved compute, and comes with an integrated ADF and Spark cluster.
Azure Databricks
For compute-intensive ETL, analytics or machine learning tasks, Databricks clusters can provide the compute power to process the data.
Azure Data Catalog
When the Data Platform grows or the number of users is large, the Azure Data Catalog manages the existing datasets, their owners and the details that help users find the correct dataset for their analytics or reporting work.
Azure Event Hubs
For real-time workloads, streaming data can be captured and simultaneously stored and processed by e.g. Databricks. Results can be stored in SQL DB, Data Lake Storage or Synapse, providing a dataset that is always up to date.
Mission Critical Azure
When Azure is not available yet, this option can quickly deploy a secure and managed Azure environment, automated and scripted for maximum reliability and speed.