Blog | 17 September 2020
Data warehouse Q&A with Mitesh Desai
Head of Mobility and Security
In this Q&A session between Kocho’s Technical Director, Mitesh Desai, and COO George Georgiou, we’ve covered some frequently asked questions about Data Warehousing.
We hear the term data warehouse a lot, and it means a lot of things to a lot of people. Could you give me a definition of what a data warehouse is? In addition, we hear the terms ‘traditional data warehouse’ and ‘data warehouse in the cloud’. Can you tell me what the differences are?
From our perspective, data warehouses are the means to be able to collect data from different data sources, so structured data sources or unstructured data sources. From that we can create a single layer of data that becomes your sort of master copy, and then the rest of the business can tap into that data
So whether it’s to create some visualizations or just get some actionable insights from that data to make some operational changes or strategic changes within the business, that’s where the modern data warehouse is coming in. The infrastructure side of of putting a data warehouse together with current cloud technologies is much faster. You can start using one of these cloud based modern data warehouses within five days and have data coming in ingesting, transforming and actually producing some results. Whereas, with traditional data warehouses you would have to build the infrastructure – servers, your data management, all of the services that are now available out of the box that just tap into your current data.
A common concern I hear is that the cloud and especially the process of going to that cloud based data warehouse is very expensive. What would the cost comparisons be for on-premises versus the cloud version?
It’s actually the opposite. We’ve seen traditional data warehouses that take a long time to put together and maintain and manage – and also the costs based on licenses and especially enterprise software licenses can be huge.
I guess the biggest difference is, you can turn on and off technology that you don’t need in the cloud. You can turn off the data warehouse when it’s not actually providing you any valuable insights. And everything that you’re running in that data warehouse is based on a consumption model
So if data doesn’t need to be adjusted or transformed or the data sources are only coming in once a day, that’s all we’re actually running. So the costs are very small. One example is where we’re seeing a data warehouse that’s taking over twenty five million feeds, which is a vast amount on a daily basis, yet it still costs less than the traditional data warehouse.
OK, so moving on from the definitions. What would be a typical use case, because I guess when we talk about data warehouses, it’s normally within the affordability range of enterprises or very large mid-sized businesses. What is it that brings cloud-based data warehouses within the reach of the SME?
I think a number of things, especially now, it’s the technology that’s available out of the cloud platform itself, so without having to engage data scientists, data experts, SQL experts, there’s a lot that’s already available in the actual cloud platform. So small to medium sized businesses can take advantage of that with code-free setups and then use partners like us to accelerate that process when necessary. But the vast majority of putting a data center together is totally invisible, and a lot of the concepts are simplified in the cloud.
That means that you can, as a small business, start taking advantage of creating visualizations and getting some insights into your business via a quick method. One of the drawbacks of a traditional data warehouse is that if a new data feed comes in or a column changes in Excel, or the data structure changes, you’re looking at reengaging your data team, which is comprised of three or four people. In a cloud based solution it just learns that there’s a new problem or different type of data and starts matching it together. It then gives you suggestions on how to incorporate that data.
I want to hone in on some examples. Let’s say I operate a financial services organization, I do a lot through spreadsheets. Traditionally, we’ve not really been in the position to build a large scale data warehouse environment. We have a lot of spreadsheets, and we take all our feeds from the core applications that we use. Extracts as CSV files in most cases. Is that a suitable source for a data warehouse?
Yes – sources could be unstructured data or structured data, so they could come directly from applications, from Excel spreadsheets, from SharePoint, from live streaming data. So whether it’s feeds from Twitter or from your own feedback forms, whatever the structure, the data can come from a vast number of places. The data warehouse collects that data and creates a uniform timestamp and then stores it within the database. So you can then tap into that on whatever device you happen to be using, and it could even lead directly from the business’ social channels, CRM or sales force. And so you have the ability to replace these time consuming processes of creating spreadsheets and extracts from applications, then merging them together to create the Excel report. It can all be done with the data injection process of the data warehouse.
OK, so I have the spreadsheets and historically, if want to go into an analysis I have to open my current spreadsheet alongside all the historic sheets. Will I still have to do that? And will any of my historic data get overwritten?
In the data warehouse, there’s multiple versions and copies kept. So the structure of the data warehouse is very similar to operating a word document in SharePoint where you have multiple versions. You also can create historical points in the data so you can go back in time as well. Most of the Azure SQL databases start with a terabyte of data and then the cost of the data stored in cloud platforms is relatively cheap. So to get two, three, four hundred terabytes of data is still possible within the data warehouse without it costing the same as it would if you installed that on a traditional data warehouse.
One of the things I have to do manually at the moment is when I add new data, sometimes I have to change the format. For example if I want dates in UK date format rather than US, or if I’m bringing in phone number data and I want to remove the dial assist numbers to keep everything uniform. Do have to do that in my spreadsheets before I bring it into the data warehouse?
One of the benefits of the modern data warehouse is the data preparation option. We can use data analytics and machine learning. We tend to make it more prep and train of the data rather than the machine. So what we mean by that is that the data will come in and the machine will start learning what that data structure should look like. So, for example, if the date format should be UK, it will start rearranging that data, making sure you can format if it’s missing data, changing dial assist codes.
You said you can actually prep and train, so does that mean we are using AI to clean the data? And what else is possible with that?
In the Azure world there’s a concept called Azure databricks, which is an open source language that helps you to code in machine learning. It gives you three options. You can use Scholar, Python and other appropriate language sources to be able to code in machine learning format. The area that we specialize in is Scholar coding, which allows you to program, to train the data, to say if this should have a UK date format, then this is an example of the UK date
It will start learning that and then go through your data sets and replacing where it sees any other data format. This can be expanded to many other uses around machine learning. So just this is changing dates and formats, but you can actually go away and grab information that will help you validate that data. Going back to your example of a financial organization, if you want to validate any data that you’re collecting, or getting as part of an application form or a website based form, you can then get the database to search out data from companies house or another reputable website, compare it with what’s being put in, and then flag up any exceptions where the data is either incorrect or misleading. So the data preparation can go from just updating formats, to getting the data in the right areas, to actually validating the data as well.
I’ve now got all my data in the data warehouse from external sources, I also run a CRM dynamics and I’ve got a finance system, a business system, a practice management system that have APIs. Can I now build dashboards in a visualization tool to bring everything all together or I just limited to the data in the data warehouse?
The data that sits in data warehouse can be exposed using API management. So there’s two key benefits of doing that. You can directly give access to other businesses, to your data securely through an API. So the days of FTPing or providing these over secure email or getting access to the data by sharing has changed dramatically. When you get to this model, you can securely share an API and then allow businesses to connect directly to the data that you’re servicing
And also you can connect directly to the same data by using visualization tools such as PowerBI and Tabular, and also create dashboards that you can then surface around the business and put back on the websites as well. The key piece of the data warehouse is to get that single layer of data, as we mentioned earlier, on the board. So when you get to this side of the journey, you can then start using multiple tools to be able to distribute that data.
One final open question – What would be your top tip for anyone considering moving to a data warehouse?
I think it’s key to make sure that you concentrate on where your data sources are coming from, to pick sources that are going to give you valuable insights. Although we said that the consumption and the running of our data warehouse can be scaled to large numbers, you’ve also got to work out what data is valuable and bring that data so you can create a nice uniform layer of data that the whole business can use
As an example of that, we’ve had a couple of recent engagements with customers where they’ve said they’ve got terabytes of data that they’ve been collecting and they feel it’s right for them now to look at data warehouse solution in the cloud so they can add some intelligence to the data preparation. And when we actually look at that data, it is not the right data that actually is going to add any value to the business. Data that they could have collected, such as conversations or training material could have added a lot more value for what they do. Rather than collecting data that is only going to be used once. And so a good tip would be before you embark on this journey and start ingesting data from multiple sources, have a look at what the data is going to provide value for in terms of what you need to deliver for the business.
Ready to ‘Become greater’?
When you sign up to our mailing list, you’ll get the best content, expert resources, and exclusive event invites sent directly to your inbox.
Mat is Kocho’s Head of Mobility and Security. He leads a team of consultants and architects that live and breathe secure transformation – delivering excellence across Microsoft 365 and Azure.
Latest blog articles
Azure AD B2B vs B2C: What are the key differences between Microsoft’s external access products?
The definitive guide to Azure AD: Everything you need to know
Securing your path to passwordless authentication: A quick guide to modern sign-on methods
The definitive guide to Microsoft Sentinel: Everything you need to know to get started with Microsoft’s cloud SIEM
We’re here to help you on your journey towards becoming greater. Get in touch to find out how.