Optimize your data warehouses, not just your warehouses13 July, 2022
The importance of optimizing warehousing, maintaining a structured storage system, and clear record keeping are tried and tested methods of warehouse management.
As production companies grow, storage inevitably becomes an issue. How and where should we store parts used in the construction process? How about finished products? What about spare parts? Should they be stored on-site or off-site? Do we want a centralized or decentralized system? How do we even keep track of all this stuff which is coming from different suppliers who don’t even use the same naming system?
These are issues which are not only present in the physical world, but also the digital one – data storage. Just like in physical world, the choices we make on how our data is stored and designing a system which optimizes that storage approach can have a massive impact on your bottom line.
Data is said to be the 21st century’s most valuable asset and companies which have a means to harness its power have a distinct competitive advantage. Today it’s not just a story of how to optimize our warehouses, whether to store onsite or offsite, and deciding how big of a warehouse to build, it’s also about optimizing your data warehouses, data lakes, and storage locations to store effectively while meeting the needs of your modern data teams and those valuable analytics and machine learning applications that come from them.
Before we delve deeper, let’s first cover a few key terms in today’s ever-changing modern database management system (DBMS):
Databases vs. Data Warehouses vs. Data Lakes
All of these have one thing in common: They all store data! But how they store it and their uses (and costs) are very different.
Let’s start with the basics – Databases:
Loosely defined, a database is a collection of organized data representing something in the real world. The most common type of database is for structured data (data stored in rows and columns) – although not the only type – and is often referred to as a SQL database. Ever wonder where the instruments monitoring wind speed on a turbine or pressure on a gas pipe are keeping the data? The same place your local bank and jeans retailer do when you swipe your credit card to modify the balances – A database. Data is captured live and sent to a database for live storage. As such, databases often need to be able to take high amounts of data on the fly and update frequently.
Ok, so what’s a Data Warehouse then?
While databases are designed to capture data, Data Warehouses are designed for analytical processing (OLAP) on historical data. Data will generally be brought from multiple sources (such as your databases) via ETL pipeline and summarized into a strict data schema. As a source of historical data, it doesn’t need to be updated as often as your database – perhaps hourly, daily, or even monthly. Your warehouse is where you store primarily structured data optimized for analysis to be accessed by your Analysts and BI teams for later processing.
When to use a Database vs a Data Warehouse?
Ensuring your database’s integrity is vital. You and your customers need to be able to trust the data in your database just like you need to be able to trust your bank not to double charge you.
Great, so let’s get our Analysts to analyze our databases and gain insights from it! For the same reason, your databases should generally be optimized for transaction recording (OLTP), it shouldn’t be a source of data analysis. Running frequent SQL or NoSQL queries and updates on your database can cause a significant slow down in its run time leading to delayed and outdated data appearing on dashboards or possibly even creating errors leading to data integrity issues. So how do we solve this issue? Simple – analyze your data warehouse! Once historical data has been sent to the warehouse for storage, it is free to be poked around in without worrying about causing issues to your data or customers. Keeping a rigid schema of data storage in your warehouse is not only something which allows for data processing automation, but may return it’s value numerous times over to your organization.
Hooray! Then what exactly is this Data Lake you’re talking about then?
A Data Lake is a little bit like what it sounds like – A vast area we can store large amounts (often in petabytes) of raw, unprocessed data. Don’t know what to do with your data? Cool, throw it in the data lake. Got sound and image data of your pumps and generators you don’t know why you recorded? Cool, throw it in the data lake. Don’t even know where this data came from? Cool, throw it in the data lake.
Data Lakes store large amounts of data in any format from structured to partially structured to unstructured (think nice tidy tables, to web browser JSON data, to images or sound files) typically coming from a large variety of sources.
When to use a Data Warehouse vs a Data Lake?
So why are we storing all this anyway? While your Data Warehouse stores highly structure data for planned analysis later (typically your predictable analytics sent to your dashboards etc.), your Data Lake is data that you hope to gain your deeper knowledge from. This is the data which is the playground of your Data Scientists who can use machine learning (ML) techniques on the raw data to gain insight and built models. Maybe that weird sound data you recorded from your pumps could be exactly what you’re looking for! Your Data Scientists may be able to build an ML model to look for patterns in the sound data, indistinguishable to the human ear, which can detect issues ahead of time to optimize maintenance scheduling and avoid downtime.
How to optimize your storage
Well, this is grand and all…. But how does this help me optimize anything?
Much like when you’re reaching a point of deciding if you are better off building a new physical warehouse or storing off-site and how big of a warehouse to build (because construction isn’t free, right?) – the same issues exist in the world of data.
Speed and frequency vs size.
The key things to take into account to optimize your data storage are balancing the need between speed and frequency vs size.
How often does your data need to be accessed and updated? If the answer is frequently (maybe to the second or more) then you need a system set up for this. Put simply, the more we access our data, the more expensive it is to maintain (particularly if stored off-premise). This data and its databases also need to be maintained to ensure integrity. To minimize costs, you want a database of the right size for your task. Here, size is not an asset. A series of smaller databases may well fit your situation better, be able to update more reliably and quicker, and at lower cost.
If however we are talking about massive amounts of data that will be accessed infrequently (or perhaps never), a storage solution optimized for size is likely a better choice for you. Here, we’re looking for cheap storage, lots of it, and preferably storage which can auto scale to meet your needs rather than charging for space you’re not using. You are likely going to run into issues on the hardware side and other issues such as potential overheating leading to damage and potential data loss are possible issues to consider. Keeping it onsite may or may not be the best solution here depending on the cost of hardware and electricity in your location – not even considering reliability issues with having all data in one location.
To store on site or on a cloud provider
The final main factor to consider is whether it’s better to store your data onsite or off-premise via a cloud provider such as Amazon’s AWS, Microsoft’s Azure, Google’s GCP, or one of a variety of other cloud providers.
The traditional approach to storing data is to store it on premise. This requires the need for constant server access and ensuring adequate storage capacity. This involves a certain amount of planning and investing ahead of time as the last thing you want to do it have a server outage or lack of storage space available leading to lost data. Conversely, it’s also not wise to be paying for and maintaining storage capacity which is not being used.
Storing on the cloud provides an avenue for easy data storage and data access. Instead of planning ahead and estimating how much space will need to be bought, nowadays it is also possible to rent cloud storage space, meaning an easily scalable solution is right at hand and can be automated to only charge you for what you actually use. As well as storage, cloud providers can also provide data analysis and ML solutions with the compute done on the server, at the site of the data.
In many cases, there are tens or even hundreds of options of type of storage to be chosen from to suit your needs at hand. In the cloud, even more so than with on-site data storage, choosing the right option for size vs frequency of update is even more important to avoid paying excessive fees to maintain systems not suited for your needs
Which solution is right for you?
Finding which solution is correct for you largely depends on your situation. With a smaller company with little data to manage, simply using a few relational databases managed on a MySQL or Postgres server may suit your needs at a very low cost. As your company grows, you will likely find yourself needing to maintain a system of databases, data warehouses, and a data lake to meet your needs and finding an optimal database management system solution to optimize performance and minimize costs. Likewise the decision to go fully or partly on cloud (and on how many clouds to use as your jurisdiction may require) is a question that needs to be carefully answered based on your unique situation. Choosing the wrong one may end up costing you tens of thousands of dollars a month, while choosing the optimal solution will not only lead to an efficient, maintainable system, but potentially one which can return its value many times over in valuable BI and ML knowledge.
Need solutions to optimize your data store problems? Our team of data experts at Keel Solution are here to help.
Enjoyed this article? Check out these others from our Keel Data series:
Link – The cost of poor data in a data driven world – Keel Solution
Oil and Gas Asset Data Management
Harness the technology advancements and incorporate the Industry 4.0 into your Oil&Gas businessMore
Windpower Competency Center
Opening the door of Industry 4.0 possibilitiesMore
We are ready to help!
Request consultation, ask a question or share your feedback. Just get in touch!