Data onboarding and storage: the foundations of analytics and AI

Aug 10, 2018

Everyone tends to focus on the flashy things you can do with data. But data onboarding and storage is fundamental to doing the headline-grabbing stuff. Trust in data is paramount. Unless you know your data has been ingested and stored correctly, you can’t trust the insights it provides. That’s why you need to start with the right processes to onboard and store it.

Data onboarding and storage is a critical part of getting AI-ready. AI models need to be trained on data. If the data is poor quality, you cannot trust the outputs of the AI. Build a robust platform that allows you to carry out complex activities such as training an AI. A clear pipeline of trustworthy data that the AI can read will set you up for long-term success. So, don’t rush into the clever stuff without doing the basics first.

Data storage by any other name…

The term ‘data lake’ has become as ubiquitous as data itself. Go to any data-focussed conference and you’re sure to see a few data lake solutions on show. Everyone has a different interpretation of a data lake, and it can often come with a lot of baggage.

I prefer the term ‘data platform’. What most people want is a platform that enables them to not only store a wide range of data but also access and use it in a wide variety of different ways. Data platforms encompass this.

Moving on from data warehouses

As for data warehouses, there used to be a couple of design approaches in common use: Kimball’s and Inmon’s. In some circles, whichever approach you chose became something of a religion (there were Kimball followers and Inmon fans).

In the past, there was a great need to be efficient in your data storage design. Now, with the increased computing power and storage available, that need isn’t as critical. Some things that were complete no-nos in the past, like keeping the same data in more than one place, are sometimes acceptable. If the use case calls for it, and you have enough storage, then why not?

Storage options

Storage-wise, we’ve never had more variety in off-the-shelf options. With so many different systems available, there is no one-size-fits-all answer for organisations. Instead, you can choose your system based on your use cases.

In fact, we recommend this approach. Look at your use cases first, then build your storage solutions from that.

Most organisations are likely to be well-served by a core, tabular-style relational database. The majority of organisational data is in this format. Plus, the skills needed for dealing with this database technology are widely available, so you won’t have to invest too much in hiring specialist team members. Consider this as a starting point. As your use cases get more advanced, begin to explore other storage solutions.

One example where a different style of storage is needed is mapping out relationships between people. Say an organisation wishes to map out its internal talent: which departments speak to each other, who in a team solves the most problems, and where common failure points occur. A graph database, which stores these relationships directly and makes them easy to query and visualise, is the best bet in this scenario.
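To make the relationship-mapping idea concrete, here is a minimal sketch in plain Python. It is an in-memory stand-in for the kind of question a graph database answers, not a real graph database, and the names and interactions are invented for illustration.

```python
from collections import defaultdict

# Hypothetical "who spoke to whom" records; every name here is made up.
interactions = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "alice"),
]

def build_graph(edges):
    """Build an undirected adjacency map: person -> set of people they talk to."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def most_connected(graph):
    """Return the person with the most direct connections."""
    return max(graph, key=lambda person: len(graph[person]))

graph = build_graph(interactions)
hub = most_connected(graph)
print(hub, sorted(graph[hub]))
```

A tabular database can store the same pairs, but questions like "who is connected to whom, and through whom" turn into chains of self-joins. A graph store treats those connections as first-class data, which is why it suits this use case.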

As the complexity of your use cases and their required storage increases, the pool of people you can recruit to deal with it shrinks. This makes certain projects more costly than the ones you first begin with. When you reach this point, you must recognise which projects are worth investing in, and which should be put on hold. Do a cost/benefit analysis of each and every use case.

Effective data onboarding

As well as considering storage solutions, you must build efficient data onboarding routines. If these are wrong, your delivery speed will suffer along with the trust in your data. As a start, you should:

Build frameworks: This creates a standard for delivery across your organisation and makes code easy to support. It also makes it easier to change individual parts if they aren’t working for you.
Don’t try to get it right first time: This follows from the previous point. Don’t expect everything to be perfect the first time. It’ll have to go through a few iterations before it works for your organisation and use case.
Keep ingesting data: Perhaps a bit controversially, I recommend ingesting a lot of different data sets and always adding new sources, as long as you can foresee a use for them that aligns with your business goals.
Don’t get hung up on data format: Start with what you want to do first. From there, work out what data sets, whether structured or unstructured, you’ll need.
Manage your metadata: Enrich your raw data with a standard set of metadata. This should include information about where the data came from, when it came in and the ID of the process that loaded it.
Don’t ignore errors: Ignore error handling and logging at your peril. If you only do the bare minimum here, it becomes a nightmare to unpick things when something goes wrong.
Audit data as it comes in: Many issues can arise as data is ingested. Validate your data as it arrives, for instance by keeping records of the distinct values in specific columns. If the quality of your data is later called into question, these records come in very handy.
Data security is vital: The security of your data is a whole topic in itself. As a start, consider what needs to be encrypted, when, and how sensitive data is handled. Your ingestion routines should take out any personally identifiable information (PII) before analysis, as it’s rarely useful to analytics anyway.
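As a rough illustration of the metadata, error handling and audit points above, here is a minimal Python sketch. The function name, the underscore-prefixed field names and the sample source are all assumptions made for this example; a real routine would follow your own framework and standards.

```python
import logging
import uuid
from collections import Counter
from datetime import datetime, timezone

logger = logging.getLogger("ingest")

def ingest(rows, source, audit_column):
    """Wrap each raw row in standard metadata, log rejects, and audit one column."""
    process_id = str(uuid.uuid4())                      # one ID per ingestion run
    ingested_at = datetime.now(timezone.utc).isoformat()
    loaded, rejected = [], []
    for row in rows:
        if not isinstance(row, dict):
            # Log the failure with enough context to trace it later.
            logger.error("process %s rejected a row from %s: %r", process_id, source, row)
            rejected.append(row)
            continue
        loaded.append({
            "raw": row,                                 # the row itself is not modified
            "_source": source,
            "_ingested_at": ingested_at,
            "_process_id": process_id,
        })
    # Audit record: counts plus the distinct values seen in one column.
    # Values are converted to strings so missing entries show up as "None".
    distinct = Counter(str(item["raw"].get(audit_column)) for item in loaded)
    audit = {
        "process_id": process_id,
        "rows_loaded": len(loaded),
        "rows_rejected": len(rejected),
        "distinct_values": dict(distinct),
    }
    return loaded, rejected, audit

# Hypothetical input: three good rows and one corrupt one.
rows = [{"country": "UK"}, {"country": "UK"}, {"country": "FR"}, "corrupt line"]
loaded, rejected, audit = ingest(rows, source="crm_export.csv", audit_column="country")
print(audit)
```

Storing the audit record alongside the data means that, if a figure is questioned months later, you can show what arrived, when, and from which run.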

It’s also important to note that it’s no longer efficient to extract, transform and load data in the old way. From a trust and quality standpoint, you should not transform your data as you ingest it. Keep it in its raw format and transform it after. This way you can always go back and see exactly what it looked like when you loaded it.
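The load-raw-then-transform idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw store is a plain list, and the PII field names and the sample record are hypothetical.

```python
import copy

# Hypothetical PII field names; in practice these come from your own data classification.
PII_FIELDS = {"name", "email", "phone"}

def land_raw(store, batch):
    """Append a batch to the raw store exactly as it arrived."""
    store.extend(copy.deepcopy(batch))

def to_analytics(raw_rows):
    """Derive an analysis-ready view from raw rows without modifying them."""
    view = []
    for row in raw_rows:
        # Drop PII before analysis.
        clean = {key: value for key, value in row.items() if key not in PII_FIELDS}
        # Example transformation: tidy up country codes.
        if isinstance(clean.get("country"), str):
            clean["country"] = clean["country"].strip().upper()
        view.append(clean)
    return view

raw_store = []
land_raw(raw_store, [
    {"name": "Ada", "email": "ada@example.com", "country": " uk ", "spend": 120},
])
analytics_view = to_analytics(raw_store)
print(analytics_view)
print(raw_store[0]["country"])  # the raw value is still exactly as loaded
```

Because the transformation runs on a copy of each row, you can change or rerun it at any time and still compare the result against what was originally loaded.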

Data ingestion tools

There are a few options for different ingestion tools:

Open source: You can use open-source tools, although this might require some investment in employees with the right skills, and you’ll need to consider long-term supportability.
Buying tools: Most organisations do some of the onboarding heavy lifting with tools like Talend or Oracle Data Integrator. For specific use cases, they might then use custom-built solutions.
Custom-built: Building a tool yourself offers greater flexibility, but it’s often out of budget unless an organisation is very large.

Data onboarding and ingestion is the start

Without investment in data onboarding and storage, your data projects will falter. You want to be able to trust the data quality so that you can rely on the findings from using it. Before you start ingesting and storing data, consider your use cases.

Everything should stem from your use cases. This will tell you what data you need to collect, and the best storage solution for it. Developing good ingestion routines and data storage sets your organisation up for the future. If you want to use data, you need to trust it. That starts with data onboarding and storage.

Jason Foster — Founder & CEO — Cynozure