Meet Bruno, Datavillage's Data Engineer

Datavillage Oct 26, 2021

Can you please introduce yourself and tell us more about your background?

My name is Bruno Coussement, data engineer and avid road cyclist. I did my master's in mathematical engineering and computational sciences. During that time, I also co-founded a student data science organisation (emergentleuven.be), back when it was about to become cool (2015).

After that I stayed in the academic world and did two years of fundamental ML research on tensor-based methods at KU Leuven. The only part I still use today is how to create, train and tune ML models in a robust, fast and scalable way. At that point I wanted to step out of the comfort of the academic bubble and gain hands-on experience actually using ML models in the real world. This realisation led me to pivot towards data engineering consulting at Data Minded.

During those three years, I experienced first-hand how hard it is to industrialise data products across different industries, organisation sizes and “data maturities”. I was also involved in data privacy projects where the idea was to use personally identifiable information without divulging it. We ran pilots with synthetic data generation (see https://edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en) and differential privacy (https://towardsdatascience.com/understanding-differential-privacy-85ce191e198a), but none of these tools ever seemed to strike a usable balance (see a more technical post I wrote about it: https://medium.com/datamindedbe/how-to-share-tabular-data-in-a-privacy-preserving-way-c72a59c7602f). There should be another way, I thought.

Why have you chosen to work with Datavillage?

I learned about Datavillage while randomly browsing De Tijd, a Belgian newspaper. The following question came to my mind: are they doing something fundamentally different from the approaches I had already played with? After a first virtual meeting with Quentin and Frederic, the founders, the answer was a clear yes!

Rather than dealing with all the increasingly stringent regulatory compliance challenges associated with centralising personal data, let's keep the data in an environment that the user (or consumer) owns and controls. Even better, this decentralisation of data control should also incentivise organisations to come up with better data solutions than they do today. Give end-users an actual reason to grant access to their precious data.

Personally, the intriguing technical challenge I signed up for is how organisations will provide improved data solutions without ever seeing the data. Secondly, I wanted to gain more experience in actual product development instead of more consulting.

Finally, if you believe the proverb “you become your environment”, then it is important to look at who you will work with. Everyone at Datavillage brings experience to the table that I wish to learn from.

What is your role at Datavillage?

My exact role as a data engineer is broad.

My main goal is to increase the productivity and speed with which new features are embedded in our platform. As mentioned above, bringing a data solution to production easily takes a year when doing it for the first time. But with the right set of practices, the time before going live can be reduced to a few weeks, without increasing headcount or technical debt, freeing up my teammates' or a client data team's time to work on the next project. The key here is to automate everything, offer self-service components, and streamline processes. Engineers should go from building everything from scratch to just linking components together, as sketched below.
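To make the idea of “linking components together” concrete, here is a minimal sketch in Python. The component names and the pipeline runner are purely hypothetical illustrations, not Datavillage's actual platform code; the point is simply that reusable steps are declared once and then chained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """A reusable, self-service pipeline component."""
    name: str
    run: Callable[[Dict], Dict]  # each component takes and returns a context dict


def ingest(ctx: Dict) -> Dict:
    ctx["raw"] = ["event-1", "event-2"]  # placeholder for a real connector
    return ctx


def validate(ctx: Dict) -> Dict:
    assert all(isinstance(e, str) for e in ctx["raw"])  # basic integrity check
    return ctx


def publish(ctx: Dict) -> Dict:
    ctx["published"] = len(ctx["raw"])  # placeholder for a real sink
    return ctx


def run_pipeline(steps: List[Step]) -> Dict:
    ctx: Dict = {}
    for step in steps:
        ctx = step.run(ctx)  # building a data product becomes chaining components
    return ctx


if __name__ == "__main__":
    result = run_pipeline([Step("ingest", ingest),
                           Step("validate", validate),
                           Step("publish", publish)])
    print(result["published"])  # 2
```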

In parallel, I enforce security best practices and enhance reliability to ensure continuous operation of our solutions. Many data solutions are shut down after a while because they no longer provide value and no one can figure out why that is the case or how to fix them. Having continuous monitoring and product efficacy testing integrated into our workflow should help prevent that.

Thirdly, all of this exhaustive automation, testing and embedding of reusable components should reduce the risk of using our platform. For example, continuous testing reduces the impact of manual errors. Reusable components should have a clear structure, usage and risk considerations, lowering the probability of error and making upgrades less daunting.

As a side effect of doing all the things mentioned above, my hope is to attract and retain good people on our team. Most technical talent gets excited about doing cutting-edge work with the best tools, tools that let them focus on challenging problems and see the impact of their work in production. Without a robust platform and practices, top tech talent will quickly become frustrated by working on transactional tasks (for instance, submitting IT tickets to obtain X or Y, data cleansing, data integrity) and not seeing their work have a tangible business impact.

A technical contribution I would be proud of is making confidential computing an actual verifiable reality. This is the flow in which, assuming the end-user has given consent, the algorithm of a client is unleashed on the personal data. Contrary to popular belief, this is not complicated, and for transparency's sake it should not be. The devil here is in the implementation details. A balance must be found between cost, flexibility and manageability without compromising on transparency, verifiability and consistency. A good old engineering challenge!
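As a rough illustration of that flow (not Datavillage's actual implementation; the function names, consent store and attestation check below are hypothetical), the sequence is roughly: verify consent, verify that the code about to run matches the audited version, run the client's algorithm inside the confidential environment, and only release the result.

```python
def check_consent(user_id: str, purpose: str) -> bool:
    """Has this end-user consented to this specific use of their data?"""
    consents = {("alice", "music-recommendation")}  # placeholder consent store
    return (user_id, purpose) in consents


def attest_enclave(measurement: str, expected: str) -> bool:
    """Verify that the code about to run is exactly the code that was audited."""
    return measurement == expected


def confidential_run(user_id, purpose, algorithm, personal_data,
                     measurement, expected_measurement):
    if not check_consent(user_id, purpose):
        raise PermissionError("no consent for this purpose")
    if not attest_enclave(measurement, expected_measurement):
        raise RuntimeError("enclave code does not match the audited version")
    # In a real confidential-computing environment, only this result would
    # ever leave; the raw personal data stays invisible to the client.
    return algorithm(personal_data)


result = confidential_run(
    user_id="alice",
    purpose="music-recommendation",
    algorithm=lambda d: {"top_genre": max(d["listens"], key=d["listens"].get)},
    personal_data={"listens": {"jazz": 40, "rock": 12}},
    measurement="sha256:abc123",
    expected_measurement="sha256:abc123",
)
print(result)  # {'top_genre': 'jazz'}
```

In a real deployment the attestation would come from the hardware and the algorithm would run inside an enclave; the sketch only shows the ordering of the checks.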

All of those efforts should ultimately result in a solid tool that Datavillage's customers and end-users see clear value in using in the nearing world of decentralised data.

Give us your predictions: what are the challenges of today, and what do you think technology will look like in the next 10 years?

Here is my abstract answer:

Today we still need to truly grasp the specific set of challenges associated with globally centralised organisations like GAFA, some NGOs, but also large nation states. They are omnipresent, super powerful, and have private interests not aligned with society's. The thing is that it is in their own interest to grow this large; the rules of the game are such that this form currently tends to survive best. But faced with more and more complex global threats (just think of climate change for a minute), the economics of decentralisation is naturally gaining pertinence. Add political will to push regulations towards decentralisation, and slowly but steadily we find ourselves in a world that has a better shot at tackling these issues head on.

Secondly, we humans have always looked at nature to inspire our designs. The concept of emergent behavior has proven its merit: combining many simple systems generally works better than building one complex, holistic one.

What does this mean technology-wise? Any technology that embraces the concept of decentralisation is, in my eyes, future-proof, assuming the rules of the world change.

Here is my less abstract but shorter answer:
More systems will be able to talk to each other ever more easily, resulting in a whole new breed of technologies.

According to you, what is the number one challenge of digitized organizations today?

A big challenge today is that a lot of small and medium-sized companies are still in the discovery phase of becoming what I call “data literate”. Just look at the increasing number of open positions for data analysts, scientists, engineers, translators and chief data officers. Companies often struggle to fill those positions. On top of that, many early endeavours fail for a plethora of reasons. Any tool or service that helps organizations mature faster and demonstrates actual measurable value has a strong business case.

Mature organisations face a different set of problems. I am talking about pains they only feel after a certain amount of time:

  • their engineering team has grown from 20 people to hundreds or thousands,

  • governance of who can access what data becomes harder to do,

  • the types of use cases become more diverse (for example, processing a stream of user events suddenly becomes something you need to do across 20 projects),

  • they cannot pretend data-privacy regulations don’t exist anymore,

  • the overall number of tools used grows exponentially.

This results in the development of a niche set of tools to solve exactly those problems. So any tool that brings standardisation, better observability, higher reliability or reduced liability has a good chance of being adopted by large organizations.
