Why every company & group needs a central data insight system
The following is a transcript from my recent talk at PyData Berlin, lightly edited for readability.
Full disclaimer: I am co-founder and CEO of Kyso, a central hub for technical reports and our solution to the problem of undiscovered knowledge from data. While I will demo Kyso at the end and explain why we think it is the best solution, I am first going to make the general argument for why every company and research group needs a central, formalised knowledge management system.
First, a little bit about how I got involved in this type of project! I used to work in big research groups, in which the sharing of analytical reports was very messy — which, by the way, I think is only going to get even worse now that everyone is working from home.
I built quantum computers at Cambridge and Toshiba. There was a team of 20 of us making semiconductor chips. We’d make them, analyse the data, and then simulate the chips to make them more efficient. The big problem was that we had no central system for sharing and collaborating on the results.
Some of us were doing analysis in Jupyter notebooks, some in R, pasting graphs into PowerPoint, which would get presented to the boss once, subsequently get dumped into a folder and never get seen again. There was no systematic way of sharing and maintaining our projects. Because of this, our team wasn't leveraging data science efficiently.
Discovery vs Sharing
The first major distinction I have to make is between sharing and discovery. We all share analyses, typically one-to-one. We screen share, email, make presentations, discuss results on Slack, or over a coffee.
But these insights get siloed within all these different sub-groups. Alice shares with Bob, but the chain stops there. Allowing people to discover your work is really really important because someone else could benefit from these privately-shared insights, someone Alice hadn’t even thought of.
There is a huge benefit in all of these different types of stakeholders being able to discover these results and use them for their own work. Discovery is a key factor that a lot of companies and data science teams just don’t seem to understand, but it is an issue that has given rise to lots of different types of knowledge hubs, many of which you will be familiar with.
Let me now take you through two different example teams that would benefit from a centralised knowledge hub for data insights.
ACME Inc. — A Typical Online SaaS Company
ACME Inc. is an online SaaS company, running a subscription-based model, charging per seat (or user) for teams using their product. The CEO, James, wants to get a better overview of their customers’ behaviour & asks Sarah, a data scientist, to make a report.
She starts working in a Jupyter notebook, imports company data from a variety of sources like MongoDB, Google Analytics, and Mixpanel and plots out some key analytics on how users are using the product.
- She plots a simple histogram of team size vs the number of teams.
- She also graphs the number of posts (think articles on Notion or Confluence) by team vs the team size, and the same for the number of comments.
- Sarah discovers that there is a super-linear dependence of the number of posts on team size. Teams of >400 are using the platform proportionally more. The result is the same for the number of comments made.
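One way to check a claim like the super-linear dependence Sarah finds is to fit a power law, posts ≈ a · size^b, by least squares on log-log axes: an exponent b above 1 means posts grow faster than team size. The following is a minimal, standard-library-only sketch under that assumption. The function name `fit_power_law` and the sample numbers are made up for illustration and are not ACME's actual analysis.

```python
import math


def fit_power_law(sizes, posts):
    """Fit posts ~= a * size**b by least squares on log-log axes.

    Returns (a, b). An exponent b > 1 indicates super-linear growth.
    """
    if len(sizes) != len(posts) or len(sizes) < 2:
        raise ValueError("need at least two (size, posts) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(p) for p in posts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("team sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on log-log axes = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept on log-log axes = ln(a)
    return a, b


# Illustrative, made-up data: posts grow as 2 * size**1.3.
team_sizes = [5, 20, 50, 100, 400, 800]
post_counts = [2.0 * s ** 1.3 for s in team_sizes]

a, b = fit_power_law(team_sizes, post_counts)
print(f"exponent b = {b:.2f}")
print("super-linear" if b > 1 else "linear or sub-linear")
```

With real, noisy data you would also want a confidence interval on b before acting on the result.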
What happens if Sarah does not have a centralised system for sharing analyses? She just emails the report to the CEO, who reads it — and that’s it. Sarah might discuss the results with some of her immediate colleagues. The CEO might inform the board or investors. But the insights generated don’t get widely shared within the company because there is no discovery mechanism in place.
Now, what if ACME Inc. does have a central hub? Maybe they’ve even been forced to set one up due to the current crisis — everyone is working remotely. Sarah posts the report to the internal knowledge hub — Notion, Confluence, or Kyso, for example. Everyone in the company can now discover and read it.
Mary, from the product team, comes across the report. Sarah & Mary have no direct connection but Mary discovers it on the hub. And being on the product engineering team, the above dependence is really interesting to Mary and her team’s goals. Why does this relationship exist? Perhaps because, as the size of a team increases, so too does the number of connections. Or perhaps it is a political issue — the larger a company is, the more likely a VP of engineering could be pushing internal usage of the app. This insight is clearly valuable to Mary and her team.
Patrick, from the Sales & Marketing department, also discovers the report. Because Patrick has a fixed monthly marketing budget, he may now decide to focus on leads with larger teams, or those that have the potential to grow larger. Because the company charges per seat, this is clearly a relevant insight for Patrick.
How about Barbara from the infrastructure team? Maybe the cost of providing their service is not flat, and she is worried that servicing more large companies is much too expensive. So either they need to make their systems much more efficient, or she might make the argument internally to focus on smaller teams.
So we not only have a central system for sharing analyses but this system is now also driving communication between different departments. This is an example of how a typical company, by moving from no system for curation to a central place for results, notebooks, and other data assets, can change how decisions get driven within the company.
DeepTech Inc. — A Technical Research Group
The importance of sharing and discovery also applies to all-technical teams. Take DeepTech Inc., a research group that stands in for my own frustrations working within a technical team, where the lack of sharing prevented the cumulative development of models, and where a central management system could have increased the reuse of work.
On my research team at Cambridge, we made chips and shone lasers through them to excite quantum photonic states within the chips, trying to make computer chips use photons instead of electrons. We were essentially always analysing light spectra. One of the most common things we did was to fit Gaussians really well, in an automated fashion, to all the spectra we were measuring.
Imagine a team of 20 people across 10 different labs, all measuring spectra and fitting Gaussians. We all needed sophisticated analyses with good error management, so that our models could pick out Gaussians from noisy data. But the 20 of us had different ways of doing it: some of us used notebooks and Python, some used R, some Excel, with no central place to post these different data assets. No place where someone could come along, read the reports of colleagues, and reuse the work. So, after 5 years, instead of having a beautiful system in the lab for analysing data in a fully automated way, everyone was, unintentionally, keeping insights to themselves.
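To make "fitting a Gaussian to a spectrum" concrete, here is a minimal, dependency-free sketch of one simple approach: since the logarithm of a Gaussian is a parabola, a least-squares quadratic fit to ln(y) near the peak recovers amplitude, centre, and width. This is not the method our lab used. The function names, the `min_fraction` threshold, and the synthetic, noise-free spectrum are all assumptions for illustration. For genuinely noisy data with a background, a weighted nonlinear fit (for example with `scipy.optimize.curve_fit`) would usually be more robust.

```python
import math


def solve_3x3(m, v):
    """Solve the 3x3 system m @ x = v by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; not enough distinct points")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        s = sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (a[r][3] - s) / a[r][r]
    return x


def fit_gaussian(xs, ys, min_fraction=0.2):
    """Estimate (amplitude, centre, sigma) of a single Gaussian peak.

    Fits ln(y) = A + B*u + C*u**2 with u = x - mean(x), using only points
    above min_fraction of the maximum so the tails do not dominate the log.
    """
    peak = max(ys)
    if peak <= 0:
        raise ValueError("spectrum has no positive peak")
    pts = [(x, y) for x, y in zip(xs, ys) if y > min_fraction * peak]
    if len(pts) < 3:
        raise ValueError("need at least three points above the threshold")
    x0 = sum(x for x, _ in pts) / len(pts)  # centre x for better conditioning
    s = [0.0] * 5  # s[k] = sum of u**k
    t = [0.0] * 3  # t[k] = sum of u**k * ln(y)
    for x, y in pts:
        u, ln_y = x - x0, math.log(y)
        for k in range(5):
            s[k] += u ** k
        for k in range(3):
            t[k] += u ** k * ln_y
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    a, b, c = solve_3x3(m, t)
    if c >= 0:
        raise ValueError("data does not look like a peak")
    sigma = math.sqrt(-1.0 / (2.0 * c))
    centre = x0 - b / (2.0 * c)
    amplitude = math.exp(a - b * b / (4.0 * c))
    return amplitude, centre, sigma


# Synthetic, noise-free spectrum: amplitude 10, centre 905.0, sigma 1.5.
wavelengths = [900 + 0.1 * i for i in range(101)]
counts = [10.0 * math.exp(-((x - 905.0) ** 2) / (2 * 1.5 ** 2)) for x in wavelengths]

amp, centre, sigma = fit_gaussian(wavelengths, counts)
print(f"amplitude={amp:.3f} centre={centre:.3f} sigma={sigma:.3f}")
```

The point of sharing something like this on a central hub is that 20 people would be improving one routine instead of maintaining 20 private variants.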
When one of our colleagues left, I wanted to continue his project, to integrate his chip into my own system. But we could not recreate his work and had to abandon the project. We had access to his data, his notebooks, his presentations on the network drive, but they were not in any sort of structured format, just sitting in his own folder, which was a mess to other people.
Another project I was involved with comprised two teams, split between Cork, Ireland, and Cambridge, UK, making automated optical water sensors. This involved, in simple terms, running water through a microfluidic chip, contaminating the water with E. coli and other bacteria. We would shine lasers at the water and, depending on the returned spectrum, we’d be able to tell if there was or wasn’t E. coli in the water.
I handled the laser shooting and data analysis myself. We also had a PI, who oversaw the whole project, other people who’d set up the E. coli concentrations in the water, and others making the microfluidic chips.
What I realised is that, while everyone on your team may be technical (they all write and understand code & data analysis), they might not be technical in the same way.
For example, I would work on Github — and expect others to read my notebooks (reports) there. My PI was not on Github. The biologists just wanted to see the results of the experiment & not have to read my code or methods. The notebooks were only really relevant for the results & the discussion of the analysis, not necessarily for the mathematical methods.
These are just some of the reasons why, even for a technical team, setting up a central place for data analysis is really useful, because reuse of work increases, and there is full transparency of what everyone is working on. When someone leaves the team, their past work remains discoverable and reproducible on the central system. You will have no problem sifting through your old projects from a year ago.
All of the different stakeholders in different positions, using slightly different methods, are sharing and communicating their reports and projects on one unified platform.
Sharing Python Data Science Projects
I hope I have convinced you with the above, albeit very different, examples of the need for having a central system to share your data-based project reports.
The following is a general guide I’ve learned to follow when working on data analysis projects in python. Take an example project of connecting to a MongoDB database and visualising the data siloed there. There are a few major points to remember:
- Your data story goes in a Jupyter notebook.
- Make a Readme.md for install instructions. When someone comes across the Github project, you want the first thing they see to be instructions on how to set up the project, libraries to install, instructions on how to access the data, etc.
- Use Conda & environment.yml (or Docker) for reproducibility. The point here is to always use a re-runnable environment. I’ve worked on so many teams where the importance of this has not been emphasised enough, where people would just import a file, make a graph, and email it to someone. This generates an insight for a day, but in the long term it is effectively useless: nobody can reuse the project, and nobody who comes along later can tell in a systematic way what insights were gained.
- Push your work to Github (or another VCS) for version control.
- Post your work to a common knowledge hub for discovery, learning, and collaboration. There are lots of ways to do this.
* Generally speaking, notebooks on Github alone are not enough. They often don’t render well there, especially if the notebook is large or if there are interactive graphics.
* You can, for example, write an article about the insights in Notion or Confluence, and just link to the Github repository.
* You can set up Airbnb’s open-sourced Knowledge Repo.
* Or you can sync your repositories to Kyso.
Kyso lets you post all your notebooks to one central dashboard, where they render as blog posts. You can set up connections to Github for all your different data-based projects.
Once you’ve signed up the rest of your team, you now have an internal data blog of sorts, where everyone can read your notebook analyses in the form of reports, no longer limited to technical teams only like with projects sitting on Github. There is no extra time spent writing up Markdown articles in Notion & pasting in graphics. You can do literate programming where you tell the story about the data in the notebook itself, creating a story as you go essentially, and the report is rendered automatically. All updates to existing reports that are synced with Kyso are also auto-updated.
A common workflow for larger teams with many different projects is to have different sub-directories in one data-analysis repository because there is so much library re-use across these projects. So they’ll have a common data extractor file — connecting to the various data sources — in the root folder, with the various projects coming through as separate reports on the team’s Kyso dashboard.
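A shared extractor file of the kind described above might look like the following minimal sketch. The file name `extractors.py`, the function names, the `acme` database, and the `team_size` / `n_posts` fields are all hypothetical. The only real API assumed is pymongo's `MongoClient` and `Collection.find`. The pymongo import is deferred so the helper can be exercised against an in-memory stand-in without a live database.

```python
"""extractors.py: hypothetical shared data-access helpers kept in the repo root."""
import os


def get_collection(name, db_name="acme"):
    """Return a MongoDB collection; the URI is read from the MONGO_URI env var."""
    from pymongo import MongoClient  # lazy import: requires pymongo to be installed

    uri = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
    return MongoClient(uri)[db_name][name]


def team_stats(collection):
    """Return [(team_size, n_posts), ...] from any object exposing .find().

    Documents missing either field are skipped rather than guessed at.
    """
    projection = {"_id": 0, "team_size": 1, "n_posts": 1}
    rows = []
    for doc in collection.find({}, projection):
        if "team_size" in doc and "n_posts" in doc:
            rows.append((doc["team_size"], doc["n_posts"]))
    return rows


class FakeCollection:
    """In-memory stand-in so the helper can be tried without a database."""

    def __init__(self, docs):
        self._docs = docs

    def find(self, query, projection=None):
        return iter(self._docs)


demo = FakeCollection([
    {"team_size": 5, "n_posts": 12},
    {"team_size": 400, "n_posts": 4100},
    {"team_size": 30},  # incomplete record, skipped
])
print(team_stats(demo))
```

Each project sub-directory's notebook would then import from this one module rather than re-implementing the connection logic, which is what makes the library re-use across reports possible.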
Note that this is all free to get started individually and it’s also free for smaller teams. So give it a go — you might find a lot of benefits for your company or group.
So in this talk, I’ve touched on the following points:
- The difference between discovery and sharing. Don’t do one-to-one sharing. Democratise access to data-based reporting. You will find huge productivity benefits when there is full visibility into what everyone is working on.
- I discussed two different examples:
* ACME Inc., a company with different stakeholders across the entire organization and why having a central knowledge management system for analytics can drive learning, productivity, and growth.
* DeepTech Inc., a research company, and how sharing accelerated development for an all-technical team.
- A simple guide to sharing python data-science projects in a systematic way.
Again, I do hope I have convinced you to set up some kind of central system for analysis and results sharing (not necessarily the solution that we’re building at Kyso). However, if you are interested in trying it out & are unsure about anything, feel free to reach out to us directly for a discussion on your team’s specific use-cases.
Feel free to check out the full talk here: