Every time I see a post where commits are taken as contribution metrics, I remember when, after working on it for months, I merged Redis Cluster into Redis as a single commit, and saw the pale green square appear for that day in my GitHub contributions chart. For 1.5 months now I've been working 10h/day on Redis Vector Sets, and that will also end up as a single commit. It's very simple to do better than that as a metric: different developers have different habits. For me, a stream of early-stage design changes just pollutes the contribution history. Starting from a solid and advanced beta, then yes, history is great to have.
I think the authors agree with you. They tried to look at lines of code added/deleted (e.g. "they consistently made over 95% of the lines added to and deleted from Elasticsearch"), although the language in the article flip-flops between that and just saying 'commits', so it's not clear what they were actually looking at for the write-up. In their scraping code/dataset linked at the start of the article, they are logging `commits_list = [commit_date, dels, adds, oid, author]`.
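For anyone who wants to sanity-check that kind of number on a local clone, here is a minimal sketch of computing per-author line shares; this is my own illustration built on `git log --numstat`, not the authors' actual scraping code, though it mirrors the fields in their `commits_list`:

```python
import subprocess
from collections import defaultdict

def author_line_shares(repo_path):
    """Fraction of lines added+deleted attributable to each author."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=@%an"],
        capture_output=True, text=True, check=True,
    ).stdout
    totals = defaultdict(int)
    author = None
    for line in log.splitlines():
        if line.startswith("@"):          # author marker from --format
            author = line[1:]
        elif line.strip():                # numstat line: "adds<TAB>dels<TAB>path"
            adds, dels, _path = line.split("\t", 2)
            if adds != "-":               # binary files report "-"
                totals[author] += int(adds) + int(dels)
    grand_total = sum(totals.values()) or 1
    return {a: n / grand_total for a, n in totals.items()}

# Top 5 contributors by share of lines changed:
# sorted(author_line_shares("elasticsearch").items(), key=lambda kv: -kv[1])[:5]
```

A share above 0.95 across one company's employees would correspond to the claim quoted above (modulo mapping authors to employers, which is the hard part).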
This is also just a blog summary of a preliminary study:
> "This is the first step in a much larger research project underway [...] we’re working toward including more repositories and additional metrics to better understand the project health dynamics within these projects."
Project activity will remain inherently fuzzy. Just about everybody who programs extensively has, at some point in their life, spent a couple of days changing a line or two of code. No metric can capture that unless we are all journaling and publishing our life activities.
Nonetheless, we can do better than commits, as you said. If you review almost anything online, there is a global score and then 3-5 categories with subscores. Surely the same should be true here: freshness of LOC changes, average freshness of the overall codebase as a percentage, issues satisfactorily resolved (not closed because they were blown off, which should be a negative indicator), merged pull requests, to name a few offhand.
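To make the shape of that concrete, here is a toy sketch of a global score built from category subscores; every category name and weight below is hypothetical, purely to illustrate the idea:

```python
# Hypothetical categories and weights; each subscore is normalized to 0..1.
WEIGHTS = {
    "loc_freshness": 0.25,        # how recent the changed lines are
    "codebase_freshness": 0.15,   # % of the codebase touched in recent years
    "issues_resolved": 0.30,      # satisfactorily resolved; blown-off
                                  # closures would be penalized upstream
    "merged_prs": 0.20,
    "contributor_diversity": 0.10,
}

def health_score(subscores):
    """Weighted global score from per-category subscores in [0, 1]."""
    return sum(w * subscores.get(cat, 0.0) for cat, w in WEIGHTS.items())

print(health_score({"loc_freshness": 0.8, "codebase_freshness": 0.4,
                    "issues_resolved": 0.6, "merged_prs": 0.9,
                    "contributor_diversity": 0.5}))  # 0.67
```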
What would be your top 5 categories to evaluate the "health" of a code base, admitting that any evaluation will remain a very fuzzy approximation at best?
The data about Redis can be true only if they mean "commits". This is why I believe they checked GitHub contribution numbers.
In that case they did not evaluate it with enough care, given they gathered more information than that. Hopefully they correct that as they progress.
I am quite curious as to your take on a few metrics that would help evaluate the health of a code base. It's a dirty job, but we all have to do it every time we look for something new.
I once worked for a smallish (~50 people) company with a huge, unmaintainable legacy codebase.
Said company was bought by a large US company where one of the key metrics for a developer was the number of new lines of code.
It went downhill from there. 10% of people were fired when the mothership instituted job cuts globally, then people started leaving, then came another round of cuts, then most people left, then the company was sold, I think losing a fairly hefty part of its valuation.
Eventually the large US company was bought by Oracle, which to my eye indicates Oracle is like MS: they have a single product, a massive cash cow, and for the rest they serially make terrible decisions (à la Nokia et al.).
The topic is indeed very interesting, but before studying commit author diversity it would be useful to understand the volume and traction. Statistically, most forks are dead ends, even if maintained by a few enthusiasts for some time.
I'm sure OpenSearch won't die while it's a commercial offering of AWS, but how is it going? Are new features coming, does a product roadmap exist? Or is it mainly bugfixes and maintenance? And what about OpenTofu?
Even something basic like a graph of LOC changed over time, with the fork marked in the middle, would help put the article into perspective.
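Such a graph is cheap to produce from a local clone of each repo. A rough sketch, assuming matplotlib is installed; the repo path and fork month below are placeholders:

```python
import subprocess
from collections import Counter

import matplotlib.pyplot as plt

def monthly_loc_changed(repo_path):
    """Lines added+deleted per month, via `git log --numstat`."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=@%cs"],
        capture_output=True, text=True, check=True,
    ).stdout
    buckets = Counter()
    month = None
    for line in log.splitlines():
        if line.startswith("@"):
            month = line[1:8]             # "YYYY-MM" from the commit date
        elif line.strip():
            adds, dels, _path = line.split("\t", 2)
            if adds != "-":               # skip binary files
                buckets[month] += int(adds) + int(dels)
    return dict(sorted(buckets.items()))

series = monthly_loc_changed("elasticsearch")    # placeholder path to a clone
plt.plot(list(series), list(series.values()))
plt.axvline("2021-04", linestyle="--", label="fork")  # fork month, if present in history
plt.ylabel("LOC changed per month")
plt.legend()
plt.show()
```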
OpenTofu is doing really well, I'd say, and only picking up steam as it goes.
Product roadmap-wise, the team has made some big improvements that have been requested by the community for years, with another big release coming very soon (I believe next week or the one after). Here are some of the major ones:
- End-to-End State Encryption - lets you encrypt your state-file end-to-end, either with a key management system like AWS KMS, or static keys.
- Early Evaluation - the ability to parameterize initialization-time values, like module versions and sources, backend configuration parameters, etc., and keep them DRY.
- (Coming in 1.9) - provider iteration, which lets you use for_each with providers, e.g. create one provider per region, something that currently requires a bunch of copy-paste, or tools like Terragrunt
- (Coming in 1.9) - -exclude flag, which is the opposite of the -target flag, letting you skip planning/applying certain resources.
Probably the best way to get a summary is to check out the release blog posts for 1.7[0], 1.8[1], and 1.9-beta[2]. Many of those features required non-trivial changes to existing parts of the codebase.
One of the biggest Terraform contributors also joined Spacelift a couple of months ago to work on OpenTofu. All things considered, I'm very confident that the team will be able to handle any feature it sets its mind to, and that those improvements will keep coming. There's a ranking of top-voted issues, which is probably the best way to loosely see what will be tackled next[3].
[0]: https://opentofu.org/blog/opentofu-1-7-0/
[1]: https://opentofu.org/blog/opentofu-1-8-0/
[2]: https://opentofu.org/blog/opentofu-1-9-0-beta1/
[3]: https://github.com/opentofu/opentofu/issues/1496
Disclaimer: I am involved in the OpenTofu project and was previously its tech lead.
Provider iteration is a really nice one. I had a big monorepo that would deploy some baseline services into many AWS accounts, across multiple regions, generating tf.json files for each provider to match all the accounts that were created.
However, what really broke this model at some point was the fact that we were running so many provider instances that our Terraform Cloud would run out of memory! Since each provider instance in tf really launches a new process, it adds up... At some point I was thinking that, since the engine and the providers use gRPC to communicate, it MAY be possible to distribute providers across machines, but I never investigated further... I'm pretty sure there was a notice in the tf plugin SDK stating that it was not possible to connect them over a network... but why not? ¯\_(ツ)_/¯
Yeah, esp. the AWS provider is pretty memory-intensive.
I believe someone on the team did some investigation into this (running providers remotely) but it's not really a priority (if it is for you, feel free to voice that on the issue tracker!).
Frankly though, with pricing for cloud instances being generally linear with respect to the CPU/memory size of the instance, I don't think there's much reason to prefer many smaller machines over just using a single larger one and avoiding all this added complexity.
We're just migrating to OpenSearch from Xapian, and this was one of the questions we had. We didn't want to go from a solution in maintenance mode to one that could be a dead end soon.
From what we've seen, there is lots of active development going on, with many features being added.
But we don’t yet use any fancy features and could easily switch to ElasticSearch if need be. So we've got a backup plan.
> But we don’t yet use any fancy features and could easily switch to ElasticSearch if need be.
I think the decision paralysis is the big deal here. I'm in the exact same situation with OpenTofu and Terraform: they're diverging rapidly, yet it's not entirely clear which way the wind is blowing. They both now have compelling and interesting features that the other doesn't have.
So the outcome is that I'm now not using any new features.
I think many people have been in this situation.
In practice, and I'm extremely biased here, I'd consider the most risk-averse option to be going with OpenTofu but not using any of its exclusive new features. With this you get dependency updates and the widest competitive range of vendors in case you ever want to use a commercial orchestrator service for it.
However, it seems to me that folks at companies of all sizes are increasingly deciding to bite the bullet and migrate, especially since the last release a couple of months ago. E.g. see the talk by Fidelity[0] from OpenTofu Day at KubeCon.
[0]: https://youtu.be/7Ypulc2GyoE
Disclaimer: I am involved in the OpenTofu project and was previously its tech lead.
I'm scared of people measuring code in "lines added" and "lines deleted". Tbh, sometimes a good fix removes 10 lines and adds one. I can also imagine that all of the merge requests are approved by the company "owning" the open-source project, hence after a rebase the author will always be from this company...
I understand that the companies probably did the majority of the work, but I can't quite put my finger on this comparison... It sounds strange and inaccurate.
As with many more modern "open source" projects, the openness is more of a "you can try it for free, and once you need a production-level SLA, come call us". While the source code is available to look at, it's more of a facade, as almost no one changes it but the owning company, limited by the complexity of the project, maintainer politics, and the IP around supporting tools and materials.
OSS has basically Ship-of-Theseus'd into something completely different. Not a criticism, just an observation.
One key element that is often overlooked, but mentioned in the recent post from FusionAuth, is the business continuity aspect: if your provider suddenly shuts down but the product was open source and self-hostable, you can pay someone else to keep it working while you work on a migration plan on your own timeline.
The more open the license, the more options available.
Also, from a license-negotiation perspective, it gives the buying company the option to threaten to self-host and/or fork. Even if they never do (and I'm sure the source company is very careful to balance the value story), it can act as a ceiling on rate increases or other annoying business practices.
Why would a company ever open-source their product, then? Because giving up that complete leverage can be a selling point during the purchasing process, making buyers more comfortable that they won't be (completely) locked in, and can be a net positive for revenue through faster sales.
Hm, I was expecting more of a business point of view in this article. Right now we are looking for information about the financial results of relicensing open source; unfortunately, this is about repository health instead. But the article is still interesting.
There's some more discussion at https://thenewstack.io/why-open-source-forking-is-a-hot-butt... that may be interesting.
Are you in the UK, by any chance? I'm sure OpenUK would be interested in chatting more (given they've been working on research and impact analysis in this area).
An equally important question is how much of the lack of organizational diversity in forked projects was due to constraints/roadblocks to contributions imposed by the controlling companies.
> It is still too early to understand the ultimate success or failure of these projects — both the original and the fork.
Mm. This is more “here are some projects and their forks” than “what happens to…”
I.e., TL;DR: they're both going fine in all cases, so far.
Guess we wait and see eh?
Related: "Fear of Forking" by Rick Moen
http://linuxmafia.com/faq/Licensing_and_Law/forking.html
> That's why forking is uncommon in open-source code, and even more so in (specifically) GPLed code: The improvements one group makes in its would-be "fork" are freely available to the main community.
Unfortunately, in the smartphone world this just isn't reality. Trying to obtain code dumps is hard enough for major brands, outright impossible for the myriad of cheap clones. And embedded is even worse, almost no one cares about distributing the GPL code of the BSP, mainly due to fear of violating chipset vendor NDAs.
> almost no one cares about distributing the GPL code of the BSP, mainly due to fear of violating chipset vendor NDAs
The main problem with BSP programming is that it is really complicated: you need to fit within the limits of the hardware, and you need deep knowledge of the DSP environment.
So it is a very interesting question who will do such complicated things for free, and who will dive that deep for free.
Unfortunately, too many people compare apples with carrots: in this case, relatively shallow frontend/full-stack programming versus hardcore embedded work.
And returning to the question: in real life, nobody wants to rewrite all the core BSP code; instead they reuse huge chunks of ready-made code provided by the vendor, so naturally they end up tightly coupled to the vendor's copyrights.
Sometimes things are even worse: hardware limitations simply make it impossible to write the code any other way than the vendor did.
Another problem is regulations: by using vendor code you automatically comply (or, to be honest, you shift the responsibility to the vendor), whereas if you write your own code from scratch, you will somehow need to prove that people can trust it.
> Trying to obtain code dumps is hard enough for major brands, outright impossible for the myriad of cheap clones. And embedded is even worse, almost no one cares about distributing the GPL code of the BSP, mainly due to fear of violating chipset vendor NDAs.
Sounds like an opportunity for the copyright holders to make some money by suing and dual licensing.
What money? You get to spend millions litigating the matter, and the end result is a tarball with a couple of new drivers. They do a board rev in six months, forget their obligations, and you're back where you started.
> The improvements one group makes in its would-be "fork" are freely available to the main community.
IANAL, but there's a caveat here: a lot of these forks are due to companies relicensing to source-available licenses, which generally means they required a CLA (and a full copyright license) from each of their contributors so that they can relicense the codebase at will.
The code committed to the fork can't be pulled into the relicensed project in this case, unless it's the original contributor making the contribution to both, because such code is only covered by the fork's license, not by the new license or the CLA.
My biggest gripes are U-Boot and the Linux kernel. Both are clearly GPL-only, so as a vendor you must provide the source code for your modifications, including drivers, and yet so many vendors make you jump through hoops that it's not even close to funny any more. Or they don't fulfill their obligations at all.
Only for a short time are improvements in one available to the other. Then the two diverge and changes cannot merge. KHTML couldn't bring in any changes from Apple's fork.
The diversity measures used in this study are a fascinating window into some unusually measurable communities. The count of contributors and the volume of their contributions are proxies for many things, including project popularity, ease of contribution, number of approachable fixes (tantamount to how many simple/low-hanging bugs there are), and diversity of use cases exposing the product to new situations (i.e. potential growth of project scope).

These things need to align for diversity of contributors' motivations to arise and for contributors to approach the project initially, but different things are needed to sustain involvement: introduction of new bugs, need for completely new features (i.e. a growing project scope), continuing need for refinement of otherwise battle-tested code (i.e. performance gains remaining on the table), and continued relevance as alternative packages and paradigms rise and fall.

I can't wait to read their future work, and I hope it includes measures of project maturity (in the senses of feature-completeness/code quality, as well as whether functional scope is growing or not). Surely there are projects that lack contributors for the simple reason that they are "done"; surely there are projects where engagement looks like disproportionately many shallow contributions due to immaturity of the product; and surely there are projects with wider or shallower pools of scope to draw from (as well as management ethoses that readily take on new scope, or are avoidant of it).
Some opposing examples are the Linux kernel (eternally growing scope, with huge motivation from many user communities) and libpng (relatively fixed in scope, where desiderata like security raise the bar for contributions to an already mature and popular product).
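To make the measurement side concrete: contributor diversity of this kind is often summarized with a concentration index (HHI) or a normalized entropy over per-contributor contribution shares. A small sketch of both follows; this is my own illustration, not necessarily the metric the study computes:

```python
import math

def diversity_indices(contributions):
    """Contribution counts -> (HHI concentration, normalized Shannon entropy).

    `contributions` maps contributor -> commit count or LOC changed.
    HHI near 1 means one contributor dominates; normalized entropy near 1
    means contributions are spread evenly across contributors.
    """
    total = sum(contributions.values())
    shares = [c / total for c in contributions.values() if c > 0]
    hhi = sum(s * s for s in shares)
    entropy = -sum(s * math.log(s) for s in shares)
    max_entropy = math.log(len(shares)) if len(shares) > 1 else 1.0
    return hhi, entropy / max_entropy

# One dominant vendor vs. a balanced community:
# diversity_indices({"vendor": 95, "a": 3, "b": 2})  -> (~0.90, ~0.21)
# diversity_indices({"a": 30, "b": 35, "c": 35})     -> (~0.34, ~1.00)
```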