Thursday, July 3, 2025
mGrowTech

New open-source Machine Learning Framework written in Java

by Josh
June 16, 2025
in AI, Analytics and Automation

  • October 19, 2014
  • Vasilis Vryniotis

I am happy to announce that the Datumbox Machine Learning Framework is now open-sourced under GPL 3.0, and you can download its code from GitHub!

What is this Framework?

The Datumbox Machine Learning Framework is an open-source framework written in Java which enables the rapid development of Machine Learning models and Statistical applications. It is the code that currently powers the Datumbox API. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to handle small to medium-sized datasets. Even though the framework aims to assist the development of models in various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.

What types of models/algorithms are supported?

The framework is divided into several layers, such as Machine Learning, Statistics, Mathematics, Algorithms and Utilities. Each of them provides a series of classes that are used for training machine learning models. The two most important layers are the Statistics layer and the Machine Learning layer.

The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions and performing over 35 parametric and non-parametric tests. Such classes are usually necessary when performing exploratory data analysis, sampling and feature selection.
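To give a rough idea of the kind of functionality this layer covers, here is a minimal sketch in plain Java. It is not the framework's actual API: the class name `DescriptiveStats` and all of its methods are hypothetical and exist only for illustration. It computes a few descriptive statistics and the CDF of one common distribution (the exponential).

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the Datumbox API. */
final class DescriptiveStats {
    private DescriptiveStats() {}

    /** Arithmetic mean of the sample. */
    static double mean(double[] x) {
        if (x.length == 0) throw new IllegalArgumentException("empty sample");
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] x) {
        if (x.length < 2) throw new IllegalArgumentException("need at least two values");
        double m = mean(x);
        double squaredDeviations = 0.0;
        for (double v : x) squaredDeviations += (v - m) * (v - m);
        return squaredDeviations / (x.length - 1);
    }

    /** Median; averages the two middle values for even-sized samples. */
    static double median(double[] x) {
        if (x.length == 0) throw new IllegalArgumentException("empty sample");
        double[] sorted = x.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** CDF of the exponential distribution with rate lambda > 0. */
    static double exponentialCdf(double x, double lambda) {
        if (lambda <= 0) throw new IllegalArgumentException("lambda must be positive");
        return x <= 0 ? 0.0 : 1.0 - Math.exp(-lambda * x);
    }
}
```

For the sample {2, 4, 4, 4, 5, 5, 7, 9}, `mean` gives 5.0 and `median` gives 4.5.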

The Machine Learning layer provides classes that can be used in a large number of problems including Classification, Regression, Cluster Analysis, Topic Modeling, Dimensionality Reduction, Feature Selection, Ensemble Learning and Recommender Systems. Here are some of the supported algorithms: LDA, Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and more.
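As an illustration of one of the algorithms listed above, the following is a toy multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a sketch written from scratch for this post, assuming whitespace tokenisation. The class `TinyNaiveBayes` and its methods are hypothetical and do not reflect how the framework itself implements or exposes Naive Bayes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy classifier for illustration; not the Datumbox implementation. */
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labelled document, lower-cased and tokenised on whitespace. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log posterior under add-one smoothing. */
    String predict(String text) {
        if (totalDocs == 0) throw new IllegalStateException("no training data");
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: fraction of training documents carrying this label
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            double denominator = tokensPerLabel.getOrDefault(label, 0) + vocabulary.size();
            for (String token : text.toLowerCase().trim().split("\\s+")) {
                if (token.isEmpty()) continue;
                // log likelihood of the token given the label, smoothed by adding one
                score += Math.log((counts.getOrDefault(token, 0) + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Training it on a handful of labelled sentences and calling `predict` on a new sentence returns the most probable label. A real implementation would add proper tokenisation, feature selection and model persistence.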

Datumbox Framework VS Mahout VS Scikit-Learn

Both Mahout and Scikit-Learn are great projects, but they have completely different goals. Mahout supports only a very limited number of algorithms, namely those that can be parallelized and can thus use Hadoop's Map-Reduce framework to handle Big Data. Scikit-Learn, on the other hand, supports a large number of algorithms but can't handle huge amounts of data. Moreover, it is developed in Python, which is a great language for prototyping and Scientific Computing but not my personal favourite for software development.

The Datumbox Framework sits in the middle of the two solutions. It tries to support a large number of algorithms and it is written in Java. This means that it can be incorporated more easily into production code, it can be tweaked more easily to reduce memory consumption and it can be used in real-time systems. Finally, even though the Datumbox Framework is currently capable of handling only medium-sized datasets, I plan to expand it to handle large datasets.

How stable is it?

The early versions of the framework (up to 0.3.x) were developed in August and September of 2013 and they were written in PHP (yep!). During May and June 2014 (versions 0.4.x), the framework was rewritten in Java and enhanced with additional features. Both branches were heavily tested in commercial applications including the Datumbox API. The current version is 0.5.0 and it seems mature enough to be released as the first public alpha version of the framework. Having said that, it is important to note that some functionalities of the framework are tested more thoroughly than others. Moreover, since this version is alpha, you should expect drastic changes in future releases.

Why did I write it and why am I open-sourcing it?

My involvement with Machine Learning and NLP dates back to 2009 when I co-founded WebSEOAnalytics.com. Since then I have been developing implementations of various machine learning algorithms for various projects and applications. Unfortunately most of the original implementations were very problem-specific and could hardly be used for any other problem. In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for developing machine learning models, focusing on the area of NLP and Text Classification. My goal was to build a framework that could be reused in the future for quickly developing machine learning models, incorporated in projects that require machine learning components, or offered as a service (Machine Learning as a Service).

And here I am now, several lines of code later, open-sourcing the project. Why? The honest answer is that at this point, it is not within my plans to go through a “let’s build a new start-up” journey. At the same time I felt that keeping the code on my hard disk in case I need it in the future does not make sense. So the only logical thing to do was to open-source it. 🙂

Documentation?

If you read the previous two paragraphs, you probably saw this coming. Since the framework was not developed with sharing in mind, the documentation is poor to non-existent. Most of the classes and public methods are not properly commented and there is no document describing the architecture of the code. Fortunately all the class names are self-explanatory and the framework provides JUnit tests for every public method and algorithm, which can be used as examples of how to use the code. I hope that with the help of the community we will build proper documentation, so I am counting on you!

Current Limitations and Future Development

As with every piece of software (and especially open-source projects in alpha version), the Datumbox Machine Learning Framework comes with its own unique and adorable limitations. Let’s dig into them:

  1. Documentation: As mentioned earlier, the documentation is poor.
  2. No Multithreading: Unfortunately the framework does not currently support Multithreading. Of course we should note that not all machine learning algorithms can be parallelized.
  3. Code Examples: Since the framework has just been published, you can’t find any code examples on the web other than those provided by the framework in the form of JUnit tests.
  4. Code Structure: Creating a solid architecture for any large project is always challenging, let alone when you have to deal with Machine Learning algorithms that differ significantly (supervised learning, unsupervised learning, dimensionality reduction algorithms etc.).
  5. Model Persistence and Large Data Collections: Currently the models can be trained and stored either in files on disk or in MongoDB databases. To be able to handle large amounts of data, other solutions must be investigated. For example, MapDB seems like a good candidate for storing data and parameters while training. Moreover, it is important to remove any 3rd party libraries that currently handle the persistence of the models and develop a better, DRY and modular solution.
  6. New algorithms/tests/models: There are so many great techniques that are not currently supported (especially for time series analysis).
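
To make the file-based option in point 5 more concrete, here is a minimal sketch of storing trained parameters on disk using standard Java serialization. It is only an illustration under my own assumptions: `LinearModelParams` and `ModelStore` are hypothetical names, and this is not how the framework currently persists its models.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical holder for trained parameters; the name is illustrative only. */
final class LinearModelParams implements Serializable {
    private static final long serialVersionUID = 1L;
    final double intercept;
    final double[] weights;

    LinearModelParams(double intercept, double[] weights) {
        this.intercept = intercept;
        this.weights = weights.clone(); // defensive copy
    }

    /** Linear prediction: intercept plus the dot product of weights and features. */
    double predict(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("feature count mismatch");
        }
        double result = intercept;
        for (int i = 0; i < weights.length; i++) result += weights[i] * features[i];
        return result;
    }
}

/** Hypothetical file-backed store built on standard Java serialization. */
final class ModelStore {
    private ModelStore() {}

    /** Writes the parameters to the given file, replacing any existing content. */
    static void save(LinearModelParams params, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(params);
        }
    }

    /** Reads previously saved parameters back from the given file. */
    static LinearModelParams load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (LinearModelParams) in.readObject();
        }
    }
}
```

Keeping the storage logic behind a small class such as `ModelStore` is one way to make the backend (files, MongoDB, MapDB) swappable without touching the algorithms.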

Unfortunately all of the above is a lot of work and there is so little time. That is why, if you are interested in the project, I ask you to step forward and give me a hand with any of the above. Moreover, I would love to hear from people who have experience in open-sourcing medium to large projects and could provide any tips on how to manage them. Additionally, I would be grateful to any brave soul who would dare to look into the code and document some classes or public methods. Last but not least, if you use the framework for anything interesting, please drop me a line or share it in a blog post.

 

Finally I would like to thank my love Kyriaki for tolerating me while writing this project, my friend and super-ninja-Java-developer Eleftherios Bampaletakis for helping out with important Java issues and you for getting involved in the project. I’m looking forward to your comments.


