Python Data Wrangling: Preparing for the Future | Wes McKinney
Given at PyCon HK on October 29, 2016. About open source work in progress to advance the Python pandas project internals and leverage synergies with other efforts in OSS data technology
PyMADlib - A Python wrapper for MADlib: in-database, parallel, machine learn... | Srivatsan Ramanujam
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Strata NY 2016: The future of column-oriented data processing with Arrow and ... | Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
Semantic Integration with Apache Jena and Stanbol | All Things Open
All Things Open 2014 - Day 1
Wednesday, October 22nd, 2014
Phillip Rhodes
Founder & President of Fogbeam Labs
Big Data
Semantic Integration with Apache Jena and Stanbol
Data Science at Scale on MPP databases - Use Cases & Open Source Tools | Esther Vasiete
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
4th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Learn how to visualize graphs – a powerful, intuitive way to interact with data. Using open source tools like Cytoscape or third party tools, you have several choices on how to visualize and interact with graphs from Oracle Database and big data platforms. Albert Godfrind (EMEA Solutions Architect) and Gabriela Montiel-Moreno (Software Development Manager) share all you need to get started, with detailed demos using a banking customer data set.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p... | Jean Ihm
2nd in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
With property graphs in Oracle Database, you can perform powerful analysis on big data such as social networks, financial transactions, sensor networks, and more.
To use property graphs, first, you’ll need a graph model. For a new user, modeling and generating a suitable graph for an application domain can be a challenge. This month, we’ll describe key steps required to construct a meaningful graph, and offer a few tips on validating the generated graph.
Albert Godfrind (EMEA Solutions Architect), Zhe Wu (Architect), and Jean Ihm (Product Manager) walk you through, and take your questions.
Build Knowledge Graphs with Oracle RDF to Extract More Value from Your Data | Jean Ihm
AnD Summit '19 slides - Souri Das, Matthew Perry, Melli Annamalai. This presentation covers knowledge graphs built using the RDF capabilities of Oracle Spatial and Graph. We will illustrate how to define a knowledge graph, create virtual or materialized graphs from existing data (relational tables, CSV files, etc.), derive new knowledge through logical inference, navigate and query graphs using W3C standards, analyze knowledge graphs with graph algorithms, and more. Real-world use cases from various industries will also be shared.
NumPy Roadmap presentation at NumFOCUS Forum | Ralf Gommers
This presentation is an attempt to summarize the NumPy roadmap and both technical and non-technical ideas for the next 1-2 years, for users who heavily rely on NumPy as well as for potential funders.
This presentation describes some of the open source AI projects we are working on at the Center for Open Source, Data and AI Technologies (CODAIT), including the Model Asset Exchange (MAX), Fabric for Deep Learning (FfDL), and Jupyter Enterprise Gateway.
Talk given at the first OmniSci user conference, where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I also get a chance to introduce OpenTeams in this talk and discuss how it can help companies cooperate with communities.
SpagoBI 5 Demo Day and Workshop: Technology Applications and Uses | SpagoWorld
These slides supported SpagoBI Labs' presentation of SpagoBI 5 ("Technology Applications and Uses" session), taking place in New York, NY on January 26th, and in Herndon, VA on January 28th, 2015. Further details on the event: http://bit.ly/1IzatIX
Apache AGE and the synergy effect in the combination of Postgres and NoSQL | EDB
In this session, we will introduce Apache AGE and the synergy that comes from combining Postgres with a NoSQL graph database. We will discuss the story and background of Apache AGE as an open-source project and introduce the challenges that AGE can solve for its users. We will also present the graph database as an extension to PostgreSQL: it supports all the functionality and features of PostgreSQL while offering a graph model in addition. Finally, we will discuss how users with a relational background who need a graph model on top of their existing relational model can adopt this extension with minimal effort, since existing data can be used without migration.
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams | Oracle Developers
GraphPipe is an open source protocol and collection of software designed to simplify machine learning model deployment and decouple it from framework-specific model implementations.
The common perception of applying deep learning is that you take an open source or research model, train it on raw data, and deploy the result as a fully self-contained artefact. The reality is far more complex.
For the training phase, users face an array of challenges, including handling varied deep learning frameworks, hardware requirements, and configurations, not to mention code quality, consistency, and packaging. For the deployment phase, they face another set of challenges, including custom requirements for data pre- and post-processing, inconsistencies across frameworks, and a lack of standardization in serving APIs.
The goal of the IBM Developer Model Asset eXchange (MAX) is to remove these barriers to entry for developers to obtain, train, and deploy open source deep learning models for their business applications. In building the exchange, we encountered all these challenges and more.
For the training phase, we leverage the Fabric for Deep Learning (FfDL), an open source project providing framework-independent training of deep learning models on Kubernetes. For the deployment phase, MAX provides standardized container-based, fully self-contained model artifacts encompassing the end-to-end deep learning predictive pipeline.
28 March 2024 - Codeless Generative AI Pipelines
https://www.meetup.com/futureofdata-princeton/events/299440871/
https://www.meetup.com/real-time-analytics-meetup-ny/events/299290822/
***** Note *****
The event is seat-limited, so please complete your registration here. Only people who complete the form will be able to attend.
-----------------------
We're excited to invite you to join us in-person, for a Real-Time Analytics exploration!
Join us for an evening of insights and networking as we delve into the OSS technologies shaping the field!
Agenda:
05:30-06:00: Pizza and friends
06:00-06:40: Codeless GenAI Pipelines with Flink, Kafka, NiFi
06:40-07:20: Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders
07:20-07:30: Q&A
Codeless GenAI Pipelines with Flink, Kafka, NiFi | Tim Spann, Cloudera
Explore the power of real-time streaming with GenAI using Apache NiFi. Learn how NiFi simplifies data engineering workflows, allowing you to focus on creativity over technical complexities. I'll guide you through practical examples, showcasing NiFi's automation impact from ingestion to delivery. Whether you're a seasoned data engineer or new to GenAI, this talk offers valuable insights into optimizing workflows. Join us to unlock the potential of real-time streaming and witness how NiFi makes data engineering a breeze for GenAI applications!
Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders | Viktor Gamov, StarTree
Explore how industry leaders like LinkedIn, Uber Eats, and Stripe are mastering real-time data with Viktor as your guide. Discover how Apache Pinot transforms data into actionable insights instantly. Viktor will showcase Pinot's features, including the Star-Tree Index, and explain why it's a game-changer in data strategy. This session is for everyone, from data geeks to business gurus, eager to uncover the future of tech. Join us and be wowed by the power of real-time analytics with Apache Pinot!
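The Star-Tree Index mentioned above works by pre-aggregating metrics across combinations of dimensions, so group-by queries can be answered from small pre-computed tables instead of scanning raw records. Here is a minimal, illustrative Python sketch of that rollup idea (not Pinot's actual data structure; the rows, dimensions, and metric names are made up):

```python
from itertools import combinations
from collections import defaultdict

def build_rollups(rows, dimensions, metric):
    """Pre-aggregate `metric` for every subset of `dimensions`.

    Returns {dims_tuple: {dim_values_tuple: aggregated_metric}}.
    """
    rollups = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            agg = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                agg[key] += row[metric]
            rollups[dims] = dict(agg)
    return rollups

rows = [
    {"country": "US", "browser": "chrome", "clicks": 3},
    {"country": "US", "browser": "safari", "clicks": 2},
    {"country": "MX", "browser": "chrome", "clicks": 5},
]
rollups = build_rollups(rows, ["country", "browser"], "clicks")

# "Total clicks per country" is now a lookup, not a scan:
print(rollups[("country",)][("US",)])   # 5.0
print(rollups[()][()])                  # 10.0 (grand total)
```

The trade-off is the same one a star-tree makes: extra storage for the rollups in exchange for group-by latency that no longer depends on the raw row count.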
-------
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera.
He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more.
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup | Wojciech Biela
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. One key feature in Presto is the ability to query data where it lives via a uniform ANSI SQL interface. Presto’s connector architecture creates an abstraction layer for anything that can be represented in a columnar or row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams, and even proprietary data stores. Furthermore, a single Presto query can combine data from multiple sources, allowing for analytics across an entire organization.
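The connector abstraction described above can be sketched in a few lines. This is an illustrative toy, not Presto's real connector SPI: each "connector" exposes rows from a different backing store through one uniform interface, and the engine can join across sources in a single query.

```python
class Connector:
    """Uniform interface every data source implements (toy version)."""
    def scan(self):
        raise NotImplementedError

class InMemoryConnector(Connector):
    """Stands in for e.g. a relational database or a Kafka topic."""
    def __init__(self, rows):
        self.rows = rows
    def scan(self):
        yield from self.rows

def hash_join(left, right, key):
    """Tiny hash join, as the engine would run over two connectors."""
    index = {}
    for row in left.scan():
        index.setdefault(row[key], []).append(row)
    for row in right.scan():
        for match in index.get(row[key], []):
            yield {**match, **row}

# Two "sources" behind the same interface, joined by one query plan:
users = InMemoryConnector([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}])
events = InMemoryConnector([{"id": 1, "action": "login"}, {"id": 1, "action": "buy"}])
result = list(hash_join(users, events, "id"))
```

The point of the design is that `hash_join` never knows what kind of store produced the rows, which is what lets one query span HDFS, Kafka, and a relational database.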
Conf42 Python - Building Apache NiFi 2.0 Python Processors
https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors
Building Apache NiFi 2.0 Python Processors
Abstract
Let’s enhance real-time streaming pipelines with smart Python code. Adding code for vector databases and LLM.
Summary
Tim Spann: I'm going to be talking today about building Apache NiFi 2.0 Python processors. One of the main purposes of supporting Python in the streaming tool Apache NiFi is to interface with new machine learning, AI, and GenAI libraries. He says Python is a real game changer for Cloudera.
You're just going to add some metadata around it. It's a great way to pass a file along without changing it too substantially. We really need you to have Python 3.10 and, again, JDK 21 on your machine. You've got to be smart about how you use these models.
There are a ton of Python processors available. You can use them in multiple ways. We're still in the early days of Python processors, so now's the time to start putting yours out there. I'd love to see a lot of people write their own.
When we are parsing documents here, again, this is the Python one; I'm picking PDF. There are lots of different things you could do. If you're interested in writing your own Python code for Apache NiFi, definitely reach out. Thanks.
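As a rough illustration of the processor model the talk describes, here is a minimal FlowFileTransform-style processor that rewrites content and "adds some metadata around it". The real base classes live in the `nifiapi` package bundled with NiFi 2.0, so tiny stand-ins are defined here to keep the sketch self-contained; treat the class and method names as illustrative rather than a definitive copy of the API.

```python
class FlowFileTransformResult:          # stand-in for nifiapi's result class
    def __init__(self, relationship, contents=None, attributes=None):
        self.relationship = relationship
        self.contents = contents
        self.attributes = attributes or {}

class FlowFileTransform:                # stand-in for nifiapi's base class
    def __init__(self, **kwargs):
        pass

class UppercaseText(FlowFileTransform):
    """Uppercases flowfile content and records its length as an attribute --
    the 'pass the file along, just add metadata' pattern from the talk."""

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"chars": str(len(text))},
        )

# Minimal fake flowfile so the processor can be exercised outside NiFi:
class FakeFlowFile:
    def __init__(self, data):
        self._data = data
    def getContentsAsBytes(self):
        return self._data

result = UppercaseText().transform(None, FakeFlowFile(b"hello nifi"))
```

Inside NiFi the framework, not your code, would construct the flowfile and route the result to the `success` relationship.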
SIM RTP Meeting - So Who's Using Open Source Anyway? | Alex Meadows
Open Source has been around for several decades now, but there is still a bit of mystery around what makes open source work and concern about using it in the enterprise. Open Source technologies are being widely used in many industries, including analytics, software development, social media, data center management, and more.
The discussion will be moderated by Julie Batchelor and panelists include:
* Todd Lewis, Open Source evangelist
* Jason Hibbets, Open Source Community Manager
* Jim Salter, Co-Owner and Chief Technology Officer at Openoid, LLC
* Alex Meadows, data scientist
An overview of Hadoop and data warehousing from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the Big Data field.
Lecture to the London S2DS students.
Some fun in highlighting that I'm their polar opposite (no schooling since 17, and focused on operations, not science).
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Enabling Python to be a Better Big Data CitizenWes McKinney
These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We released the first Apache release (v0.5.0-incubating) on Mar 5, 2018, and the project plans to release v0.5.2 in Q2 2018.
We will first give a quick walk-through of features, usages, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, such as DataFrame integration and Spark 2.3 support in Hivemall.
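Hivemall packages training as Hive UDFs/UDTFs that emit (feature, weight) rows into a table, and prediction becomes a join plus a weighted sum over that table. The pattern can be sketched in plain Python; this is illustrative only (not Hivemall's code, and the tiny two-example dataset and feature names are made up):

```python
import math

def train_logreg(examples, epochs=50, lr=0.1):
    """SGD logistic regression; returns {feature: weight}, i.e. the
    same shape as the (feature, weight) rows a training UDTF would emit."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            margin = sum(weights.get(f, 0.0) * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-margin))   # sigmoid
            grad = p - label
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) - lr * grad * v
    return weights

def predict(weights, features):
    """Prediction = join features with the weights table, sum, sigmoid."""
    margin = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-margin))

examples = [
    ({"bias": 1.0, "word:good": 1.0}, 1),
    ({"bias": 1.0, "word:bad": 1.0}, 0),
]
weights = train_logreg(examples)
```

Representing the model as feature/weight rows is what makes the approach fit Hive: both training output and scoring are expressible as ordinary tables and joins.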
Apache Hivemall and my OSS experience
1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Principal Engineer
Makoto Yui @myui
Apache Hivemall and my OSS experience
2. Plan of the talk
1. Introduction to myself, my company, and my OSS experience (5m)
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
2. Introduction to Apache Hivemall (15m)
• What’s Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
3. Apache Incubation process (5m)
• How did Hivemall get into the ASF incubator? Why ASF?
• What’s required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (on-going) ASF incubation experience
3. About Me: Makoto Yui @myui
○ Leading the development of Apache Hivemall (incubating at ASF)
○ ML Engineer with DB system research background
● Developing ML features (and underlying systems) at a SaaS company
■ Joined Treasure Data in April 2015 (4 years ago)
■ Working at the Tokyo branch of a Silicon Valley company (acquired by Arm in July 2018)
● Ph.D. (CS) in 2009 at NAIST
■ Majored in parallel database systems and XML native database systems
(e.g., non-blocking lock-free DB buffer management at ICDE 2010)
● As a DB researcher
■ Postdoc at CWI (MonetDB team in CWI Amsterdam; columnar in-memory DB pioneer)
■ 5 years at AIST (National research institute in Tsukuba) as a Senior Researcher
● Past and current interests
■ Query+FP Language → Parallel DB → In-database Analytics (OLAP++) → Scalable Machine Learning (now) → ?
4. My OSS history
○ Big fan of OSS since I was an undergraduate student
● I was using Red Hat Linux on my laptop
○ Intern at a small startup
● FreeBSD 4.2+, PostgreSQL 6.4+ to 7.x, PHP 4, and plain old C
● PostgreSQL and GLib (not glib) were my favorite projects
○ XpSQL at Gborg (my first OSS project)
● Funded by a government fund for young software engineers
● My Bachelor thesis in 2003:
Building a multi-functional XML database environment using RDBMS
5. What interested you in open source?
○ The Linux movement when I was an undergraduate student
○ Interested in the well-designed code of Postgres
● ❤ Data structures and algorithms for big data
■ b+-trees were much more interesting to me than (in-memory) binary trees
○ Communication with other excellent engineers from other organizations
○ More interested in library development than application development
○ Good for career development (hard to find a job with no GitHub repos)
○ Why not OSS?
● Not so many excellent talents in a single organization for library development
● Developers generally prefer standard OSS libraries (avoiding vendor/company lock-in)
👍 for Open over Closed
6.
Arm Treasure Data
Company Profile
8.
Treasure Data Founders
Hironobu Yoshikawa
CEO & Co-Founder
Open source business veteran
Kazuki Ohta
CTO & Co-Founder
Founder of the world's largest Hadoop user group
Sadayuki Furuhashi
Engineer & Co-Founder
MessagePack, Fluentd Inventor
9.
Engineers are actively contributing to Presto, Hive, Hadoop, Rails, Ruby, and React, among others.
10.
We open-source! TD invented:
○ Fluentd - streaming log collector
○ Embulk - bulk data import/export
○ MessagePack - efficient binary serialization
○ Hivemall - machine learning on Hadoop
○ Digdag - workflow engine
○ Fluent Bit - embedded version of Fluentd
15.
Plan of the talk
1. Introduction to myself, company, and my OSS experience
2. Introduction to Apache Hivemall
3. Apache Incubation process
• What's Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
• How did Hivemall get into the ASF incubator? Why ASF?
• What's required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (ongoing) ASF incubation experience
16.
Project overview – Apache Hivemall
○ Scalable machine learning library for Apache Hive/Spark/Pig
○ Initially released in 2014 when I was a researcher at AIST
● InfoWorld Bossie Awards 2014: the best open source big data tools
● Talked at Hadoop Summit 2014 (got lots of attention)
● 500+ GitHub stars and 150+ forks
● 15 contributors before joining the ASF incubator
○ Incubating since Sept 2016
● Recruited mentors from Hortonworks, Databricks, Microsoft, and Pivotal
● Contributors from Treasure Data, NTT, and other individuals
○ Planning to graduate from the incubator in 2020
● Needs more ASF releases and external contributions (community growth)
17. BigQuery ML at Google I/O 2018
https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
Hadoop Conf Japan - Mar 14, 2019
18.
Open-source Machine Learning Solution
for SQL-on-Hadoop
hivemall.apache.org (incubating)
28.
SELECT train_xgboost_classifier(features, label) AS (model_id, model)
FROM training_data

XGBoost support in Hivemall (beta version)

SELECT rowid, AVG(predicted) AS predicted
FROM (
  -- predict with each model
  SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  -- join each test record with each model
  FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
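The prediction query fans every test row out across all trained models (CROSS JOIN), scores each pair, then averages per rowid (GROUP BY + AVG). The same ensemble-averaging shape in plain Python, with a purely hypothetical toy scorer standing in for the real XGBoost UDFs:

```python
from collections import defaultdict

# Toy stand-ins: each "model" is just a bias term here (illustrative only)
models = {"m1": 0.2, "m2": 0.4, "m3": 0.3}
test_data = {1: [1.0, 2.0], 2: [0.5, 0.1]}  # rowid -> features

def xgboost_predict(features, bias):
    # hypothetical scorer standing in for the real UDF
    return sum(features) + bias

# CROSS JOIN: every (rowid, model) pair gets a prediction ...
scores = defaultdict(list)
for rowid, features in test_data.items():
    for model_id, bias in models.items():
        scores[rowid].append(xgboost_predict(features, bias))

# ... then GROUP BY rowid with AVG(predicted)
predicted = {rowid: sum(s) / len(s) for rowid, s in scores.items()}
print(predicted)
```

Averaging over independently trained models is a simple bagging-style ensemble; the SQL expresses it declaratively, letting Hive parallelize the join and the aggregation.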
29. 2018/2/17 HackerTackle
Real-world ML pipelines (could be more complex):
Data source #1 / #2 / #3 → Extract Feature → Join →
Feature Engineering (Feature Scaling, Feature Hashing, Feature Selection) →
Train by Logistic Regression / RandomForest / Factorization Machines →
Ensemble → Evaluate → Predict
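Of the feature-engineering steps in that pipeline, feature hashing is the one that most often puzzles newcomers: it maps arbitrary (possibly unbounded) feature names into a fixed-size index space so model weights fit in a dense array. A minimal sketch, assuming a CRC32 hash and a 2^16 space (both illustrative choices, not Hivemall's actual defaults):

```python
import zlib

NUM_FEATURES = 2 ** 16  # illustrative hash-space size

def hash_feature(name):
    """Map an arbitrary feature name to a bounded integer index."""
    return zlib.crc32(name.encode()) % NUM_FEATURES

# Raw features can be categorical crosses, URLs, anything string-valued
raw = {"user:alice": 1.0, "page:/home": 1.0, "hour_of_day": 13.0}
hashed = {hash_feature(k): v for k, v in raw.items()}
print(hashed)  # sparse vector keyed by hashed indices
```

The trade-off is occasional hash collisions in exchange for constant memory, which is what makes the technique practical at Hadoop scale.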
37.
Plan of the talk
1. Introduction to myself, company, and my OSS experience
2. Introduction to Apache Hivemall
3. Apache Incubation process
• What's Hivemall? - project overview
• Deep dive into the features of Hivemall
• Open-source use cases and internal use cases at Treasure Data
• Who am I? What interested me in open source?
• Treasure Data? How does my company cope with OSS?
• How did Hivemall get into the ASF incubator? Why ASF?
• What's required, and what is the hardest part of it?
• What is the most overlooked part of the incubation process?
• Lessons learned from my (ongoing) ASF incubation experience
38.
○ Personal projects never live long :-(
○ Got a lot of attention at Hadoop Summit 2014
● A Hortonworks developer recommended that Hivemall join the ASF incubator
● An Apache Pig developer was evaluating Hivemall (recruited him as an initial project member)
○ Apache is a trusted brand for developers
● ASF's meritocracy model
● The Apache Way: open governance, community over code
○ ASF is a natural choice
● Hivemall runs on top of the ASF Hadoop ecosystem
How did Hivemall get into the Apache Incubator? Why ASF?
39.
○ Recruiting a champion/mentors
● Need to recruit ASF veterans who know the ASF incubation process well
■ Our CTO is a friend of Roman (a previous Incubator Chair) and introduced him
● Big companies hire ASF member(s) for incubating a project
● If your company has 2-3+ ASF members, the process will be smoother
https://wiki.apache.org/incubator/HivemallProposal
○ ASF members' assistance/votes are mandatory in the incubation process
● Release votes require three +1s from ASF members
● Project setup, mentor sign-off for project reports
● Toward the graduation process
Hardest part of the Apache Incubator
- Mentors can become unresponsive over time (e.g., due to job role changes)
- Volunteering is limited (without $$ possibilities)
- Most graduated projects are developed mainly by company-hired engineers (Cloudera/Hortonworks/IBM ...) with external contributions (small patches)
40.
○ Community building
● Not mandatory for Incubator projects, but expected to build an active community
● Usually, company-backed engineers work hard on developing core features
■ A <user@> mailing list is not required; <dev@> is the place for discussion in the Incubator
■ Not so many active developers
● Meetup(s)
■ Held 4 meetups in Tokyo in total
■ Location problems (better to have one in the US; lack of connections)
○ Overlooked costs of ASF incubation
● The release process is restricted by Incubator policy (e.g., votes and license inspection)
● Time spent on the incubation process
■ Incubation reports, a project status page, a project page
■ Artifact distribution procedures
Hardest part of the Apache Incubator (for us)
41.
Lesson learned: engineering trends change rapidly
Postmortem: Better to incubate early and graduate soon
2014 was the peak; 2016 was too late to join the ASF incubator
Apart from frameworks, a standalone library has a longer life cycle
42.
OSS projects (apart from Hivemall)
A full-fledged B+-tree written in pure Java (to appear)
https://github.com/myui/btree4j
The B+-tree is a widely known data structure, but there are no good OSS libraries for it. LSM-trees and Masstree are based on B+-trees.
Extracted as a library from my past work on an XML-native DB
Currently preparing a project page and a performance comparison