This document introduces Hivemall, an open-source machine learning library built as Hive UDFs. It summarizes new features in version 0.4, including Random Forest and Factorization Machine algorithms. The speaker then outlines the development roadmap, with plans to add Gradient Tree Boosting, Field-aware Factorization Machines, Online LDA, and a Mix server in upcoming versions. Real-world use cases of Hivemall are also briefly mentioned.
6. Hivemall’s Vision: ML on SQL
Classification with Mahout
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
2015/10/20 Hivemall meetup #2 6
7. List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA
kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization
Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion
Anomaly Detection
✓ Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3
8. Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc. and more
➢ Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.
➢ Churn Detection
• Algorithm: Regression
• OISIX and more
➢ Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Adtech Companies, ISP portal, and more
➢ Value prediction of real estate
• Algorithm: Regression
• Livesense
10. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
14. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
map-only task to learn a prediction model
Shuffle map-outputs to reducers by feature
Reducers perform model averaging in parallel
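The model-averaging step in the training query above can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall's implementation: it assumes each map task has already produced its own partial model as a feature-to-weight mapping, and it mimics only what `avg(weight) ... GROUP BY feature` does with those rows. The function name `average_models` and the sample values are hypothetical.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across partial models.

    Mirrors `avg(weight) ... GROUP BY feature`: a feature is averaged only
    over the partial models that actually emitted a weight for it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical partial models, one per map task.
partials = [{"f1": 1.0, "f2": -2.0}, {"f1": 3.0}]
averaged = average_models(partials)
```

Because the averaging is keyed by feature, each reducer can handle its own subset of features independently, which is why this step parallelizes.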
15. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of Confidence Weighted Classifier
Vote to use negative or positive weights for avg
+0.7, +0.3, +0.2, -0.1, +0.7
Training for the CW classifier
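The voting idea behind `voted_avg` can be sketched in a few lines of Python. This is an illustration of the behaviour described on the slide (vote on the sign, then average only the winning side), not Hivemall's source code. How ties and zero weights are handled here is an assumption made for the sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote.

    Assumptions for this sketch: zero weights do not vote, and a tie
    falls back to the negative side.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The weights shown on the slide: four positive votes against one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

For the slide's example the positive weights win the vote, so the result is the mean of 0.7, 0.3, 0.2 and 0.7, which is approximately 0.475. The single negative weight is discarded rather than pulling the average down.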
16. Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
  ) t
group by label,feature;
22. Features to be supported in Hivemall v0.4
1. Random Forest
• classification, regression
• Based on Smile (github.com/haifengl/smile)
2. Factorization Machine
• classification, regression (factorization)
v0.4 is planned for release in Oct.
Factorization Machines are often used by winners of data science competitions (Criteo/Avazu CTR prediction)
40. Conclusion and Takeaway
New features in v0.4
• Random Forest
• Factorization Machine
More will follow in v0.4.1
Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
The latest version of Hivemall is available on Treasure Data and used by several companies, including OISIX, Livesense, Scaleout, and Freakout.
49. Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid features
1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2 ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3 ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Anomaly detection runs on a series of SQL queries
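The rows above store each feature as a "name:value" string inside an array. As a rough illustration of that layout, the Python sketch below splits such strings into a name-to-value mapping. The helper name `parse_features` is hypothetical, and the sketch assumes every entry carries an explicit numeric value after the last colon. It is not part of Hivemall.

```python
def parse_features(features):
    """Turn ["name:value", ...] strings into a {name: float(value)} mapping.

    Splits on the last colon, so a feature name may itself contain colons.
    """
    parsed = {}
    for item in features:
        name, _, value = item.rpartition(":")
        parsed[name] = float(value)
    return parsed


# First row of the table on the slide.
row = ["reflectance:0.5252967", "specific_heat:0.19863537", "weight:0.0"]
vector = parse_features(row)
```

With the features in this form, distance-based methods such as the Local Outlier Factor listed among the v0.3.2 features can compare rows as numeric vectors.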