This document summarizes a presentation about the Apache Hivemall machine learning library. Hivemall is a scalable machine learning library built as a collection of Hive UDFs. It allows SQL developers to easily build and run machine learning models in parallel on large datasets. The presentation highlights new features in version 0.5.0 such as anomaly detection algorithms and topic modeling, as well as support for Spark and other platforms. It also demonstrates how to use Hivemall algorithms like random forests and change point detection.
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way.
While this feature is certainly useful, it can be quite cumbersome to manipulate data inside complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, doing so is time-consuming, non-performant, and non-trivial. During this talk we will discuss some of the commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows a user to manipulate complex data such as arrays.
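To make the idea concrete, here is a minimal plain-Python sketch of what higher-order functions over array columns do. It mimics the semantics of SQL expressions such as `transform(values, x -> x + 1)` on an in-memory list of rows; the row data and the helper names `transform`, `array_filter` and `aggregate` are illustrative assumptions, and the code does not use Spark itself.

```python
from functools import reduce

# Hypothetical rows with a nested array column, similar to an ARRAY<INT> column.
rows = [
    {"id": 1, "values": [1, 2, 3]},
    {"id": 2, "values": [10, 20]},
    {"id": 3, "values": []},
]

def transform(array, fn):
    """Apply fn to every element, like transform(values, x -> ...)."""
    return [fn(x) for x in array]

def array_filter(array, predicate):
    """Keep the elements for which predicate holds, like filter(values, x -> ...)."""
    return [x for x in array if predicate(x)]

def aggregate(array, zero, merge):
    """Fold the array into one value, like aggregate(values, zero, (acc, x) -> ...)."""
    return reduce(merge, array, zero)

# The lambdas play the role of the SQL lambda expressions; the array never has
# to be exploded into separate rows and grouped back together.
result = [
    {
        "id": row["id"],
        "plus_one": transform(row["values"], lambda x: x + 1),
        "evens": array_filter(row["values"], lambda x: x % 2 == 0),
        "total": aggregate(row["values"], 0, lambda acc, x: acc + x),
    }
    for row in rows
]
```

The point of the sketch is that each array is processed in place, element by element, which is what makes this style simpler than flattening the nested data first.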
Recent world #1 Kaggle Grandmaster and Research Data Scientist at H2O.ai, Marios Michailidis, will delve into the competitive edge that Driverless AI brings out of the box.
Driverless AI can easily score in the top 5% in popular data science challenges against thousands of participants in a matter of minutes with limited processing power.
Apart from the actual predictions, one can use Driverless AI data munging and derived knowledge of the data to build even more powerful models.
This webinar discusses how Driverless AI can get competitive scores in popular Kaggle challenges. Also, Marios will explain the concepts of hyper-parameter tuning and stacking and how they help to make stronger predictions.
Bio:
Former world no. 1 Kaggle Grandmaster, Marios Michailidis, is now a Research Data Scientist at H2O.ai. He is finishing his PhD in machine learning at University College London (UCL) with a focus on ensemble modeling. His previous education includes a B.Sc. in Accounting and Finance from the University of Macedonia in Greece and an M.Sc. in Risk Management from the University of Southampton. He has gained exposure to the marketing and credit sectors in the UK market and has successfully led multiple analytics projects based on a wide array of themes.
Before H2O.ai, Marios held the position of Senior Personalization Data Scientist at dunnhumby, where his main role was to improve existing algorithms, research the benefits of advanced machine learning methods, and provide data insights. He created a matrix factorization library in Java along with a demo version of a personalized search capability. Prior to dunnhumby, Marios held positions of importance at iQor, Capita, British Pearl, and Ey-Zein.
At a personal level, he is the creator and administrator of KazAnova, a freeware GUI for quick credit scoring and data mining written entirely in Java. He is also the creator of the StackNet Meta-Modelling Framework.
ObjectLayout: Closing the (last?) inherent C vs. Java speed gap (Azul Systems Inc.)
In this presentation, Azul CTO Gil Tene provides a description of the ObjectLayout.org effort, with a focus on StructuredArray. The goal of this effort is to match the raw speed benefits C-based languages get from commonly used forms of memory layout and make these benefits available for "Plain Old Java Object" (POJO) use.
Configuring Mahout Clustering Jobs - Frank Scholten (lucenerevolution)
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
For more than a decade, internet search engines have helped users find the documents they are looking for. However, what if users aren't looking for anything specific but want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are the clustered search results of Google News, or tag clouds which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large-scale document clustering. This talk introduces clustering in general and shows you step by step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
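As a rough illustration of document clustering in general, the following plain-Python sketch groups a few toy documents with k-means on word-count vectors and labels each cluster with its heaviest term. The documents, the choice of k-means, the fixed starting centroids and all function names are assumptions made for this example; it does not use Mahout or Hadoop and is not the configuration shown in the talk.

```python
from collections import Counter

# Hypothetical toy corpus: two documents about storage, two about search.
docs = [
    "hadoop cluster storage hadoop",
    "hadoop storage nodes",
    "search query ranking search",
    "query ranking relevance",
]

vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(text):
    """Turn a document into a word-count vector over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, centroids, iterations=10):
    """Plain k-means: assign each vector to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for index, vector in enumerate(vectors):
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(vector, centroids[c]))
            clusters[nearest].append(index)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean([vectors[i] for i in members]) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return clusters, centroids

vectors = [vectorize(doc) for doc in docs]
# Deterministic start: use the first and third documents as initial centroids.
clusters, centroids = kmeans(vectors, centroids=[vectors[0], vectors[2]])

# A crude "tag" per cluster: the vocabulary term with the largest centroid weight.
labels = [vocab[max(range(len(vocab)), key=lambda j: centroid[j])]
          for centroid in centroids]
```

A production job would differ in almost every detail (term weighting, distance measure, number of clusters, distributed execution), but the assign-then-average loop is the core of the technique.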
Gruter_TECHDAY_2014_04_TajoCloudHandsOn (in Korean) (Gruter)
Big data analysis using Tajo on AWS (Hands-on session)
- presented by Young-kyong Ko, data analyst at Gruter
- at Gruter TECHDAY 2014 (Oct. 29 Seoul, Korea)
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre... (Yao Yao)
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms:
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph
Beyond EXPLAIN: Query Optimization From Theory To Code (Yuto Hayamizu)
EXPLAIN is too much explained. Let's go "beyond EXPLAIN".
This talk takes you on a backstage tour of the optimizer: from the theoretical background of state-of-the-art query optimization to a close look at the current implementation in PostgreSQL.
JavaOne 2016: Code Generation with JavaCompiler for Fun, Speed and Business P... (Juan Cruz Nores)
On-the-fly bytecode generation is generally known to be super efficient, but also super difficult to implement and debug. Instead of trying to generate bytecode for the JVM, you can leverage the built-in Java compiler: generate Java code as a string, compile that to bytecode and then have that executed. This gives you better code efficiency, is easier to implement, and is straightforward to debug. We’ll cover on-the-fly code generation, execution and debugging, working with HotSpot and G1 using dynamic code, as well as how to optimize for engineer implementation time: maximum gain in minimum time. We’ll use practical examples and code snippets, so you can be ready to make the core processing for your business 10x faster.
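The generate-source, compile, execute cycle described above can be sketched in a few lines. The talk is about Java's built-in compiler; the example below is only an analogy in Python, using the built-in `compile` and `exec` functions, and the `build_filter` helper, the field name and the sample records are hypothetical.

```python
def build_filter(field, op, value):
    """Generate source for a specialised predicate, compile it, and load it."""
    # Only whitelisted operators are ever spliced into the generated source.
    if op not in {"<", "<=", "==", "!=", ">=", ">"}:
        raise ValueError(f"unsupported operator: {op}")
    # Step 1: build the source code as a plain string.
    source = (
        "def predicate(record):\n"
        f"    return record[{field!r}] {op} {value!r}\n"
    )
    # Step 2: compile the string to a code object (Python bytecode).
    code = compile(source, filename="<generated>", mode="exec")
    # Step 3: execute the code object so the function is defined in a namespace.
    namespace = {}
    exec(code, namespace)
    return namespace["predicate"], source

is_large, generated_source = build_filter("amount", ">", 100)

orders = [{"amount": 50}, {"amount": 150}, {"amount": 300}]
large_orders = [order for order in orders if is_large(order)]
```

Because the generated artifact is ordinary source text, it can be printed, logged and stepped through, which is the debugging advantage the abstract attributes to this approach over emitting bytecode directly.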
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS (Sumant Tambe)
When it comes to sending data across a network, applications send either binary or self-describing data (XML). Both approaches have merits. Data Distribution Service (DDS) combines the best of both in what’s called “data-centric messaging”. DDS shares the type description once, upfront, and later on sends binary data that meets the type description. You typically use IDL or XSD to specify the types and run them through a code generator for type-safe wrapper APIs for your application in your programming language. Simple and fast! As it turns out, however, C++11 bends the rules once again. In this presentation you will learn about a template-based C++11 messaging library that gives the DDS code generator a run for its money. The types and objects in your C++11 application are mapped to standard DDS X-Types type descriptions and serialized format, respectively, using template meta-programming. If you have never heard about SFINAE you won’t stop talking about it after you see "overloading in overdrive" in this presentation. What’s more? I will share my newfound hatred for std::vector of bool/enums. This presentation will cover DDS-XTypes, DDS_TypeCode, DDS_DynamicData, STL, type_traits, Boost Fusion, and overloading with enable_if (lots and lots of it!).
A long time ago in a galaxy far, far away...
Java open source developers managed to see the previously secret plans to the Empire's ultimate weapon, the JAVA™ COLLECTIONS FRAMEWORK.
Evading the dreaded Imperial Starfleet, a group of freedom fighters investigate the performance of the Empire’s most popular weapons: LinkedList, ArrayList and HashMap. In addition, they investigate common developer errors and bugs to help protect their vital software. With this newfound knowledge they strike back!
Pursued by the Empire's sinister agents, JDuchess races home aboard her JVM, investigating proposed future changes to the Java Collections and other options such as Immutable Collections which could save her people and restore freedom to the galaxy....
The search for faster computing remains of great importance to the software community. Relatively inexpensive modern hardware, such as GPUs, allows users to run highly parallel code on thousands, or even millions of cores on distributed systems.
Building efficient GPU software is not a trivial task, often requiring a significant amount of engineering hours to attain the best performance. Similarly, distributed computing systems are inherently complex. In recent years, several libraries were developed to solve such problems. However, they often target a single aspect of computing, such as GPU computing with libraries like CuPy, or distributed computing with Dask.
Libraries like Dask and CuPy tend to provide great performance while abstracting away the complexity from non-experts, making them great candidates for developers writing software for a variety of applications. Unfortunately, they are often difficult to combine, at least efficiently.
With the recent introduction of NumPy community standards and protocols, it has become much easier to integrate any libraries that share the already well-known NumPy API. Such changes allow libraries like Dask, known for its easy-to-use parallelization and distributed computing capabilities, to defer some of that work to other libraries such as CuPy, providing users the benefits from both distributed and GPU computing with little to no change in their existing software built using the NumPy API.
Complex analytics should work as nimbly on extremely large data sets as on small ones. You don’t want to think about whether your data fits in-memory, about parallelism, or formatting data for math packages. You’d like to use your favorite analytical language and have it transparently scale up to Big Data volumes.
Paradigm4 presents a webinar about SciDB—the massively scalable, open source, array database with native complex analytics, integrated with R and Python.
Details:
Presenter: Bryan Lewis, Chief Data Scientist, Paradigm4
Day/Time: Tuesday November 12th, 2013 at 1pm EST
Learn how SciDB enables you to:
-Explore rich data sets interactively
-Do complex math in-database, without being constrained by memory limitations
-Perform multi-dimensional windowing, filtering, and aggregation
-Offload large computations to a commodity hardware cluster—on-premise or in a cloud
-Use R and Python to analyze SciDB arrays as if they were R or Python objects.
-Share data among users, with multi-user data integrity guarantees and version control
Webinar Agenda:
-Introduction to SciDB
-Demo
-Live Q&A
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
A Hands-on Intro to Data Science and R Presentation.ppt (Sanket Shikhar)
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud (Jaipaul Agonus)
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan... (Jürgen Ambrosi)
In this session we will see, with our usual practical hands-on demo approach, how to use the R language to carry out value-added analyses.
We will see first-hand the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
This session features Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We have released the first Apache release (v0.5.0-incubating) on Mar 5, 2018 and the project plans to release v0.5.2 in Q2, 2018.
We will first give a quick walk-through of the features, usage, what's new in v0.5.0, and the future roadmap of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, including DataFrame integration and Spark 2.3 support in Hivemall.
Tensors Are All You Need: Faster Inference with Hummingbird (Databricks)
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs). Traditional machine learning (ML) such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve a 1000x speedup in inferencing on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to reduce the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
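The outline item on converting a tree model into a tensor-based model can be illustrated with a small sketch. One known way to express decision-tree evaluation as matrix operations is shown below in plain Python; the tree, the matrix names `A` to `E` and the leaf values are assumptions made for this example, and the abstract does not say that Hummingbird's converter works exactly this way.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical tree over two features:
#   node0: x[0] < 5 ? go to node1 : leaf2
#   node1: x[1] < 2 ? leaf0 : leaf1
A = [[1, 0],            # features x internal nodes: which feature each node tests
     [0, 1]]
B = [5.0, 2.0]          # threshold of each internal node
C = [[1, 1, -1],        # internal nodes x leaves: +1 if the leaf lies in the node's
     [1, -1, 0]]        # left subtree, -1 if in its right subtree, 0 if unrelated
D = [2, 1, 0]           # per leaf: number of ancestors whose left branch leads to it
E = [[10.0], [20.0], [30.0]]  # value stored at each leaf

def predict(X):
    """Evaluate the tree for a batch of rows using only matrix-style operations."""
    XA = matmul(X, A)                                             # pick tested features
    T = [[1 if v < t else 0 for v, t in zip(row, B)] for row in XA]  # node decisions
    TC = matmul(T, C)                                             # score every leaf
    onehot = [[1 if v == d else 0 for v, d in zip(row, D)] for row in TC]
    return [row[0] for row in matmul(onehot, E)]                  # read the leaf value

def predict_by_traversal(x):
    """Reference implementation: ordinary if/else traversal of the same tree."""
    if x[0] < 5.0:
        return 10.0 if x[1] < 2.0 else 20.0
    return 30.0

samples = [[3.0, 1.0], [3.0, 4.0], [7.0, 1.0], [7.0, 4.0]]
batch_predictions = predict(samples)
```

Every input row goes through the same fixed sequence of multiplications and comparisons with no data-dependent branching, which is the property that lets this kind of formulation run on tensor hardware.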
Metadata and Provenance for ML Pipelines with Hopsworks (Jim Dowling)
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Scalable Data Science in Python and R on Apache Spark (felixcss)
In the world of Data Science, Python and R are very popular. Apache Spark is a highly scalable data platform. How could a Data Scientist integrate Spark into their existing Data Science toolset? How does Python work with Spark? How could one leverage the rich 10000+ packages on CRAN for R?
We will start with PySpark, beginning with a quick walkthrough of data preparation practices and an introduction to Spark MLLib Pipeline Model. We will also discuss how to integrate native Python packages with Spark.
Compared to PySpark, SparkR is a new language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to talking about the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that could be used to work with your favorite R packages.
Python
R
Apache Spark
ML
DL
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... (Chester Chen)
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism, and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
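The "roofline" limit mentioned in the abstract is commonly written as the minimum of the compute ceiling and the memory-bandwidth ceiling. The sketch below computes that bound; the hardware figures and kernel intensities are made-up values for illustration, not measurements from BIDMach or Kylix.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: a kernel is capped by compute or by memory traffic,
    whichever ceiling is lower at its arithmetic intensity (FLOPs per byte)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Hypothetical accelerator: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

# The "ridge point" is the intensity at which the two ceilings meet.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

# A sparse kernel doing little arithmetic per byte moved is bandwidth-bound.
sparse_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, flops_per_byte=0.25)

# A dense kernel with high data reuse is compute-bound.
dense_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, flops_per_byte=50.0)
```

Driving a kernel "toward the roofline" then means closing the gap between its measured throughput and the bound for its arithmetic intensity.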
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, eBay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
INTERFACE by apidays 2023 - Open Source ML, Omar Sanseviero, Hugging Face (apidays)
INTERFACE by apidays 2023
APIs for a “Smart” economy. Embedding AI to deliver Smart APIs and turn into an exponential organization
June 28 & 29, 2023
Open Source ML - from pretrained models to production
Omar Sanseviero, Machine Engineering Lead, Hugging Face
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
Similar to What's new in Apache Hivemall v0.5.0 (20)
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
3. What is Apache Hivemall
Scalable machine learning library built
as a collection of Hive UDFs
Multi/Cross platform
Versatile
Scalable
Ease-of-use
2018/4/17 Hivemall meetup
4. Hivemall is easy and scalable …
ML made easy for SQL developers
Born to be parallel and scalable
Ease-of-use
Scalable
100+ lines
of code
CREATE TABLE lr_model AS
SELECT
feature, -- reducers perform model averaging in parallel
avg(weight) as weight
FROM (
SELECT logress(features,label,..) as (feature,weight)
FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
This query automatically runs in parallel on Hadoop
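The model-averaging step in the query above (mappers emit per-feature weights, reducers take avg(weight) per feature) can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall code: the function name average_models and the sample weights are hypothetical, and each dict stands in for the (feature, weight) rows emitted by one map task.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average each feature's weight over the partial models that emitted it.

    Mirrors `SELECT feature, avg(weight) ... GROUP BY feature`: a feature is
    averaged only over the map tasks that produced a row for it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:  # one dict per (hypothetical) map task
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical partial models learned independently on two data splits.
mapper_a = {"age": 0.5, "income": 1.0}
mapper_b = {"age": 1.5, "clicks": -2.0}
model = average_models([mapper_a, mapper_b])
```

Because every mapper trains independently and the only shared step is the per-feature average, the training work parallelizes without coordination between tasks.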
5. Hivemall is a multi/cross-platform ML library
HiveQL SparkSQL/Dataframe API Pig Latin
Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
13. Generic Classifier/Regressor
Available Loss functions
For Binary Classification:
• HingeLoss
• LogLoss (synonym: logistic)
• SquaredHingeLoss
• ModifiedHuberLoss
For Regression:
• Squared Loss
• Quantile Loss
• Epsilon Insensitive Loss
• Squared Epsilon Insensitive Loss
• Huber Loss
Optimizer
• SGD
• AdaGrad
• AdaDelta
• ADAM
Regularization
• L1
• L2
• ElasticNet
• RDA
Other options
• Iteration support
• mini-batch
• Early stopping
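For readers unfamiliar with the loss names on this slide, here is a minimal Python sketch of the textbook definitions of several of them. It assumes binary labels y in {-1, +1} and a raw model score, and uses common default constants (delta, epsilon, the 0.5 factor); these are assumptions for illustration and may differ from the exact formulas and scaling Hivemall uses internally.

```python
import math


def hinge_loss(score, y):
    """Zero once the example is classified with margin >= 1."""
    return max(0.0, 1.0 - y * score)


def squared_hinge_loss(score, y):
    """Hinge loss squared: smoother near the margin, harsher far from it."""
    return hinge_loss(score, y) ** 2


def log_loss(score, y):
    """Logistic loss log(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))


def squared_loss(pred, target):
    """Half squared error; the 0.5 factor is only a convention."""
    return 0.5 * (pred - target) ** 2


def huber_loss(pred, target, delta=1.0):
    """Quadratic for small residuals, linear beyond delta."""
    r = abs(pred - target)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def epsilon_insensitive_loss(pred, target, epsilon=0.1):
    """Ignores residuals smaller than epsilon."""
    return max(0.0, abs(pred - target) - epsilon)
```

For example, hinge_loss(0.5, -1) evaluates to 1.5 because the score has the wrong sign, while hinge_loss(2.0, 1) is 0.0 because the example is already beyond the margin.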
14. Generic Classifier/Regressor Hyperparameters
-eta0 <arg> The initial learning rate [default 0.1]
-iter,--iterations <arg> The maximum number of iterations [default: 10]
-lambda <arg> Regularization term [default 0.0001]
-loss,--loss_function <arg> Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
-mini_batch,--mini_batch_size <arg> Mini batch size [default: 1]. Expecting the value in range [1,100] or so.
-opt,--optimizer <arg> Optimizer to update weights [default: adagrad, sgd, adadelta, adam]
-reg,--regularization <arg> Regularization type [default: rda, l1, l2, elasticnet]
AdaGrad + RDA by default
19. Training of RandomForest
Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vectors!
train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
20. Supported Feature Vector Format of Random Forests
• Dense Vector (array<double>)
1.0, 0.0, 3.0
• Sparse Vector (array<string>) in a LIBSVM format
1:1.0, 2:0.0, 3:3.0
1:1.0, 3:3.0
1:1.0, 3
• feature := <index>[":"<value>]
where index := <integer> starting with 1 (index = 0 is reserved for bias clause)
and value := <floating point> (default 1.0 if not provided)
select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999","movieid#2331"));
["1828616:3.3","6238429:4.999","6238429"]
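The sparse feature grammar above (an index starting at 1, then an optional ":value" that defaults to 1.0) can be made concrete with a small parser. This is a minimal Python sketch for illustration only: parse_feature and to_dense are hypothetical helpers, not Hivemall functions, and mapping index i to dense position i - 1 is an assumption chosen to match the dense and sparse examples on the slide.

```python
def parse_feature(feature):
    """Parse '<index>[:<value>]' into (index, value); value defaults to 1.0."""
    index_text, sep, value_text = feature.partition(":")
    index = int(index_text)
    if index < 1:
        # Index 0 is reserved for the bias clause, so data features start at 1.
        raise ValueError("feature index must be >= 1: %r" % feature)
    value = float(value_text) if sep else 1.0
    return index, value


def to_dense(features, size):
    """Expand sparse features into a dense list; missing indices become 0.0."""
    dense = [0.0] * size
    for feature in features:
        index, value = parse_feature(feature)
        dense[index - 1] = value  # assumed mapping: index 1 -> position 0
    return dense


dense_a = to_dense(["1:1.0", "3:3.0"], 3)  # same vector as "1.0, 0.0, 3.0"
dense_b = to_dense(["1:1.0", "3"], 3)      # "3" carries the default value 1.0
```

Under these assumptions the first two sparse examples on the slide describe the same vector as the dense example, while the last one sets the third feature to the default 1.0.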
22. Random Forests Training Hyperparameters
-attrs,--attribute_types <arg> Comma separated attribute types (Q for quantitative variable and C for categorical variable. e.g., [Q,C,Q,C])
-depth,--max_depth <arg> The maximum number of the tree depth [default: Integer.MAX_VALUE]
-leafs,--max_leaf_nodes <arg> The maximum number of leaf nodes [default: Integer.MAX_VALUE]
-min_samples_leaf <arg> The minimum number of samples in a leaf node [default: 1]
-rule,--split_rule <arg> Split algorithm [default: GINI, ENTROPY, CLASSIFICATION_ERROR]
-seed <arg> seed value in long [default: -1 (random)]
-splits,--min_split <arg> A node that has greater than or equals to `min_split` examples will split [default: 2]
-stratified,--stratified_sampling Enable Stratified sampling for unbalanced data
-subsample <arg> Sampling rate in range (0.0,1.0]
-trees,--num_trees <arg> The number of trees for each task [default: 50]
-vars,--num_variables <arg> The number of random selected features [default: ceil(sqrt(x[0].length))]. int(num_variables * x[0].length) is considered if num_variable is (0.0,1.0]
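The -vars rule in the help text above packs two cases into one option: a default of ceil(sqrt(number of features)), and a ratio interpretation when the value lies in (0.0,1.0]. The Python sketch below works through that arithmetic. The function name is hypothetical, and treating any value above 1.0 as an absolute count is an assumption, since the help text does not spell that case out.

```python
import math


def num_split_variables(num_features, num_variables=None):
    """Number of randomly selected features, following the -vars help text."""
    if num_variables is None:
        # Default: ceil(sqrt(x[0].length))
        return math.ceil(math.sqrt(num_features))
    if 0.0 < num_variables <= 1.0:
        # Values in (0.0, 1.0] are read as a ratio of the feature count.
        return int(num_variables * num_features)
    # Assumption: larger values are an absolute number of features.
    return int(num_variables)


default_for_10 = num_split_variables(10)     # ceil(sqrt(10)) = ceil(3.16...)
half_of_10 = num_split_variables(10, 0.5)    # int(0.5 * 10)
```

Note that under this reading a value of exactly 1.0 selects all features rather than a single one, because it falls inside the ratio range.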
26. Feature Engineering – Feature Binning
Maps quantitative variables to a fixed number of bins based on quantiles/distribution
Map Ages into 3 bins
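To make the "ages into 3 bins" idea concrete, here is a minimal Python sketch of quantile-based binning. It is not Hivemall's implementation: the helper names, the sample ages, and the simple rank-based choice of cut points are all assumptions made for illustration.

```python
import bisect


def quantile_edges(values, num_bins):
    """Pick num_bins - 1 cut points so each bin holds a similar share of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // num_bins] for k in range(1, num_bins)]


def bin_index(value, edges):
    """Return the 0-based bin for a value; a value equal to an edge goes right."""
    return bisect.bisect_right(edges, value)


# Hypothetical ages; with 9 values and 3 bins the cut points are 31 and 52.
ages = [18, 22, 25, 31, 38, 44, 52, 60, 71]
edges = quantile_edges(ages, 3)
bins = [bin_index(age, edges) for age in ages]
```

Here the nine ages split evenly into bins 0, 1, and 2, so a downstream model sees a three-level categorical feature instead of the raw age.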
33. Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
36. Anomaly/Change-point Detection by ChangeFinder
Efficient algorithm for finding change points and outliers from time-series data
J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
37. Change-point detection by Singular Spectrum Transformation
• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
41. Future work for v0.5.2 and later
✓ Word2Vec support
✓ Multi-class Logistic Regression
✓ Field-aware Factorization Machines
✓ SLIM recommendation
✓ Merge Brickhouse UDFs
✓ XGBoost support
✓ LightGBM support
✓ Gradient Boosting
PR#91, PR#116, PR#58, PR#111, PR#135
44. XGBoost support in Hivemall
Experimental (not yet supported in TD)
SELECT train_xgboost_classifier(features, label) as (model_id, model)
FROM training_data;
SELECT rowid, AVG(predicted) as predicted
FROM (
-- predict with each model
SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
-- join each test record with each model
FROM xgboost_models CROSS JOIN test_data_with_id
) t
GROUP BY rowid;
45. Conclusion and Takeaway
Hivemall is a multi/cross-platform ML library
providing a collection of machine learning algorithms as Hive UDFs/UDTFs
Try out the first Apache release (v0.5.0)!
We welcome your contributions to Apache Hivemall :)
HiveQL SparkSQL/Dataframe API Pig Latin
46. Any feature requests or questions?
BTW, we are hiring!