This document discusses Hivemall, a scalable machine learning library for Apache Hive. It provides concise summaries of machine learning algorithms as user-defined functions that can run on large datasets in Hive. The document outlines the motivation for Hivemall, what algorithms it supports, how to use it to perform tasks like data preparation, training models, and prediction, and how it handles iterations efficiently using map-only shuffling. Experimental results show that Hivemall can improve prediction accuracy compared to non-iterative training while maintaining acceptable performance overhead.
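The model-averaging idea behind this map-only training style can be sketched in plain Python (an illustration, not Hivemall's actual code): each "mapper" fits logistic-regression weights on its own data shard, and the final model is the element-wise average of the per-shard weights, mirroring the emit-then-aggregate pattern Hivemall expresses in HiveQL.

```python
import math

def sgd_logistic(shard, dim, lr=0.1, epochs=20):
    """Train logistic-regression weights on one data shard with plain SGD."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in shard:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))        # sigmoid prediction
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

def train_averaged(shards, dim):
    """'Map-only' training: fit a model per shard, then average the weights."""
    models = [sgd_logistic(shard, dim) for shard in shards]
    return [sum(ws) / len(models) for ws in zip(*models)]
```

Because only fixed-size weight vectors are aggregated, no reducer-side data shuffle proportional to the input is needed.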
Hopsworks in the Cloud (Berlin Buzzwords 2019, Jim Dowling)
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
This presentation focuses on how Flickr recently brought scalable computer vision and deep learning to Flickr's multi-billion collection of photos using Apache Hadoop and Storm.
Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k
From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/
If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs
Hopsworks at Google AI Huddle, Sunnyvale (Jim Dowling)
Hopsworks is a platform for designing and operating end-to-end machine learning pipelines using PySpark and TensorFlow/PyTorch. Early access is now available on GCP. Hopsworks includes the industry's first Feature Store. Hopsworks is open-source.
In this tutorial we will present Koalas, a new open source project that we announced at the Spark + AI Summit in April. Koalas is an open-source Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
We will demonstrate Koalas' new functionalities since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.
What you will learn:
How to get started with Koalas
Easy transition from Pandas to Koalas on Apache Spark
Similarities between Pandas and Koalas APIs for DataFrame transformation and feature engineering
Single machine Pandas vs distributed environment of Koalas
Prerequisites:
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Python 3 and pip pre-installed
pip install koalas from PyPI
Pre-register for Databricks Community Edition
Read koalas docs
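The transition the tutorial describes can be sketched as follows (the function body is plain pandas and runs on a single machine; swapping the import, as shown in the comment, would run the identical calls distributed on Spark, assuming a Spark-backed environment with Koalas installed; the column names are made up for illustration):

```python
import pandas as pd  # swap for: import databricks.koalas as pd (to run on Spark)

def engineer_features(df):
    """Typical feature-engineering steps that look the same in both APIs."""
    out = df.copy()
    out["total"] = out["price"] * out["qty"]                         # derived column
    out["price_z"] = (out["price"] - out["price"].mean()) / out["price"].std()
    return out.groupby("category")["total"].sum()                    # aggregation

df = pd.DataFrame({
    "category": ["a", "a", "b"],
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
})
result = engineer_features(df)
```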
This Hadoop Hive tutorial provides a complete introduction to Hive: Hive architecture, Hive commands, Hive fundamentals, and HiveQL. Fundamental concepts of Big Data and Hadoop are also covered extensively.
By the end, you'll have a solid grasp of Hadoop Hive basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop and targeted at SQL programmers. Hive lets SQL programmers enter the Hadoop ecosystem directly, without any prerequisite knowledge of Java or other programming languages. HiveQL is similar to SQL and is used to run Hadoop MapReduce operations by managing and querying data.
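The MapReduce model that a HiveQL aggregation (e.g. a GROUP BY with COUNT) compiles down to can be sketched in plain Python (illustration only, not Hive code):

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["hive on hadoop", "hive sql"])))
```

Hive's contribution is generating this kind of job automatically from a declarative query, so the programmer never writes the map and reduce functions by hand.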
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live, instructor-led training in Big Data & Hadoop, featuring real-time projects, 24/7 lifetime support, and 100% placement assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Technological Geeks Video 13 :-
Video Link :- https://youtu.be/mfLxxD4vjV0
FB page Link :- https://www.facebook.com/bitwsandeep/
Contents :-
Hive Architecture
Hive Components
Limitations of Hive
Hive data model
Difference with traditional RDBMS
Type system in Hive
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen (Spark Summit)
While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often the process of manually configuring the underlying Spark jobs (including the number and size of the executors) is a significant and time-consuming undertaking. Not only does this configuration process typically rely heavily on repeated trial and error, it also requires that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement and to develop algorithms that can automatically tune Spark jobs with minimal user involvement.
In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.
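The kind of cluster facts such autotuning consumes can be illustrated with a toy heuristic (a sketch only, not Alpine Data's algorithm; the rule-of-thumb constants, such as ~5 cores per executor and one reserved core per node, are assumptions):

```python
def suggest_executors(nodes, cores_per_node, mem_per_node_gb,
                      cores_per_executor=5, reserved_mem_gb=1):
    """Derive a Spark executor configuration from basic cluster facts."""
    usable_cores = cores_per_node - 1                      # leave one core per node for OS/daemons
    execs_per_node = max(1, usable_cores // cores_per_executor)
    mem_per_exec = (mem_per_node_gb - reserved_mem_gb) // execs_per_node
    return {
        "num_executors": nodes * execs_per_node - 1,       # minus one slot for the driver
        "executor_cores": cores_per_executor,
        "executor_memory_gb": int(mem_per_exec),
    }

cfg = suggest_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64)
```

A real autotuner would additionally weigh the input data size, the operations in the flow, and real-time cluster utilization, as the abstract describes.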
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy (Jim Dowling)
Spark AI Summit Europe 2019 talk: Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy. How can you do directed search efficiently with Spark? The answer is Maggy - asynchronous directed search on PySpark.
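The asynchronous-search idea can be sketched in plain Python (illustrative only, not Maggy's API): trials run concurrently, and each result is consumed as soon as it finishes rather than waiting for a whole synchronous batch.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def objective(lr):
    """Stand-in for a training run; returns a validation 'loss' (best at lr = 0.1)."""
    return (lr - 0.1) ** 2

def async_search(n_trials=8, workers=4, seed=0):
    """Random search where results are harvested asynchronously as trials finish."""
    rng = random.Random(seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fut_to_lr = {pool.submit(objective, lr): lr
                     for lr in (rng.uniform(0.0, 1.0) for _ in range(n_trials))}
        best_lr, best_loss = None, float("inf")
        for fut in as_completed(fut_to_lr):     # consume each result as it arrives
            loss = fut.result()
            if loss < best_loss:
                best_loss, best_lr = loss, fut_to_lr[fut]
    return best_lr, best_loss
```

In a directed search, the loop body would also feed each finished result back into the sampler to choose the next trial, which is what "asynchronous directed search" adds over plain random search.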
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster (Milind Bhandarkar)
The refactoring of the Hadoop MapReduce framework, separating resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI that uses YARN as its resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of its supported communication libraries and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics comparing GraphLab with other machine learning frameworks.
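The message-passing model MPI provides can be sketched in plain Python (an illustration with threads and queues, not OpenMPI or Hamster code): each "rank" owns a mailbox and communicates only through explicit send/recv calls.

```python
import threading, queue

class Rank:
    """A stand-in for a process rank that communicates only via send/recv."""
    def __init__(self, rank_id, mailboxes):
        self.id, self.mailboxes = rank_id, mailboxes

    def send(self, dest, msg):
        self.mailboxes[dest].put((self.id, msg))

    def recv(self):
        return self.mailboxes[self.id].get()   # blocks until a message arrives

mailboxes = [queue.Queue() for _ in range(2)]
ranks = [Rank(i, mailboxes) for i in range(2)]

def worker():
    src, value = ranks[1].recv()    # rank 1 waits for a message...
    ranks[1].send(src, value * 2)   # ...and replies with double the value

t = threading.Thread(target=worker)
t.start()
ranks[0].send(1, 21)                # rank 0 sends 21 to rank 1
sender, result = ranks[0].recv()    # rank 0 blocks until the reply arrives
t.join()
```

Real MPI ranks are separate processes, typically on different nodes; what Hamster adds is letting YARN schedule those processes on a Hadoop cluster alongside MapReduce tasks.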
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, to extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows message-passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
Cost-based query optimization in Apache Hive (Julian Hyde)
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
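The essence of cost-based optimization can be sketched with a toy join-order chooser (illustrative plain Python, not Optiq/Calcite code; the cost model and the selectivity constant are assumptions made up for this example):

```python
from itertools import permutations

def join_cost(order, rows, selectivity=0.1):
    """Estimate the total cost of joining tables in the given order.

    Cost of one join = rows fed in from both sides; the estimated output of
    each join becomes the left input of the next one.
    """
    total, acc = 0, rows[order[0]]
    for table in order[1:]:
        total += acc + rows[table]
        acc = max(1, int(acc * rows[table] * selectivity))  # estimated output rows
    return total

def best_join_order(rows):
    """Exhaustively pick the cheapest order (fine for a handful of tables)."""
    return min(permutations(rows), key=lambda order: join_cost(order, rows))

tables = {"orders": 1_000_000, "customers": 10_000, "nations": 25}
plan = best_join_order(tables)
```

A real optimizer like Optiq prunes this search with dynamic programming and uses column statistics rather than a fixed selectivity, but the decision it makes is the same kind: joining the small tables first keeps the large table out of intermediate results.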
PDF, audio, and voiceover are now available on designintechreport.wordpress.com
Today’s most beloved technology products and services balance design and engineering in a way that perfectly blends form and function. Businesses started by designers have created billions of dollars of value, are raising billions in capital, and VC firms increasingly see the importance of design. The third annual Design in Tech Report examines how design trends are revolutionizing the entrepreneurial and corporate ecosystems in tech. This report covers related M&A activity, new patterns in creativity × business, and the rise of computational design.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Building a machine learning service in your business — Eric Chen (Uber) @ PAPIs (PAPIs.io)
When building machine learning applications at Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. Here we present the key components needed to build such a scalable and reliable machine learning service, which serves both our online and offline data processing needs.
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime (Ian Lumb)
Outline:
- The Apache Project's 4-step upgrade process for its Hadoop distro
- Upgrade processes for the Hadoop stack involving Apache Ambari and other management tools
- Bright roles for Hadoop service definition, assignment and composition
- The 1-step, 0-downtime Bright upgrade process for Hadoop distros and the analytics stack
Comparing the Performance of an ETL Pipeline Using Spark and Hive under Azure (Megha Shah)
This presentation aims to compare the performance of ETL pipeline using Spark and Hive under Azure. We will examine the features, strengths, and weaknesses of each tool, and provide recommendations on which one to use based on specific use cases.
Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
Introduction to Designing and Building Big Data Applications (Cloudera, Inc.)
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
Where can I find the best Hadoop training and placement program?
For the best Hadoop training, I would suggest OPTnation, as they will prepare you thoroughly for interviews and help you land the company you are aiming for after completing the training. Their trainers have trained thousands of candidates, from freshers to experienced professionals, and helped them start careers in this booming technology. The course is 100% job-oriented, and they provide 100% placement assistance as well, as many of their students can attest.
Their website also lists thousands of Big Data Hadoop jobs for freshers.
Strata San Jose 2016: Scalable Ensemble Learning with H2O (Sri Ambati)
Erin LeDell's presentation on Scalable Ensemble Learning with H2O at Strata + Hadoop World San Jose, 03.29.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Hadoop Online Training: Kelly Technologies is the best Hadoop online training institute in Bangalore, providing Hadoop online training by real-time faculty in Bangalore.
The real estate market is one of the most competitive in terms of pricing, and as a result prices tend to vary significantly based on a variety of factors. Forecasting property prices is an important module in decision making for both buyers and investors, supporting budget allocation, property-finding strategies, and the design of suitable policies, making it one of the top fields in which to apply machine learning to predict prices with high accuracy.
The literature study provides a clear picture and will benefit future work. The majority of authors have concluded that artificial neural networks are more effective at forecasting, but in the real world there are other algorithms that should also be considered. In order to maximise profits, investors base their judgments on market trends. Developers are interested in future trends because they can help weigh the advantages and downsides and inform new products.
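A minimal sketch of price prediction is ordinary least squares on a single feature, in plain Python (the numbers are made-up illustration data; real models would use many features and a library such as scikit-learn, or the neural networks the text mentions):

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b = num / den
    a = my - b * mx
    return a, b

areas = [50.0, 70.0, 90.0, 110.0]      # floor area, m^2
prices = [150.0, 210.0, 270.0, 330.0]  # price in thousands
a, b = fit_line(areas, prices)
predicted = a + b * 100.0              # predict price of a 100 m^2 property
```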
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We made the first Apache release (v0.5.0-incubating) on March 5, 2018, and the project plans to release v0.5.2 in Q2 2018.
We will first give a quick walk-through of features, usage, what's new in v0.5.0, and future roadmaps of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, covering DataFrame integration and Spark 2.3 support.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Globus Connect Server Deep Dive - GlobusWorld 2024 (Globus)
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... (Globus)
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on demand, capable of applying many data reduction and data analysis operations to the large ESGF data archives and transferring only the resultant analysis (e.g., visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Navigating the Metaverse: A Journey into Virtual Evolution (Donna Lenk)
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
Top Nidhi Software Solution Free Download (vrstrong314)
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Developing Distributed High-performance Computing Capabilities of an Open Sci... (Globus)
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
OpenFOAM solvers for the Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam (takuyayamamoto1800)
In these slides, we show a simulation example and how to compile the solvers.
helmholtzFoam solves the Helmholtz equation; helmholtzBubbleFoam simulates the Helmholtz equation with uniformly dispersed bubbles.
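For reference, the standard (inhomogeneous) Helmholtz equation that such a solver targets, for a scalar field u with wavenumber k and source term f, is:

```latex
\nabla^2 u + k^2 u = f
```

The bubble variant modifies the effective coefficients of this equation to account for a uniformly dispersed bubble phase.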
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... (Juraj Vysvader)
In 2015, I wrote extensions for Joomla, WordPress, phpBB3, and others. I didn't get rich from it, but they had 63K downloads (possibly powering tens of thousands of websites).
Providing Globus Services to Users of JASMIN for Environmental Data Analysis (Globus)
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Enhancing Research Orchestration Capabilities at ORNL (Globus)
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Hivemall talk@Hadoop summit 2014, San Jose
1. Hivemall: Scalable Machine Learning Library for Apache Hive
Makoto YUI
m.yui@aist.go.jp, @myui
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Hadoop Summit 2014, San Jose
2. Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations (comparison with Spark)
• Experimental Evaluation
• Conclusion
3. What is Hivemall
• A collection of machine learning algorithms
implemented as Hive UDFs/UDTFs
• Classification & Regression
• Recommendation
• k-Nearest Neighbor Search
.. and more
• An open source project on Github
• Licensed under LGPL
• github.com/myui/hivemall (bit.ly/hivemall)
• 4 contributors
6. Motivation – Why a new ML framework?
Mahout?
Vowpal Wabbit?
(w/ Hadoop streaming)
Spark MLlib?
0xdata H2O? Cloudera Oryx?
Machine Learning frameworks out there
that run with Hadoop
Quick Poll:
How many people in this room are using them?
7. Motivation – Why a new ML framework?
Existing distributed machine learning frameworks are NOT easy to use.
Framework → User interface
• Mahout → Java API programming
• Spark MLlib/MLI → Scala API programming, Scala shell (REPL)
• H2O → R programming, GUI
• Cloudera Oryx → HTTP REST API programming
• Vowpal Wabbit (w/ Hadoop streaming) → C++ API programming, command line
8. Classification with Mahout
org/apache/mahout/classifier/sgd/TrainNewsGroups.java
Find the complete code at
bit.ly/news20-mahout
9. Why Hivemall
1. Ease of use
• No programming
• Every machine learning step is done within HiveQL
• No compilation/packaging overhead
• Easy for existing Hive users
• You can evaluate Hivemall within 5 minutes or so
• Installation is just a couple of Hive commands (add the jar, then source the function-definition script)
10. Why Hivemall
2. Scalable to data
• Scalable to # of training/testing instances
• Scalable to # of features
• Built-in support for feature hashing
• Scalable to the size of the prediction model
• Suppose there are 200 labels × 100 million features ⇒ requires 150 GB
• Hivemall does not need the prediction model to fit in memory, in either training or prediction
• The feature engineering step is also scalable and parallelized using Hive
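The feature-hashing idea mentioned above can be sketched in plain Python. This is only an illustration of the hashing trick — the function name and the MD5-based hash are assumptions, not Hivemall's actual hashing UDFs:

```python
import hashlib

def hash_feature(feature: str, num_buckets: int = 2 ** 24) -> int:
    """Map an arbitrary feature string to a bounded integer index,
    so the model size is capped regardless of vocabulary size."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Collisions are possible, but with a large bucket count they are rare enough that accuracy is barely affected while memory use stays bounded.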
11. Why Hivemall
3. Scalable to computing resources
• Exploiting the benefits of Hadoop &
Hive
• Provisioning the machine learning
service on Amazon Elastic MapReduce
• Provides an EMR bootstrap for the
automated setup
Find an example on
bit.ly/hivemall-emr
12. Why Hivemall
4. Supports state-of-the-art online learning algorithms (for classification)
• Fewer configuration parameters (no learning rate to tune, unlike SGD)
• CW, AROW [1], and SCW [2] are not yet supported in the other ML frameworks
• Surprisingly fast convergence (a few iterations are enough)
1. Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
2. Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012
13. Why Hivemall
4. Supports state-of-the-art online learning algorithms (for classification)
Classification accuracy on news20.binary (higher is better):
• Perceptron: 0.9460
• Passive-Aggressive (a.k.a. online SVM): 0.9604
• LibLinear: 0.9636
• LibSVM/TinySVM: 0.9643
• Confidence Weighted (CW): 0.9656
• AROW [1]: 0.9660
• SCW [2]: 0.9662
CW variants are very smart online ML algorithms
14. Why CW variants are so good?
Suppose a binary classification setting: classify sentences as positive or negative
→ learn a weight for each word (each word is a feature)
• Positive: "I like this author"
• Negative: "I like this author, but found this book dull"
A naïve update reduces the weights W_like and W_dull at the same rate; CW variants adjust the weights at different rates.
15. Why CW variants are so good?
[Figure: an ordinary online learner adjusts only the weight (e.g., 0.6 → 0.8); CW variants adjust both the weight and a per-weight confidence (covariance) — e.g., at a given confidence the weight is held at 0.5.]
16. Why Hivemall
4. Supports the state-of-the-art online
learning algorithms (for classification)
• Fast convergence properties
• Performs a small update where confidence is high
• Performs a large update where confidence is low (e.g., at the beginning)
• A few iterations are enough
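The confidence-driven update above can be sketched with a simplified AROW-style rule in plain Python. This illustrates the per-weight confidence idea only — it is not Hivemall's implementation, and the function name and regularization parameter r are assumptions:

```python
def arow_style_update(w, sigma, x, y, r=1.0):
    """One online update. w: feature -> weight, sigma: feature -> confidence
    (variance, starting at 1.0), x: feature -> value, y: +1 or -1 label."""
    margin = sum(w.get(f, 0.0) * v for f, v in x.items())
    variance = sum(sigma.get(f, 1.0) * v * v for f, v in x.items())
    loss = max(0.0, 1.0 - y * margin)  # hinge loss
    if loss > 0.0:
        beta = 1.0 / (variance + r)
        alpha = loss * beta
        for f, v in x.items():
            s = sigma.get(f, 1.0)
            # Low-confidence (high-variance) features receive larger updates
            w[f] = w.get(f, 0.0) + alpha * y * s * v
            # Confidence grows (variance shrinks) after each update
            sigma[f] = s - beta * s * s * v * v
    return w, sigma
```

Note how the same observed loss moves a high-variance weight a lot and a low-variance weight only slightly — the behavior the slide describes.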
17. Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations (comparison with Spark)
• Experimental Evaluation
• Conclusion
18. What Hivemall can do
• Classification (both one- and multi-class)
Perceptron
Passive Aggressive (PA)
Confidence Weighted (CW)
Adaptive Regularization of Weight Vectors (AROW)
Soft Confidence Weighted (SCW)
• Regression
Logistic Regression using Stochastic Gradient Descent (SGD)
PA Regression
AROW Regression
• k-Nearest Neighbor & Recommendation
Minhash and b-Bit Minhash (LSH variant)
Brute-force search using similarity measures (cosine similarity)
• Feature engineering
Feature hashing
Feature scaling (normalization, z-score)
19. How to use Hivemall
[Workflow figure: data preparation produces (label, feature vector) training data; the training step builds a prediction model; the prediction step applies the model to test feature vectors to output labels. This slide highlights data preparation.]
20. How to use Hivemall - Data preparation
Create external table e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
21. How to use Hivemall
[Workflow figure: same machine learning pipeline; this slide highlights the feature engineering step.]
22. How to use Hivemall - Feature Engineering
create view e2006tfidf_train_scaled
as
select
rowid,
rescale(target,${min_label},${max_label})
as label,
features
from
e2006tfidf_train;
Applying a Min-Max Feature Normalization
How to use Hivemall - Feature Engineering
Transforming a label value
to a value between 0.0 and 1.0
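The rescale() call above is plain min-max normalization; in Python terms (a sketch of the math, not the UDF itself):

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    return (value - min_value) / (max_value - min_value)
```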
23. How to use Hivemall
[Workflow figure: same machine learning pipeline; this slide highlights the training step.]
24. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression:
• a map-only task learns the prediction model
• map outputs are shuffled to reducers by feature
• reducers perform model averaging in parallel
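The GROUP BY feature with avg(weight) step amounts to per-feature averaging of the partition-local models; a plain-Python sketch (function and variable names are illustrative):

```python
from collections import defaultdict

def average_models(partition_models):
    """partition_models: list of dicts mapping feature -> weight,
    one dict per map task. Returns the per-feature average weight."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partition_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}
```

Each reducer handles a disjoint subset of features, so the averaging itself parallelizes trivially.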
25. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of a Confidence Weighted classifier: voted_avg votes on whether to use the negative or the positive weights for averaging (e.g., +0.7, +0.3, +0.2, -0.1, +0.7 → the positive side wins).
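A plain-Python sketch of a voted_avg-style aggregate. The semantics here are an assumption — the majority sign wins and only weights of that sign are averaged — and may differ from Hivemall's exact definition:

```python
def voted_avg(weights):
    """Average only the weights whose sign won the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0
```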
26. Ensemble learning for stable prediction performance
create table news20mc_ensemble_model1 as
select
label,
cast(feature as int) as feature,
cast(voted_avg(weight) as float) as weight
from
(select
train_multiclass_cw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_arow(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
union all
select
train_multiclass_scw(addBias(features),label)
as (label,feature,weight)
from
news20mc_train_x3
) t
group by label, feature;
Ensemble learning for stable prediction performance
Just stack prediction models
by union all
27. How to use Hivemall
[Workflow figure: same machine learning pipeline; this slide highlights the prediction step.]
28. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by LEFT OUTER JOIN
between test data and prediction model
No need to load the entire model into memory
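The join-based prediction above is equivalent to summing the weights of the features present in a test row and applying the sigmoid; a plain-Python sketch:

```python
import math

def predict_prob(model, features):
    """model: dict feature -> weight; features: feature ids in one test row.
    Features missing from the model contribute 0 (the LEFT OUTER JOIN)."""
    score = sum(model.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid
```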
29. Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations (comparison with Spark)
• Experimental Evaluation
• Conclusion
30. How Hivemall works in the training
Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs)
[Figure: a training table of <label, array<features>> tuples is read by parallel train UDTF instances; their outputs are shuffled by feature (param-mix) and merged into a prediction model, a relation of <feature, weight>.]
Friendly to the Hive relational query engine
• The resulting prediction model is a relation of features and their weights
Embarrassingly parallel
• The number of mappers and reducers is configurable
Bagging-like effect, which helps reduce the variance of each classifier/partition
31. Why not UDAF (as in MADlib)
[Figure: machine learning as an aggregate function — train steps run 4 ops in parallel, pairwise merges run 2 ops in parallel, and the final merge has no parallelism; each merge passes arrays of weight sums and counts, which grow toward the final merge.]
Why not a UDAF (as in MADlib)? Implementing machine learning as an aggregate function creates a bottleneck in the final merge: throughput is limited by the merge fan-out, memory consumption grows, and parallelism decreases as merging proceeds.
32. How to deal with Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the input/output of each MR job goes through HDFS
• Spark avoids this by in-memory computation
[Figure: in MapReduce, every iteration reads its input from HDFS and writes its output back to HDFS; in Spark, the input is read once and subsequent iterations run over in-memory data.]
33. Training with Iterations in Spark
Logistic Regression example from Spark — each node loads the data into memory once, then repeated MapReduce steps perform gradient descent:
val data = spark.textFile(...).map(readPoint).cache()
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without it, more iterations are required).
34. What MLlib actually does
Mini-batch Gradient Descent with sampling (GradientDescent.scala, bit.ly/spark-gd):
val data = ..
for (i <- 1 to numIterations) {
  val sampled = // sample a subset of the data (a partitioned RDD)
  val gradient = // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.
35. How to deal with Iterations in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without running several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
*
FROM (
SELECT
amplify(${xtimes}, *) as (rowid, label, features)
FROM
training
) t
CLUSTER BY RANDOM
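What amplify(xtimes, *) does can be sketched as a generator that emits each training row xtimes (the amplified rows are then redistributed by the CLUSTER BY); a plain-Python illustration, not the UDTF itself:

```python
def amplify(xtimes, rows):
    """Emit each input row xtimes, so a single pass over the amplified
    data approximates multiple training epochs."""
    for row in rows:
        for _ in range(xtimes):
            yield row
```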
36. Map-only shuffling and amplifying
rand_amplify UDTF randomly shuffles the
input rows for each Map task
CREATE VIEW training_x3
as
SELECT
rand_amplify(${xtimes}, ${shufflebuffersize}, *)
as (rowid, label, features)
FROM
training;
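The rand_amplify idea — amplify rows and shuffle them within a bounded buffer inside each map task, so no cluster-wide shuffle is needed — can be sketched in plain Python (the buffer semantics here are an assumption, not the UDTF's exact behavior):

```python
import random

def rand_amplify(xtimes, buf_size, rows, seed=None):
    """Amplify each row xtimes and emit rows in randomized order,
    holding at most roughly buf_size buffered rows at a time."""
    rng = random.Random(seed)
    buf = []
    for row in rows:
        buf.extend([row] * xtimes)
        while len(buf) >= buf_size:
            # Emit a random buffered row: a map-local shuffle
            yield buf.pop(rng.randrange(len(buf)))
    while buf:
        yield buf.pop(rng.randrange(len(buf)))
```

Because the shuffle happens inside each map task, it avoids the expensive cluster-wide shuffle of the CLUSTER BY variant on the previous slide.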
37. Detailed plan w/ map-local shuffle
[Figure: each map task runs a pipeline of Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write; map outputs are shuffled (distributed by feature) to reduce tasks, which run Merge → Aggregate → Reduce write. Scanned entries are amplified and then shuffled.]
Note that this is a pipelined operation: the Rand Amplifier operator is interleaved between the table scan and the training operator.
38. Performance effects of amplifiers
Method → elapsed time (sec), AUC:
• Plain: 89.718 sec, AUC 0.734805
• amplifier + clustered by (a.k.a. global shuffle): 479.855 sec, AUC 0.746214
• rand_amplifier (a.k.a. map-local shuffle): 116.424 sec, AUC 0.743392
With map-local shuffle, prediction accuracy improved with an acceptable overhead.
39. Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations (comparison with Spark)
• Experimental Evaluation
• Conclusion
40. Experimental Evaluation
Compared the performance of our batch learning
scheme to state-of-the-art machine learning
techniques, namely Bismarck and Vowpal Wabbit
• Dataset
KDD Cup 2012, Track 2 (bit.ly/hivemall-kdd-dataset), one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider
• The training data is about 235 million records in 33 GB
• The number of feature dimensions is about 54 million
• Task
Predicting click-through rates of search engine ads
• Experimental Environment
In-house cluster of 33 commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB memory
41. Performance comparison
[Chart: elapsed time (sec) for training, the lower the better — Hivemall 116.4, VW1 596.67, VW32 493.81, Bismarck 755.24]
[Chart: prediction performance (AUC) — Hivemall's AUC is good]
Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB
42. How about Spark 1.0 MLlib
val training = MLUtils.loadLibSVMFile(sc,
"hdfs://host:8020/small/training_libsvmfmt", multiclass = false)
val model = LogisticRegressionWithSGD.train(training, numIterations)
..
How about Spark 1.0 MLlib
Works fine for small data (10k training examples, about 1.5 MB) on 33 nodes, allocating 5 GB of memory to each worker
LoC is small and easy to understand
However, Spark did not work for the large dataset (235 million training examples with 2^24 feature dimensions, about 33 GB)
Further investigation is required
43. Conclusion
Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs
• Easy to use
• Scalable to computing resources
• Runs on Amazon EMR
• Supports state-of-the-art classification algorithms
• Plans to support Shark/Spark SQL
Project site: github.com/myui/hivemall or bit.ly/hivemall
Message of this talk: please evaluate Hivemall by yourself. 5 minutes is enough for a quick start.
Slide available on
bit.ly/hivemall-slide