1. Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks using SQL queries.
2. The document discusses why Hivemall was created, as the creator found existing frameworks like Mahout and Spark MLlib difficult to use for SQL users and not scalable. Hivemall allows machine learning tasks like training, prediction, and feature engineering to be done with SQL queries.
3. The document provides examples of how to use Hivemall for tasks like data preparation, feature engineering, model training using algorithms like logistic regression and confidence weighted classification, and prediction. It also discusses how models can be exported for real-time prediction on databases.
Hivemall dbtechshowcase 20160713 #dbts2016
2. Who am I?
➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems
5. Agenda
  1. What is Hivemall (short intro.)
  2. Why Hivemall (motivations etc.)
  3. How to use Hivemall
6. What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
[Figure: software stack diagram. Layers: Machine Learning, Query Processing, Parallel Data Processing Framework, Resource Management, Distributed File System. Components shown: Hivemall, MLlib, Hive, Pig, SparkSQL, MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark, Apache YARN, MESOS, Hadoop HDFS.]
7. Won IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
(awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
9. List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are good where features are sparse and categorical.
10. List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items.
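To make "recommending top-k items" concrete, here is a minimal plain-Python sketch of a per-group top-k selection. It only illustrates the idea of keeping the k highest-scoring items for each user; it is not Hivemall's each_top_k implementation, and the function name, the tuple layout, and the ratings are made up for illustration.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """rows: iterable of (group, item, score) tuples (hypothetical layout).

    Returns {group: [(rank, score, item), ...]} with the k best items per group.
    """
    by_group = defaultdict(list)
    for group, item, score in rows:
        by_group[group].append((score, item))

    result = {}
    for group, scored in by_group.items():
        best = heapq.nlargest(k, scored)  # highest scores first
        result[group] = [
            (rank, score, item)
            for rank, (score, item) in enumerate(best, start=1)
        ]
    return result


# Made-up predicted ratings: (user, item, score)
ratings = [
    ("u1", "a", 0.9), ("u1", "b", 0.4), ("u1", "c", 0.7),
    ("u2", "a", 0.2), ("u2", "d", 0.8),
]
top2 = top_k_per_group(ratings, 2)
print(top2)
```

In a recommendation setting the score would typically be a predicted rating or a similarity value, and the group would be the user.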
12. Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
• Freakout Inc., Fan Communications, and more
• Replaced Spark MLlib w/ Hivemall at company X
http://www.slideshare.net/masakazusano75/sano-hmm-20150512
13. Industry use cases of Hivemall
➢ Gender prediction of Ad click logs
• Scaleout Inc. and Fan Communications
http://eventdots.jp/eventreport/458208
14. Industry use cases of Hivemall
➢ Value prediction of Real estates
• Livesense
http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
25. Survey on existing ML frameworks
Framework | User interface
Mahout | Java API programming
Spark MLlib/MLI | Scala API programming, Scala Shell (REPL)
H2O | R programming, GUI
Cloudera Oryx | HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
30. How to use Hivemall - Data preparation
Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
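The table definition implies a simple text layout: one record per line, tab-separated fields, and comma-separated array items in the features column. The following plain-Python sketch parses one such line. The sample line is made up, and the "index:value" form of each feature string is an assumption (a common sparse-feature convention), not something this slide states.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features) following the
    tab / comma delimiters declared in the table definition."""
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


def parse_feature(feature):
    """Split an assumed 'index:value' feature string.

    A feature without an explicit value is treated as 1.0 here
    (an assumption made for this sketch).
    """
    name, _, value = feature.partition(":")
    return name, float(value) if value else 1.0


# Made-up sample record: rowid, label, features
row = parse_row("1\t-3.58\t12:0.5,47:0.25\n")
print(row)
print([parse_feature(f) for f in row[2]])
```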
34. How to use Hivemall - Training
Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..)
    as (feature,weight)
  FROM train
) t
GROUP BY feature

• Map-only task to learn a prediction model
• Shuffle map-outputs to reducers by feature
• Reducers perform model averaging in parallel
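The three annotations on this slide describe a train-then-average pattern: each map task emits (feature, weight) pairs for its own partition of the data, the pairs are grouped by feature, and the weights are averaged per feature. The plain-Python sketch below imitates only the grouping and averaging step; the mapper outputs are made-up numbers and no actual training is performed.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average weights per feature across the outputs of several mappers.

    mapper_outputs: list of lists of (feature, weight) pairs, one inner
    list per map task. Mirrors avg(weight) ... GROUP BY feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in mapper_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Made-up partial models learned by two map tasks
mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],
    [("f1", 1.5), ("f3", 0.25)],
]
model = average_models(mapper_outputs)
print(model)
```

A feature seen by only one mapper keeps that mapper's weight, because the average is taken over the rows that exist for the feature.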
35. How to use Hivemall - Training
Training of Confidence Weighted (CW) Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM
  (SELECT
     train_cw(features,label)
       as (feature,weight)
   FROM
     news20b_train
  ) t
GROUP BY feature

Vote to use negative or positive weights for avg
Example weights: +0.7, +0.3, +0.2, -0.1, +0.7
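The voting step can be sketched in plain Python as follows: count positive and negative weights, then average only the weights on the winning side. This is an illustration of the idea described on the slide, not Hivemall's code. How ties and zero weights are handled below is an assumption made for this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote.

    Assumptions of this sketch: zero weights are ignored, a tie goes to
    the positive side, and an empty input yields 0.0.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0


# The example weights from the slide: four positive votes, one negative
slide_weights = [0.7, 0.3, 0.2, -0.1, 0.7]
result = voted_avg(slide_weights)
print(result)  # average of the four positive weights
```

With the slide's values the positive side wins 4 to 1, so the single negative weight is discarded instead of pulling the average down.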
37. How to use Hivemall - Prediction

CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
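As a rough illustration of what this query computes, the plain-Python sketch below looks up each test feature in the model, sums the matched weights per row, and applies the sigmoid. The model weights and test rows are made up. Like the query on this slide, it sums the weights only, which corresponds to features with an implicit value of 1.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature) pairs, one per feature.
    model: dict mapping feature -> weight.

    Features missing from the model contribute nothing, as unmatched rows
    of a LEFT OUTER JOIN are ignored by sum(). Unlike SQL, a row with no
    matched feature yields sigmoid(0) = 0.5 here rather than NULL.
    """
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Made-up model and exploded test data
lr_model = {"a": 2.0, "b": -2.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (2, "zzz")]
probs = predict(testing_exploded, lr_model)
print(probs)
```

The dictionary lookup stands in for the join. In Hive the join lets the model stay in a table instead of being loaded into each task's memory.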
39. Export Prediction Model to an RDBMS
Prediction Model → TD export → Any RDBMS
Periodical export is very easy in Treasure Data
Sample rows of the exported model (feature, weight):
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855
40. Real-time Prediction on MySQL
Online prediction on MySQL
SIGMOID(x) = 1.0 / (1.0 + exp(-x))

SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)

[Figure labels: Prediction Model, Label, Feature Vector]
Index lookups are very efficient in RDBMSs
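The online prediction flow can be tried end to end with a small Python sketch that uses the standard-library sqlite3 module as a stand-in for MySQL (an assumption made so the example is self-contained). The model rows are the four sample weights from the export slide, and the feature vector is made up. The weighted sum is computed by the join in SQL. The sigmoid is applied in Python using the formula on this slide, because a SIGMOID function is not assumed to exist in the database.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key gives the indexed lookup on feature that the slide refers to.
conn.execute(
    "CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)"
)
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [
        (103, -0.4896543622016907),
        (104, -0.0955817922949791),
        (105, 0.12560302019119263),
        (106, 0.09214721620082855),
    ],
)
conn.execute("CREATE TEMP TABLE testing_exploded (feature INTEGER, value REAL)")


def predict(feature_vector):
    """feature_vector: list of (feature, value) pairs for one request."""
    conn.execute("DELETE FROM testing_exploded")
    conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)", feature_vector)
    (total,) = conn.execute(
        "SELECT SUM(t.value * m.weight) "
        "FROM testing_exploded t LEFT OUTER JOIN prediction_model m "
        "ON (t.feature = m.feature)"
    ).fetchone()
    if total is None:  # no feature matched the model
        return None
    return 1.0 / (1.0 + math.exp(-total))  # SIGMOID(x) from the slide


# Made-up request; feature 999 is not in the model and is ignored by SUM
prob = predict([(103, 1.0), (105, 2.0), (999, 1.0)])
print(prob)
```

Each request touches only the model rows for the features it contains, which is why an index on the feature column matters more than the total model size.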