Jubatus is an open source software framework for distributed online machine learning on big data. It performs real-time deep analysis by running online machine learning algorithms in a distributed manner: each node updates its model locally, and the models are periodically mixed together. This allows fast, scalable, and memory-efficient learning on large streaming datasets without storing data or sharing it across nodes.
Jubatus Invited Talk at XLDB Asia
1. Distributed Online Machine Learning Framework for Big Data
Shohei Hido
Preferred Infrastructure, Inc., Japan
XLDB Asia, June 22nd, 2012
2. Preferred Infrastructure (PFI): bringing cutting-edge research advances to products
- Founded: March 2006, located in Tokyo, Japan
- Employees: 28
  - Top university graduates, including ICPC world finalists
  - Mid-career engineers from Sony, IBM, Yahoo!, Sun
- Fields: information retrieval, distributed computing, natural language processing, machine learning
4. Overview: Big Data analytics will go real-time and deeper
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Mix only models
5. Jubatus: OSS platform for Big Data analytics
- Joint development with an NTT laboratory in Japan
- Project started April 2011
- Released as open source software; 0.3.0 just released
- You can download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
6. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
7. Increasing demand in Big Data applications: real-time deeper analysis
- Current focus: aggregation and rule processing on bigger data
  - CEP (Complex Event Processing) for real-time processing
  - Hadoop/MapReduce for distributed computation
- Future: deeper analysis for rapid decisions and actions
  - Ex. 1: Defect detection on the NY power grid [Rudin+, TPAMI 2012]
  - Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
[Figure: systems positioned by data size vs. depth of analysis; CEP and Hadoop cover shallow analysis at small and large data sizes, leaving open what will come for deep analysis on big data]
References:
http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
http://www.computerworlduk.com/news/networking/3302464/
8. Key technology: machine learning
- Examples need rapid decisions under uncertainty
  - Anomaly detection from M2M sensor data
  - Energy demand forecast / smart grid optimization
  - Security monitoring on raw Internet traffic
- What is missing for fast & deep analytics on Big Data?
  - An online/real-time machine learning platform
  - + a scale-out distributed machine learning platform
1. Bigger data
2. More in real-time
3. Deep analysis
9. Online machine learning in Jubatus
- Batch learning
  - Scans all data before building a model
  - Data must be stored in memory or storage
- Online learning
  - The model is updated by each data sample
  - Sometimes with a theoretical guarantee that the online model converges to the batch model
10. Jubatus focuses on the latest online algorithms
- Advantage: fast and not memory-intensive
  - Low latency & high throughput
  - No need to store large datasets
- E.g., linear classification algorithms (note the very recent progress):
  - Perceptron (1958)
  - Passive-Aggressive (PA) (2003)
  - Confidence-Weighted Learning (CW) (2008)
  - AROW (2009)
  - Normal HERD (NHERD) (2010)
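To make the online setting concrete, below is a minimal sketch of one of the listed algorithms, a Passive-Aggressive (PA) update for binary classification. It is a toy NumPy illustration, not Jubatus code; the feature vectors and stream are made up.

```python
import numpy as np

def pa_update(w, x, y):
    """One Passive-Aggressive (PA) step for a binary classifier.
    w: weight vector, x: feature vector, y: label in {-1, +1}."""
    loss = max(0.0, 1.0 - y * w.dot(x))   # hinge loss on this one sample
    if loss > 0.0:
        tau = loss / x.dot(x)             # PA step size (closed form)
        w = w + tau * y * x               # smallest update that fixes x
    return w

# Online loop: each sample updates the model once and is then discarded,
# so nothing needs to be stored -- the property the slide emphasizes.
w = np.zeros(3)
stream = [(np.array([1.0, 0.0, 1.0]), +1),
          (np.array([0.0, 1.0, 0.0]), -1)]
for x, y in stream:
    w = pa_update(w, x, y)
```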
11. Online learning or distributed learning: no unified solution has been available
- Jubatus combines them into a unified computation framework
[Figure: tools arranged by scale (small-scale/stand-alone vs. large-scale/distributed-parallel computing) and learning style (batch vs. real-time/online); SPSS (1988-) and WEKA (1993-) are small-scale batch tools, Mahout (2006-) is large-scale batch, online ML algorithms such as PA [2003] and CW [2008] are small-scale online, and Jubatus (2011-) occupies the large-scale online quadrant]
12. What Jubatus currently supports
- Classification (multi-class)
  - Perceptron / PA / CW / AROW
- Regression
  - PA-based regression
- Nearest neighbor
  - LSH / MinHash / Euclid LSH
- Recommendation
  - Based on nearest neighbor
- Anomaly detection*
  - LOF based on nearest neighbor
- Graph analysis*
  - Shortest path / centrality (PageRank)
- Some simple statistics
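As a flavor of the nearest-neighbor family above, here is a toy MinHash sketch (not the Jubatus implementation): two sets get short signatures, and the fraction of agreeing minima estimates their Jaccard similarity.

```python
def minhash_signature(items, num_hashes=128):
    # For each hash seed, keep the minimum hash value over the set.
    return [min(hash((seed, it)) for it in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of positions where the minima agree estimates
    # the Jaccard similarity of the original sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"item1", "item2", "item3"})
b = minhash_signature({"item2", "item3", "item4"})
print(estimated_jaccard(a, b))  # approximately 2/4 = 0.5
```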
13. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
14. Hadoop and Mahout: not good for online learning
- Hadoop
  - Advantages
    - Many extensions for a variety of applications
    - Good for distributed data storage and aggregation
  - Disadvantage
    - No direct support for machine learning or online processing
- Mahout
  - Advantage
    - Popular machine learning algorithms are implemented
  - Disadvantages
    - Some implementations are less mature
    - Still not capable of online machine learning
15. Jubatus vs. Hadoop, RDB-based systems, and Storm: advantage in online AND distributed ML
- Only Jubatus satisfies both requirements at the same time

                                   Jubatus          Hadoop        RDB            Storm
Storing Big Data                   ✓ (external DB)  ✓✓ (HDFS)     ✓              ✓ (external DB)
Batch learning                     ✓                ✓✓ (Mahout)   ✓ (SPSS, etc.) ✕
Stream processing                  ✓                ✕             ✕              ✓✓
Distributed learning               ✓✓               ✓ (Mahout)    ✕              ✕
Online learning (high importance)  ✓✓               ✕             ✕              ✕
16. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
17. How to make online algorithms distributed? => Not trivial!
[Figure: timelines of batch vs. online learning; batch learning alternates long learning phases with rare model updates, making the update easy to parallelize, while online learning interleaves learning and model updates so frequently that the update is hard to parallelize]
- Online learning requires frequent model updates
- A naive distributed architecture leads to too many synchronization operations
- This causes performance problems in terms of network communication and accuracy
18. Solution: loose model sharing
- Jubatus shares only the local models, in a loose manner
  - Model size << data size
  - Jubatus DOES NOT share datasets
  - A unique approach compared to existing frameworks
- Local models can differ across the servers
  - The different models are gradually merged
[Figure: three servers, each holding its own local model, converge toward a shared mixed model]
19. Three fundamental operations in Jubatus: UPDATE, ANALYZE, and MIX
1. UPDATE
   - Receive a sample, learn from it, and update the local model
2. ANALYZE
   - Receive a sample, apply the local model, and return the result
3. MIX (called automatically in the background)
   - Exchange and merge the local models between servers
   - Cf. the Map-Shuffle-Reduce operations in Hadoop
- Algorithms can be implemented independently of
  - distribution logic
  - data sharing
  - failover
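The sketch below illustrates this separation with a hypothetical toy server: update and analyze touch only local state, while get_diff/put_diff are the only hooks a MIX round needs. It mirrors the description above, not the actual Jubatus internals.

```python
class ToyLinearServer:
    """Toy model of one Jubatus-style server: UPDATE and ANALYZE run
    locally; get_diff/put_diff are the two halves of a MIX round."""

    def __init__(self, dim):
        self.w = [0.0] * dim          # current local model
        self.base = [0.0] * dim       # model as of the last MIX

    def update(self, x, y):
        # UPDATE: learn from one sample (a simple perceptron step).
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        if y * score <= 0:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

    def analyze(self, x):
        # ANALYZE: apply the current model; no learning, no communication.
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return +1 if score >= 0 else -1

    def get_diff(self):
        # MIX, step 1: what this server has learned since the last MIX.
        return [wi - bi for wi, bi in zip(self.w, self.base)]

    def put_diff(self, merged_diff):
        # MIX, step 2: adopt base + merged diff; purely local changes
        # made since the last MIX are discarded.
        self.w = [bi + di for bi, di in zip(self.base, merged_diff)]
        self.base = list(self.w)
```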
20. UPDATE
- Each server starts from an initial model
- Each data sample is sent to one (or two) servers, distributed randomly or consistently
- The local model is updated based on the sample
- Data samples are NEVER shared
[Figure: incoming samples are distributed randomly or consistently across two servers, each updating its own local model starting from the initial model]
21. MIX
- Each server sends its model diff
- The model diffs are merged and distributed
- Only model diffs are transmitted
[Figure: on each server, model diff = local model − initial model; the diffs from all servers are merged, and each server forms its mixed model as initial model + merged diff]
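Continuing the toy sketch from slide 19's notes, a MIX round over several such servers might look as follows. Element-wise averaging is an assumption here; the actual merge rule in Jubatus depends on the algorithm.

```python
def mix(servers):
    # Step 1: collect each server's model diff (only diffs travel).
    diffs = [s.get_diff() for s in servers]
    # Step 2: merge, here by element-wise averaging across servers.
    merged = [sum(col) / len(servers) for col in zip(*diffs)]
    # Step 3: redistribute, so every server restarts from the same
    # mixed model (initial model + merged diff).
    for s in servers:
        s.put_diff(merged)

# Example: two servers learn on disjoint samples, then one MIX round
# leaves both holding an identical mixed model.
s1, s2 = ToyLinearServer(2), ToyLinearServer(2)
s1.update([1.0, 0.0], +1)
s2.update([0.0, 1.0], -1)
mix([s1, s2])
assert s1.w == s2.w
```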
22. UPDATE (iteration)
- After a MIX, the locally updated models are discarded
- Each server resumes updating from the mixed model
- The mixed model improves gradually thanks to all of the servers
[Figure: as in slide 20, samples are distributed randomly or consistently, but each server's local model now starts from the mixed model]
23. ANALYZE
- For prediction, each sample goes to a randomly chosen server
- The server applies its current mixed model to the sample
- The prediction is returned to the client
[Figure: queries are distributed randomly across servers, each holding the mixed model and returning a prediction]
24. Why can Jubatus work in real-time?
- Focus on online machine learning
  - Make online machine learning algorithms distributed
- Update locally
  - Online training without communication with other servers
- Mix only models globally
  - Small communication cost, low latency, good performance
  - An advantage over the costly Shuffle in MapReduce
- Analyze locally
  - Each server has the mixed model
  - Low latency for making predictions
- Everything in memory
  - Process data on the fly
25. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
26. Demo: Twitter analysis using natural language processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into pre-defined categories. A single Jubatus server is enough to classify over 5,000 queries per second, which is close to the rate of the raw Twitter stream. We provide a browser-based GUI.
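For readers who want to try something similar, the snippet below sketches how a tweet classifier could be driven from the Jubatus Python client. This is an assumption-laden sketch: the client API shown postdates the 0.3.0 release described in this talk, and the host, port, service name, and fields are invented for illustration.

```python
# Hypothetical usage of the later Jubatus Python client; names and
# signatures are assumptions, not the 0.3.0 API from this talk.
from jubatus.classifier.client import Classifier
from jubatus.classifier.types import LabeledDatum
from jubatus.common import Datum

client = Classifier("127.0.0.1", 9199, "tweet_demo", 10)  # host, port, name, timeout
# UPDATE: feed labeled tweets to the server.
client.train([LabeledDatum("sports", Datum({"text": "what a great game tonight"}))])
# ANALYZE: classify a new tweet; each result is a list of label/score estimates.
results = client.classify([Datum({"text": "who won the match?"})])
for estimates in results:
    best = max(estimates, key=lambda e: e.score)
    print(best.label, best.score)
```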
27. Experiment: estimation of power consumption
Jubatus learns the relationship between the power usage and the network data flow pattern of certain servers. The power consumption of individual servers can then be estimated in real-time by monitoring and analyzing packets, without having to install power measurement modules on all servers.
[Figure: packets are captured via a network TAP in a data center/office where most servers have no power meter; a scatter plot of predicted vs. actual value (W) shows the estimation quality; consumption differs for different types of packets]
28. Agenda
- What's missing for Big Data analytics
- Comparison with existing software
- Inside Jubatus: Update, Analyze, and Mix
- Jubatus demo
- Summary
29. Summary
- Jubatus is the first OSS platform for online distributed machine learning on Big Data streams
- Download it from http://github.com/jubatus/
- We welcome your contribution and collaboration
1. Bigger data
2. More in real-time
3. Deep analysis
- No storage
- No data sharing
- Mix only models