MapR Architecture Presentation given at Strata + Hadoop World 2013 by MapR CTO & Co-Founder M.C. Srivas
Prior to co-founding MapR, Srivas ran one of the major search infrastructure teams at Google where GFS, BigTable and MapReduce were used extensively. He wanted to provide that powerful capability to everyone, and started MapR on his vision to build the next-generation platform for Big Data. His strategy was to evolve Hadoop and bring simplicity of use, extreme speed and complete reliability to Hadoop users everywhere, and make it seamlessly easy for enterprises to use this powerful new way to get deep insights. Srivas brings to MapR his experiences at Google, Spinnaker Networks, and Transarc in building game-changing products that advance the state of the art.
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications (Jason Shao)
Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and at MusicMatch, (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendations systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, Zookeeper and Hbase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; a MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
Architectural Overview of MapR's Apache Hadoop Distribution (mcsrivas)
Describes the thinking behind MapR's architecture. MapR's Hadoop achieves better reliability on commodity hardware than anything on the planet, including custom, proprietary hardware from other vendors. Apache HDFS and Cassandra replication are also discussed, as are SAN and NAS storage systems like NetApp and EMC.
MapR M7: Providing an enterprise-quality Apache HBase API (mcsrivas)
Provides an overview of M7, the first unified data platform for tables and files. Does a deep dive into the MapR architecture, especially containers, and how M7 tables integrate with the rest of the MapR architecture, including volumes, management, and Hadoop.
Describes some of the problems with Apache HBase, and how M7 from MapR solves many of these issues.
Design, Scale and Performance of MapR's Distribution for Hadoop (mcsrivas)
Details the first-ever exabyte-scale system that can hold a trillion large files. Describes MapR's Distributed NameNode™ architecture and how it scales easily and seamlessly. Shows MapReduce performance across a variety of benchmarks such as DFSIO, PigMix, NNBench, TeraSort, and YCSB.
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl... (Mathieu Dumoulin)
Examines the unique features of the MapR Converged Data Platform and how they can support production-grade enterprise machine learning. Ends with a live demo using H2O. Presented at Hadoop Summit Tokyo 2016.
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J... (Cloudera, Inc.)
Hadoop is a popular framework for Web 2.0 and enterprise businesses that are challenged to store, process, and analyze large amounts of data as part of their business requirements. Hadoop's framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts, and reviews benchmarking analysis and projections for big data growth as related to data center and cluster designs. The session also discusses network architecture tradeoffs and the advantages of close integration between compute and networking.
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mahadev Konar, H... (Cloudera, Inc.)
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. In this session, we will present the architecture and design of the next generation of MapReduce and delve into the details of the architecture that makes it much easier to innovate. The architecture has built-in HA, security, and multi-tenancy to support many users on larger clusters, and it increases innovation, agility, and hardware utilization. We will also present large-scale and small-scale comparisons with MRv1 on some benchmarks.
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN (HBaseCon)
In this talk, you'll learn how Rocket Fuel has developed various HBase access patterns and multi-tenancy scenarios and the role of DeathStar, an in-house solution built on top of Apache Slider and YARN. We'll cover how we use a single YARN cluster to host multiple smaller and highly customized HBase clusters, and how dynamic provisioning and elastic scaling are made possible in this model.
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 (Deanna Kosaraju)
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run in the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters, chosen from the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
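The abstract names no concrete algorithm, so the following is a hypothetical sketch of the core idea it describes: find the job's bottleneck resource from its profile and size the resource set so that the bottleneck is saturated. The per-task demand numbers are made up for illustration.

```java
// Hypothetical sketch only: size a resource set (RS) around the job's
// bottleneck resource, per the provisioning idea in the abstract above.
public class ResourceSetSizer {
    public static void main(String[] args) {
        // Fraction of one RS's CPU, disk, and network that a single task uses
        // (illustrative numbers, not from any real job profile).
        double[] perTaskUse = {0.10 /* CPU */, 0.25 /* disk */, 0.05 /* network */};

        // The bottleneck is the resource with the highest per-task demand.
        double bottleneck = 0;
        for (double u : perTaskUse) bottleneck = Math.max(bottleneck, u);

        // Run as many concurrent tasks as it takes to saturate the bottleneck.
        int tasksPerRS = (int) Math.floor(1.0 / bottleneck);
        System.out.println("bottleneck per-task utilization: " + bottleneck);
        System.out.println("concurrent tasks per resource set: " + tasksPerRS);
    }
}
```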
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera (Cloudera, Inc.)
Performance is a thing you can never have too much of, but performance is a nebulous concept in Hadoop. Unlike databases, Hadoop has no equivalent of TPC, and different use cases experience performance differently. This talk will discuss advances in how Hadoop performance is measured, as well as recent and future performance advances in different areas of the Hadoop stack.
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making it imperative to be skilled as a Hadoop admin for better career, salary, and job opportunities.
Learn how to set up a Hadoop cluster with HDFS high availability here: www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
This presentation will give you information about:
1. Configuring HDFS
2. Interacting With HDFS
3. HDFS Permissions and Security
4. Additional HDFS Tasks
5. HDFS Overview and Architecture
6. HDFS Installation
7. Hadoop File System Shell
8. File System Java API
This talk takes a technological deep dive into MapR M7, including some of the key challenges that were solved during its implementation. MapR's M7 is a clean-room reimplementation of the HBase API, written in C++ and fully integrated into the MapR platform.
In the process of implementing M7, we learned some lessons and solved some interesting challenges. Ted Dunning shares some of these experiences and lessons. Many of them apply across the board to high-performance query systems in general and can be applied much more widely. Some of the resulting techniques have already been adopted by the Apache Drill project, but there are many more places where they can be used.
MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
/* Ted's MapR meeting slides incorporated here */
The strategic relationship between Hortonworks and SAP enables SAP to resell Hortonworks Data Platform (HDP) and provide enterprise support for their global customer base. This means SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA, Sybase and SAP BusinessObjects enabling a broad range of new analytic applications.
Apache Hadoop 3 is coming! As the next major milestone for Hadoop and big data, it attracts everyone's attention, showcasing several bleeding-edge technologies and significant features across all components of Apache Hadoop: erasure coding in HDFS, Docker container support, Apache Slider integration and native service support, Application Timeline Service version 2, Hadoop library updates, client-side classpath isolation, etc. In this talk, we will first update the status of the Hadoop 3.0 release work in the Apache community and the feasible path through alpha and beta towards GA. Then we will dive deep into each new feature, including its development progress and maturity status in Hadoop 3. Last but not least, as a new major release, Hadoop 3.0 will contain some incompatible API or CLI changes, which could make upgrades challenging for downstream projects and existing Hadoop users; we will go through these major changes and explore their impact on other projects and users.
Speaker: Sanjay Radia, Founder and Chief Architect, Hortonworks
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy (Stuart Pook)
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real-time bidding for online advertising.
Two clusters do not necessarily mean a redundant system, so Criteo must plan for any of the disasters that can destroy a cluster.
This talk describes how Criteo built its second cluster in a new datacenter and how to do it better next time. How a small team is able to run and expand these clusters is explained. More importantly the talk describes how a redundant data and compute solution at this scale must function, what Criteo has already done to create this solution and what remains undone.
Redundancy for Big Hadoop Clusters is hard - Stuart Pook (Evention)
Criteo had a Hadoop cluster with 39 PB of raw storage, 13,404 CPUs, 105 TB of RAM, 40 TB of data imported per day, and over 100,000 jobs per day. This cluster was critical for both storage and compute but had no backups. After many efforts to increase our redundancy, we now have two clusters that, combined, have more than 2,000 nodes, 130 PB, two different versions of Hadoop, and 200,000 jobs per day, but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices we made and the issues we solved in creating a 1,200-node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy, and our plans to improve our capacity to handle the loss of a data centre.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL (Arseny Chernov)
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
Introducing Galera Cluster & the Codership Team
Galera Cluster in a nutshell:
True multi-master:
* Read & write to any node
* Synchronous replication
* No slave lag
* No integrity issues
* No master-slave failovers or VIP needed
* Multi-threaded slave, no performance penalty
* Automatic node provisioning
Elastic:
* Easy scale-out & scale-in, all nodes read-write
WANdisco is a provider of non-stop software for global enterprises to meet the challenges of Big Data and distributed software development.
KEY HIGHLIGHTS, Session 1: Tuesday, Feb. 26, 5:15 p.m.-6 p.m.
Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation
Cloud infrastructure is a flexible tool to orchestrate multiple Hadoop and HBase clusters, which provides strict isolation of data and compute resources for multiple customers. Most importantly, our benchmarks show that a virtualized environment allows for higher average utilization of per-node resources. For more session information, visit http://na.apachecon.com/schedule/presentation/131/.
CO-PRESENTERS, Dr. Konstantin V. Shvachko, Chief Architect, Big Data, WANdisco and Jagane Sundar, CTO/VP Engineering, Big Data, WANdisco
A veteran Hadoop developer and respected author, Konstantin Shvachko is a technical expert specializing in efficient data structures and algorithms for large-scale distributed storage systems. Konstantin joined WANdisco through the AltoStor acquisition and before that he was founder and Chief Scientist at AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. Konstantin played a lead architectural role at eBay, building two generations of the organization's Hadoop platform. At Yahoo!, he worked on the Hadoop Distributed File System (HDFS). Konstantin has dozens of publications and presentations to his credit and is currently a member of the Apache Hadoop PMC. Konstantin has a Ph.D. in Computer Science and M.S. in Mathematics from Moscow State University, Russia.
Jagane Sundar has extensive big data, cloud, virtualization, and networking experience and joined WANdisco through its AltoStor acquisition. Before AltoStor, Jagane was founder and CEO of AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. His experience with Hadoop began as Director of Hadoop Performance and Operability at Yahoo! Jagane has such accomplishments to his credit as the creation of Livebackup, development of a user mode TCP Stack for Precision I/O, development of the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun MicroSystems, and more. Jagane received his B.E. in Electronics and Communications Engineering from Anna University.
About WANdisco
WANdisco ( LSE : WAND ) is a provider of enterprise-ready, non-stop software solutions that enable globally distributed organizations to meet today's data challenges of secure storage, scalability and availability. WANdisco's products are differentiated by the company's patented, active-active data replication technology, serving crucial high availability (HA) requirements, including Hadoop Big Data and Application Lifecycle Management (ALM). Fortune Global 1000 companies including AT&T, Motorola, Intel and Halliburton rely on WANdisco for performance, reliability, security and availability. For additional information, please visit www.wandisco.com.
Similar to Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
How Data-Driven Approaches are Changing Your Data Management Strategies
Introducing data-driven strategies into your business model alters the way your organization manages and provides information to your customers, partners and employees. Gone are the days of “waterfall” implementation strategies from relational data to applications within a data center. Now, data-driven business models require agile implementation of applications based on information from all across an organization–on-premises, cloud, and mobile–and includes information from outside corporate walls from partners, third-party vendors, and customers. Data management strategies need to be ready to meet these challenges or your new and disruptive business models will fail at the most critical time: when your customers want to access it.
ML Workshop 2: Machine Learning Model Comparison & Evaluation (MapR Technologies)
How Rendezvous Architecture Improves Evaluation in the Real World
In this edition of our machine learning logistics webinar series, we build on the key requirements for effective management of machine learning logistics presented in the overview webinar and in the Part I workshop. Here we focus on model-to-model comparison and evaluation, the use of decoy models, and more. Listen here: http://info.mapr.com/machine-learning-workshop2.html?_ga=2.35695522.324200644.1511891424-416597139.1465233415
Self-Service Data Science for Leveraging ML & AI on All of Your Data (MapR Technologies)
MapR has launched the MapR Data Science Refinery which leverages a scalable data science notebook with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.
Enabling Real-Time Business with Change Data Capture (MapR Technologies)
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ... (MapR Technologies)
Big data technologies are being applied to a wide variety of use cases. We will review tangible examples of machine learning, discuss an autonomous driving project and illustrate the role of MapR in next generation initiatives. More: http://info.mapr.com/WB_Machine-Learning-for-Chickens_Global_DG_17.11.02_RegistrationPage.html
ML Workshop 1: A New Architecture for Machine Learning Logistics (MapR Technologies)
Having heard the high-level rationale for the rendezvous architecture in the introduction to this series, we will now dig in deeper to talk about how and why the pieces fit together. In terms of components, we will cover why streams work, why they need to be persistent, performant and pervasive in a microservices design and how they provide isolation between components. From there, we will talk about some of the details of the implementation of a rendezvous architecture including discussion of when the architecture is applicable, key components of message content and how failures and upgrades are handled. We will touch on the monitoring requirements for a rendezvous system but will save the analysis of the recorded data for later. Listen to the webinar on demand: https://mapr.com/resources/webinars/machine-learning-workshop-1/
Machine Learning Success: The Key to Easier Model Management (MapR Technologies)
Join Ellen Friedman, co-author (with Ted Dunning) of a new short O’Reilly book Machine Learning Logistics: Model Management in the Real World, to look at what you can do to have effective model management, including the role of stream-first architecture, containers, a microservices approach and a DataOps style of work. Ellen will provide a basic explanation of a new architecture that not only leverages stream transport but also makes use of canary models and decoy models for accurate model evaluation and for efficient and rapid deployment of new models in production.
Data Warehouse Modernization: Accelerating Time-To-Action (MapR Technologies)
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Live Tutorial – Streaming Real-Time Events Using Apache APIs (MapR Technologies)
For this talk we will explore the power of streaming real time events in the context of the IoT and smart cities.
http://info.mapr.com/WB_Streaming-Real-Time-Events_Global_DG_17.08.02_RegistrationPage.html
Bringing Structure, Scalability, and Services to Cloud-Scale Storage (MapR Technologies)
Deploying storage with a forklift is so 1990s, right? Today’s applications and infrastructure demand systems and services that scale. Customers require performance and capacity to fit the use case and workloads, not the other way around. Architects need multi-temperature, multi-location, highly available, and compliance friendly platforms that grow with the generational shift in data growth and utility.
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a service. Though originally used within the telecommunications industry, it has become common practice for banks, ISPs, insurance firms, and other verticals. More: http://info.mapr.com/WB_PredictingChurn_Global_DG_17.06.15_RegistrationPage.html
The prediction process is data-driven and often uses advanced machine learning techniques. In this webinar, we'll look at customer data, do some preliminary analysis, and generate churn prediction models – all with Spark machine learning (ML) and a Zeppelin notebook.
The goal of Spark's ML library is to make machine learning scalable and easy. Zeppelin with Spark provides a web-based notebook that enables interactive machine learning and visualization.
In this tutorial, we'll do the following:
Review classification and decision trees
Use Spark DataFrames with Spark ML pipelines
Predict customer churn with Apache Spark ML decision trees
Use Zeppelin to run Spark commands and visualize the results
An Introduction to the MapR Converged Data Platform (MapR Technologies)
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon... (MapR Technologies)
IT budgets are shrinking, and the move to next-generation technologies is upon us. The cloud is an option for nearly every company, but just because it is an option doesn’t mean it is always the right solution for every problem.
Most cloud providers would prefer that every customer be tightly coupled with their proprietary services and APIs to create lock-in with that cloud provider. The savvy customer will leverage the cloud as infrastructure and stay loosely bound to a cloud provider. This creates an opportunity for the customer to execute a multicloud strategy or even a hybrid on-premises and cloud solution.
Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations. Along the way, Jim discusses security, backups, event streaming, databases, replication, and snapshots across a variety of use cases that run most businesses today.
Is your organization at the analytics crossroads? Have you made strides collecting and sharing massive amounts of data from electronic health records, insurance claims, and health information exchanges but found these efforts made little impact on efficiency, patient outcomes, or costs?
Changes in how business is done combined with multiple technology drivers make geo-distributed data increasingly important for enterprises. These changes are causing serious disruption across a wide range of industries, including healthcare, manufacturing, automotive, telecommunications, and entertainment. Technical challenges arise with these disruptions, but the good news is there are now innovative solutions to address these problems. http://info.mapr.com/WB_Geo-distributed-Big-Data-and-Analytics_Global_DG_17.05.16_RegistrationPage.html
MapR announced a few new releases in 2017, and we want to go over those exciting new products and features that are available now. We’d like to invite our customers and partners to this webinar in which members of the MapR product team will share details about the latest updates.
3 Benefits of Multi-Temperature Data Management for Data Analytics (MapR Technologies)
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments (MapR Technologies)
SAP HANA is an increasingly popular platform for various analytical and transactional use cases with its in-memory architecture. If you’re an SAP customer you’ve experienced the benefits.
However, the underlying storage for SAP HANA is painfully expensive. This slows down your ability to grow your SAP HANA footprint and serve up more applications.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation because your data platform and storage choices are about to undergo a re-platforming that happens once in 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
The new frontiers of AI in RPA with UiPath Autopilot™ (UiPathCommunity)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates artificial intelligence into the development and use of automations.
📕 Together we will walk through some examples of using Autopilot in different tools of the UiPath suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Enhancing Performance with Globus and the Science DMZ (Globus)
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance
1. Shift into High Gear: Dramatically Improve Hadoop and NoSQL Performance
M. C. Srivas, CTO/Founder
2. MapR History
[Timeline slide, '06-'13, with company logos: the MapReduce paper is published; a team struggling with Nutch copies the MR paper and calls it Hadoop; major companies adopt Hadoop; the DOW drops from 14,300 to 6,800; MapR is founded ('09); MapR partners with … ('11); MapR M7 launches, ultra-reliable and the fastest NoSQL on the planet, breaking all world records ('12-'13); a 2500-node cluster becomes the largest non-Web installation.]
3. MapR Distribution for Apache Hadoop
[Architecture slide: applications (tools, Web 2.0, enterprise apps, analytics, operations, vertical apps, OLTP) sit on a NoSQL data platform supporting batch, interactive, and real-time workloads, with 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy, and management; built by hardening, testing, integrating, innovating, and contributing as part of the Apache open source community.]
4. MapR Distribution for Apache Hadoop
100% Apache Hadoop, with significant enterprise-grade enhancements:
- Comprehensive management
- Industry-standard interfaces
- Higher performance
5. The Cloud Leaders Pick MapR
- Amazon EMR is the largest Hadoop provider in revenue and number of clusters
- Google chose MapR to provide Hadoop on Google Compute Engine
- MapR's partnership with Canonical and Mirantis provides advantages for OpenStack deployments
7. How to Make a Cluster Reliable?
1. Make the storage reliable
   - Recover from disk and node failures
2. Make services reliable
   - Services need to checkpoint their state rapidly
   - Restart a failed service, possibly on another node
   - Move the check-pointed state to the restarted service, using (1) above
3. Do it fast
   - Instant-on … (1) and (2) must happen very, very fast
   - Without maintenance windows
   - No compactions (e.g., Cassandra, Apache HBase)
   - No "anti-entropy" that periodically wipes out the cluster (e.g., Cassandra)
8. Reliability with Commodity Hardware
- No NVRAM
- Cannot assume special connectivity
- Cannot even assume more than 1 drive per node, so no RAID possible
- Use replication, but …
  - No separate data paths for "online" vs. replica traffic
  - Cannot assume peers have equal drive sizes (drive on the first machine is 10x larger than the drive on the other?)
- No choice but to replicate for reliability
9. Reliability via Replication
Replication is easy, right? All we have to do is send the same bits to the master and the replica.
[Diagrams: normal replication, where clients write to the primary server and the primary forwards to the replica, vs. Cassandra-style replication, where clients write to the primary and the replica directly.]
10. But Crashes Occur…
When the replica comes back, it is stale:
- It must be brought up-to-date
- Until then, the data is exposed to failure
Who re-syncs? [Diagrams: the primary re-syncs the replica, vs. the replica remains stale until an "anti-entropy" process is kicked off by the administrator.]
11. Unless it’s HDFS …
HDFS solves the problem a third way:
- Makes everything read-only: static data is trivial to re-sync
- Single writer, no reads allowed while writing
- File close is the transaction that allows readers to see data
  - Unclosed files are lost
  - Cannot write any further to a closed file
- Realtime is not possible with HDFS
  - To make data visible, the file must be closed immediately after writing
  - Too many files is a serious problem with HDFS (a well-documented limitation)
- HDFS therefore cannot do NFS, ever: there is no "close" in NFS, so data can be lost at any time
12. This is the 21st Century…
To support normal apps, we need full read/write support. So let's return to the issue: re-syncing the replica when it comes back. Who re-syncs? [Same diagrams as before: the primary re-syncs the replica, vs. the replica remains stale until an "anti-entropy" process is kicked off by the administrator.]
13. How Long to Re-sync?
- 24 TB / server
  - @ 1000 MB/s = 7 hours
  - In practical terms, @ 200 MB/s = 35 hours
- Did you say you want to do this online?
  - Throttle the re-sync rate to 1/10th: 350 hours to re-sync (= 15 days)
- What is your Mean Time To Data Loss (MTTDL)?
  - How long before a double disk failure? A triple disk failure?
14. Traditional Solutions
Use dual-ported disk to side-step the problem:
- Servers use NVRAM
- RAID-6 with idle spares
[Diagram: clients connect to a primary server and a replica that share a dual-ported disk array.]
Instead of commodity hardware and large-scale clustering: large purchase contracts and a 5-year spare-parts plan.
16. What MapR Does
- Chop the data on each node into 1000s of pieces
  - Not millions of pieces, only 1000s
  - The pieces are called containers
- Spread replicas of each container across the cluster
18. MapR Replication Example
- 100-node cluster
- Each node holds 1/100th of every node's data
- When a server dies, the entire cluster re-syncs the dead node's data
19. MapR Re-sync Speed
- 99 nodes re-sync'ing in parallel
  - 99x the number of drives
  - 99x the number of Ethernet ports
  - 99x the CPUs
- Each is re-sync'ing 1/100th of the data
- Net speed-up is about 100x: 350 hours vs. 3.5
- MTTDL is 100x better
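A one-line check of the claim, using the slide's own figures: the throttled serial re-sync from the earlier slide, split across the surviving nodes.

```java
// Parallel re-sync across the 99 surviving nodes of a 100-node cluster.
public class ParallelResync {
    public static void main(String[] args) {
        double serialThrottledHours = 350; // from the earlier slide
        int survivingPeers = 99;           // each re-syncs 1/100th of the data
        System.out.printf("parallel re-sync: ~%.1f hours%n",
                serialThrottledHours / survivingPeers); // ~3.5 hours
    }
}
```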
20. MapR Reliability
- Mean Time To Data Loss (MTTDL) is far better
  - Improves as cluster size increases
- Does not require idle spare drives
  - The rest of the cluster has sufficient spare capacity to absorb one node's data
  - On a 100-node cluster, 1 node's data == 1% of cluster capacity
- Utilizes all resources
  - No wasted "master-slave" nodes
  - No wasted idle spare drives … all spindles put to use
- Better reliability with fewer resources, on commodity hardware!
22. MapR's Read-write Replication
- Writes are synchronous
  - Visible immediately
- Data is replicated in a "chain" fashion
  - Utilizes the full-duplex network
- Meta-data is replicated in a "star" manner
  - Better response time
[Diagrams: client1, client2, …, clientN writing through chain replication for data and star replication for meta-data.]
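A toy bandwidth model (not MapR's implementation) of why chaining suits bulk data on full-duplex links: every node in the chain forwards while it receives, so client throughput stays at line rate however many replicas there are, while a star divides the primary's uplink among the copies it sends. Link speed and replica count below are made up.

```java
// Toy comparison of chain vs. star replication throughput.
public class ChainVsStar {
    public static void main(String[] args) {
        double linkMBs = 100.0;       // per-direction bandwidth of one NIC
        int replicas = 3;             // primary plus two copies
        int copies = replicas - 1;

        double chain = linkMBs;           // each hop forwards in parallel
        double star  = linkMBs / copies;  // primary fans out to every copy

        System.out.printf("chain: %.0f MB/s, star: %.0f MB/s%n", chain, star);
    }
}
```

Tiny meta-data updates cut the other way: a star is one hop from primary to each replica, while a chain serializes the hops, which is presumably why the slide pairs star replication with better response time.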
23. Container Balancing
- Servers keep a bunch of containers "ready to go"
- Writes get distributed around the cluster
- As data size increases, writes spread more (like dropping a pebble in a pond); larger pebbles spread the ripples farther
- Space is balanced by moving idle containers
24. MapR Container Re-sync
- MapR is 100% random write: it uses distributed transactions
- On a complete crash, all replicas diverge from each other
- On recovery, which one should be master?
25. MapR Container Re-sync
- MapR can detect exactly where replicas diverged
  - Even at a 2000 MB/s update rate
- Re-sync means
  - Roll back each replica to the divergence point
  - Roll forward to converge with the chosen master (the new master after the crash)
- Done while online, with very little impact on normal operations
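The roll-back/roll-forward idea can be sketched on toy transaction logs; this is a minimal illustration of the scheme the slide describes, not MapR's actual code or data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Toy re-sync of a diverged replica: find the last transaction both logs
// share, roll the replica back to it, then roll forward from the master.
public class ContainerResync {
    public static void main(String[] args) {
        List<String> master  = List.of("t1", "t2", "t3", "t5", "t6");
        List<String> replica = List.of("t1", "t2", "t3", "t4");

        // 1. Locate the divergence point.
        int common = 0;
        while (common < master.size() && common < replica.size()
                && master.get(common).equals(replica.get(common))) {
            common++;
        }

        // 2. Roll the replica back to the divergence point...
        List<String> resynced = new ArrayList<>(replica.subList(0, common));
        // 3. ...and roll forward with the chosen master's transactions.
        resynced.addAll(master.subList(common, master.size()));

        System.out.println("rolled back: " + replica.subList(common, replica.size()));
        System.out.println("converged:   " + resynced);
    }
}
```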
26. MapR Does Automatic Re-sync Throttling
Re-sync traffic is "secondary"
– Each node continuously measures RTT to all its peers
– More throttle is applied to slower peers (sketched below)
– An idle system runs at full speed
– All automatic
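A minimal sketch (our illustration, not MapR's algorithm) of RTT-proportional throttling: re-sync traffic backs off to peers whose measured RTT is above the idle baseline, and runs at full speed when the link is idle.

class ResyncThrottle {
    private final double baselineRttMicros;   // RTT measured on an idle link
    private final double maxBytesPerSec;      // full-speed re-sync rate

    ResyncThrottle(double baselineRttMicros, double maxBytesPerSec) {
        this.baselineRttMicros = baselineRttMicros;
        this.maxBytesPerSec = maxBytesPerSec;
    }

    // Higher RTT => peer (or network) is loaded => send re-sync data slower.
    double allowedRate(double currentRttMicros) {
        double load = Math.max(1.0, currentRttMicros / baselineRttMicros);
        return maxBytesPerSec / load;         // idle peer gets the full rate
    }
}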
28. MapR's Containers
Files/directories are sharded and placed into containers
Each container contains
– Directories & files
– Data blocks
Containers are 16-32 GB segments of disk, placed on nodes
– Replicated on servers
– Automatically managed
29. MapR’s Architectural Params
Four units of granularity, on a log scale from KB (10^3) through MB (10^6) to GB (10^9) and beyond:
– Unit of I/O: 4K/8K (8K in MapR)
– Unit of chunking (a map-reduce split): 10-100's of megabytes; the HDFS 'block'
– Unit of re-sync (a replica): 10-100's of gigabytes; a container in MapR, automatically managed
– Unit of administration (snap, repl, mirror, quota, backup): 1 gigabyte - 1000's of terabytes; a volume in MapR, answering "what data is affected by my missing blocks?"
30. Where & How Does MapR Exploit This
Unique Advantage?
31. MapR's No NameNode HATM Architecture
HDFS Federation vs. MapR (distributed metadata)
[Diagram: HDFS Federation partitions the namespace (files A-F) across several NameNodes backed by a NAS appliance, above a layer of DataNodes; MapR spreads the same files and their metadata across all nodes.]
MapR:
• HA w/ automatic failover
• Instant cluster restart
• Up to 1T files (> 5000x advantage)
• 10-20x higher performance
• 100% commodity hardware
HDFS Federation:
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
32. Relative Performance and Scale
Benchmark: 100-byte file creates
Hardware: 10 nodes, each 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM disks, 1 GigE NIC

                     MapR       Other distribution   Advantage
Rate (creates/s)     14-16K     335-360              40x
Scale (files)        6B         1.3M                 4615x

[Charts: file creates/s vs. number of files (millions). The MapR curve holds ~14-16K creates/s out to 6,000M (6B) files; the other distribution stays under 400 creates/s and stops scaling around 1.3M files.]
33. Where & How Does MapR Exploit This
Unique Advantage?
34. MapR’s NFS Allows Direct Deposit
[Diagram: web servers, database servers, application servers, … depositing data directly into the cluster over NFS]
– Random read/write
– Compression
– Distributed HA
– Connectors not needed
– No extra scripts or clusters to deploy and maintain
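A minimal sketch of what "direct deposit" means in practice: because the cluster is exported over NFS, any ordinary program can write with plain file I/O, no Hadoop client or connector. The /mapr/my.cluster.com mount point and log path below are assumptions for illustration.

import java.io.FileWriter;

public class DirectDeposit {
    public static void main(String[] args) throws Exception {
        // Ordinary buffered append to an NFS-mounted cluster path.
        try (FileWriter w = new FileWriter("/mapr/my.cluster.com/logs/app.log", true)) {
            w.write("2013-10-28 12:00:00 request served\n");
        }
    }
}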
35. Where & How Does MapR Exploit This
Unique Advantage?
36. MapR Volumes & Snapshots
Example volume layout: /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson
Volumes dramatically simplify the management of Big Data:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
100K volumes are OK - create as many as desired!
37. Where & How Does MapR Exploit This
Unique Advantage?
38. MapR M7 Tables
Binary compatible with Apache HBase
– no recompilation needed to access M7 tables
– just set CLASSPATH
M7 tables accessed via pathname (see the sketch below)
– openTable("hello") … uses HBase
– openTable("/hello") … uses M7
– openTable("/user/srivas/hello") … uses M7
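The deck's openTable(...) is shorthand; with the stock HBase 0.94 client API the same idea looks like this (table names taken from the slide, configuration assumed to point at the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class OpenTableDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable hbase = new HTable(conf, "hello");               // no leading "/": Apache HBase
        HTable m7    = new HTable(conf, "/user/srivas/hello");  // pathname: an M7 table
        // ... the same Get/Put/Scan API works against either table ...
        hbase.close();
        m7.close();
    }
}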
39. M7 Tables
M7 tables are integrated into storage
– always available on every node, zero admin
Unlimited number of tables
– Apache HBase is typically 10-20 tables (max 100)
No compactions
– update in place
Instant-on
– zero recovery time
5-10x better performance
Consistent low latency
– at the 95th and 99th percentiles
40. YCSB 50-50 Mix (throughput)
[Charts: throughput (ops/sec) and latency (µsec) for MapR M7 vs. other HBase 0.94.9 on a YCSB 50-50 read/update mix]
42. Where & How Does MapR Exploit This
Unique Advantage?
43. MapR Makes Hadoop Truly HA
ALL Hadoop components have High Availability
– e.g. YARN
The ApplicationMaster (the old JT) and the TaskTracker record their state in MapR
– On node failure, the AM recovers its state from MapR
– All jobs resume from where they were
– Works even if the entire cluster is restarted
– Only from MapR
Allows preemption
– MapR can preempt any job without losing its progress
– The ExpressLane™ feature in MapR exploits this
44. MapR Does MapReduce (Fast!)
TeraSort Record: 1 TB in 54 seconds (1003 nodes)
MinuteSort Record: 1.5 TB in 59 seconds (2103 nodes)
45. MapR Does MapReduce (Faster!)
TeraSort Record: 1 TB in 54 seconds (1003 nodes)
MinuteSort Record: improved to 1.65 TB on 300 nodes, from 1.5 TB in 59 seconds on 2103 nodes
46. YOUR CODE
Can Exploit MapR’s Unique Advantages
ALL your code can easily be scale-out HA (see the sketch after this list)
– Save service state in MapR
– Save data in MapR
– Use ZooKeeper to notice service failure
– Restart anywhere; data + state will move there automatically
That's what we did!
Only from MapR: HA for Impala, Hive, Oozie, Storm, MySQL, SOLR/Lucene, HCatalog, Kafka, …
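A minimal sketch of that recipe (all paths, hosts and znode names are hypothetical): keep service state on the cluster, where it is reachable from every node (e.g. via the NFS mount), and advertise liveness with a ZooKeeper ephemeral znode. If the node dies, the znode vanishes, a watcher elsewhere notices, and the service restarts on another node where the same state file is already visible.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class HaService {
    public static void main(String[] args) throws Exception {
        // Service state lives in the cluster, visible from every node.
        Path state = Paths.get("/mapr/my.cluster.com/services/myservice/state");

        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });
        // Ephemeral znode: removed automatically when this session dies,
        // which is how peers notice the failure and take over.
        zk.create("/services/myservice/leader",
                  "node-17".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Recover prior state (if any) and resume from there.
        String lastCheckpoint = Files.exists(state)
                ? new String(Files.readAllBytes(state), StandardCharsets.UTF_8)
                : "";
        System.out.println("resuming from: " + lastCheckpoint);

        // ... serve requests, periodically checkpointing back to `state` ...
        Files.write(state, "checkpoint-42".getBytes(StandardCharsets.UTF_8));
    }
}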
47. MapR: Unlimited Hadoop
Reliable Compute + Dependable Storage
– Build the cluster brick by brick, one node at a time
– Use commodity hardware at rock-bottom prices
– Get enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
– Export via NFS, ODBC, Hadoop and other standard protocols
Editor's Notes
MapR provides a complete platform for Apache Hadoop. MapR has integrated, tested and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. All of these packages with MapR updates are available on GitHub. We have invested more than two years of well-funded effort in deep architectural improvements to create the next-generation distribution for Hadoop. This is in stark contrast with the alternative distributions from Cloudera, HortonWorks and Apache, which are all essentially equivalent.
The NameNode today in Hadoop is a single point of failure, a scalability limitation and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data-loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the most files you can support is 200M at the maximum, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
Another major advantage with MapR is the distributed NameNode. The NameNode today in Hadoop is a single point of failure, a scalability limitation and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data-loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the most files you can support is between 70-100M. 50% of the Hadoop processing at Facebook is to pack and unpack files to try to work around this limitation. MapR scales uniformly.
With MapR’s Direct Access NFS we use an industry-standard protocol to load data into the cluster. There are no clients that need to be deployed, and no special agents. Is writing over NFS highly available? We have that. Do you need to compress it? We do it automatically. We support random read/write.
Another aspect of ease of use is the support for volumes. Only MapR provides the logical volume construct to group content. Policies can be used to enforce user access, admin permissions, data-protection policies, etc. Data placement can also be controlled - say certain nodes have SSD drives. Other Hadoop distros require you to set permissions on individual files or across the entire cluster. MapR's Hadoop distribution includes support for volumes - a unique idea that dramatically simplifies the management of your big data. Volumes allow you to efficiently define data-management policies. With other Hadoop distributions you either set policies for the entire cluster or for individual files. Setting policies by file or directory is too granular and hierarchical - there are just too many files. On the other hand, cluster-wide policies are not granular enough. A volume is a logical construct that lets you collect and refer to data files regardless of where each file is physically located, and with no concern for the relationship between the data files.