The document provides an introduction to Apache Drill, an open source SQL query engine for analysis of large-scale datasets across Hadoop, NoSQL and cloud storage systems. It discusses Tomer Shiran's role in Apache Drill, provides an agenda for the talk, describes the need for interactive analysis of big data and how existing solutions are limited. It then outlines Apache Drill's architecture, key features like full SQL support, optional schemas and support for nested data formats.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
Learning Objectives - In this module, you will understand what is Big Data, What are the limitations of the existing solutions for Big Data problem; How Hadoop solves the Big Data problem, What are the common Hadoop ecosystem components, Hadoop Architecture, HDFS and Map Reduce Framework, and Anatomy of File Write and Read.
This presentation discusses the follow topics
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of 'Hadoop‘
Hadoop Analytics Tools
Apache HBase™ is the Hadoop database, a distributed, salable, big data store.Its a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
This presentation about Hadoop will help you learn the basics of Hadoop and its components. First, you will see what is Big Data and the significant challenges in it. Then, you will understand how Hadoop solved those challenges. You will have a glance at the History of Hadoop, what is Hadoop, the different companies using Hadoop, the applications of Hadoop in different companies, etc. Finally, you will learn the three essential components of Hadoop – HDFS, MapReduce, and YARN, along with their architecture. Now, let us get started with Introduction to Hadoop.
Below topics are explained in this Hadoop presentation:
1. Big Data and its challenges
2. Hadoop as a solution
3. History of Hadoop
4. What is Hadoop
5. Applications of Hadoop
6. Components of Hadoop
7. Hadoop Distributed File System
8. Hadoop MapReduce
9. Hadoop YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/introduction-to-big-data-and-hadoop-certification-training.
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
Accelerating query processing with materialized views in Apache HiveDataWorks Summit
Over the last few years, the Apache Hive community has been working on advancements to enable a full new range of use cases for the project, moving from its batch processing roots towards a SQL interactive query answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit new exciting Hive features such as LLAP acceleration. Then the optimizer relies in Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, join, and aggregation operations. We shall describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of the materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the usage of materialized views can bring to the execution of Hive workloads.
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
Learn how Drill achieves high performance with flexibility and ease of use. Includes: First read planning and statistics. Flexible code generation depending on workload. Code optimization and planning techniques. Dynamic schema subsets. Advanced memory use and moving between Java and C. Making a static typing appear dynamic through any-time and multi-phase planning.
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
Learning Objectives - In this module, you will understand what is Big Data, What are the limitations of the existing solutions for Big Data problem; How Hadoop solves the Big Data problem, What are the common Hadoop ecosystem components, Hadoop Architecture, HDFS and Map Reduce Framework, and Anatomy of File Write and Read.
This presentation discusses the follow topics
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of 'Hadoop‘
Hadoop Analytics Tools
Apache HBase™ is the Hadoop database, a distributed, salable, big data store.Its a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
This presentation about Hadoop will help you learn the basics of Hadoop and its components. First, you will see what is Big Data and the significant challenges in it. Then, you will understand how Hadoop solved those challenges. You will have a glance at the History of Hadoop, what is Hadoop, the different companies using Hadoop, the applications of Hadoop in different companies, etc. Finally, you will learn the three essential components of Hadoop – HDFS, MapReduce, and YARN, along with their architecture. Now, let us get started with Introduction to Hadoop.
Below topics are explained in this Hadoop presentation:
1. Big Data and its challenges
2. Hadoop as a solution
3. History of Hadoop
4. What is Hadoop
5. Applications of Hadoop
6. Components of Hadoop
7. Hadoop Distributed File System
8. Hadoop MapReduce
9. Hadoop YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/introduction-to-big-data-and-hadoop-certification-training.
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
Accelerating query processing with materialized views in Apache HiveDataWorks Summit
Over the last few years, the Apache Hive community has been working on advancements to enable a full new range of use cases for the project, moving from its batch processing roots towards a SQL interactive query answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit new exciting Hive features such as LLAP acceleration. Then the optimizer relies in Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, join, and aggregation operations. We shall describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of the materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the usage of materialized views can bring to the execution of Hive workloads.
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
Learn how Drill achieves high performance with flexibility and ease of use. Includes: First read planning and statistics. Flexible code generation depending on workload. Code optimization and planning techniques. Dynamic schema subsets. Advanced memory use and moving between Java and C. Making a static typing appear dynamic through any-time and multi-phase planning.
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?inovex GmbH
Nachdem in den letzten Jahren NoSQL ein beherrschendes Thema im Kontext von Big Data war, gewinnt SQL als Anfragesprache wieder große Bedeutung im Hadoop-Umfeld. Dabei steht mit Hive ein SQL-Dialekt zur Verfügung, mit dem zwar leicht Batch-orientierte ETL-Strecken für Hadoop gebaut werden können, der aber bisher für interaktive Analysen nicht geeignet war Mit Impala, Presto, Spark SQL und dem Stinger-Projekt ändert sich das nun rapide. Schnelle verteilte Query Engines erlauben interaktive analytische Anfragen auf großen Datenmengen. Dazu kommen neue Speicherformate wie Parquet und ORC, die effizientere Repräsentation und schnelleren Zugriff versprechen. In dieser Session gebe ich einen Überblick über Stärken und Schwächen der verschiedenen Ansätze und Erfahrungen aus dem praktischen Einsatz.
Study after study show that data scientists spend 50-90 percent of their time gathering and preparing data. In many large organizations this problem is exacerbated by data being stored on a variety of systems, with different structures and architectures. Apache Drill is a relatively new tool which can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in-place using standard ANSI SQL without having to define complex schemata, or having to rebuild their entire data infrastructure. In this talk I will introduce the audience to Apache Drill—to include some hands-on exercises—and present a case study of how Drill can be used to query a variety of data sources. The presentation will cover:
* How to explore and merge data sets in different formats
* Using Drill to interact with other platforms such as Python and others
* Exploring data stored on different machines
Join our experts Neeraja Rentachintala, Sr. Director of Product Management and Aman Sinha, Lead Software Engineer and host Sameer Nori in a discussion about putting Apache Drill into production.
Data virtualization, Data Federation & IaaS with Jboss TeiidAnil Allewar
Enterprise have always grappled with the problem of information silos that needed to be merged using multiple data warehouses(DWs) and business intelligence(BI) tools so that enterprises could mine this disparate data for businessdecisions and strategy. Traditionally this data integration was done with ETL by consolidating multiple DBMS into a single data storage facility.
Data virtualization enables abstraction, transformation, federation, and delivery of data taken from variety of heterogeneous data sources as if it is a single virtual data source without the need to physically copy the data for integration. It allows consuming applications or users to access data from these various sources via a request to a single access point and delivers information-as-a-service (IaaS).
In this presentation, we will explore what data virtualization is and how it differs from the traditional data integration architecture. We’ll also look at validating the data virtualization and federation concepts by working through an example(see videos at the GitHub repo) to federate data across 2 heterogeneous data sources; mySQL and MongoDB using the JBoss Teiid data virtualization platform.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
Big Data in die Cloud auslagern? Warum und wenn ja, bei welchem Provider? Anhand von vier Beispielen können Sie eine geeignete Lösung finden. Verglichen werden AWS, Google Cloud, IBM Bluemix und Microsoft Azure
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
In this hands-on Apache Flink presentation, you will learn in a step-by-step tutorial style about:
• How to setup and configure your Apache Flink environment: Local/VM image (on a single machine), cluster (standalone), YARN, cloud (Google Compute Engine, Amazon EMR, ... )?
• How to get familiar with Flink tools (Command-Line Interface, Web Client, JobManager Web Interface, Interactive Scala Shell, Zeppelin notebook)?
• How to run some Apache Flink example programs?
• How to get familiar with Flink's APIs and libraries?
• How to write your Apache Flink code in the IDE (IntelliJ IDEA or Eclipse)?
• How to test and debug your Apache Flink code?
• How to deploy your Apache Flink code in local, in a cluster or in the cloud?
• How to tune your Apache Flink application (CPU, Memory, I/O)?
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases: Real-Time stream processing, machine learning at scale, graph analytics and batch processing.
In these slides, you will find answers to the following questions: What is Apache Flink stack and how it fits into the Big Data ecosystem? How Apache Flink integrates with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where to learn more about Apache Flink?
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
Data Virtualization means Real-time Data Access and Integration. But why do I need it? This presentation tries to answer it in a simple yet clear way.
By Alberto Pan, CTO of Denodo, and Justo Hidalgo, VP Product Management.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
Project Panthera is an open source effort that showcases better data analytics capabilities on Hadoop/HBase (e.g., better integration with existing infrastructure using SQL, better query processing on HBase, and efficiently utilizing new HW platform technologies). In this talk, we will discusses two new capabilities that we are currently working on under Project Panthera: (1) a SQL Engine for MapReduce (built on top of Hive) that supports common SQL constructs used in analytic queries, including some important features (e.g., sub-query in WHERE clauses, multiple-table SELECT statement, etc.) that are not supported in Hive today; (2) a Document-Oriented Store on HBase for better Hive/SQL query processing, which brings up-to 3x reduction in table storage and up-to 1.8x speedup in query processing.
Presenter: Jason Dai, Principal Engineer, Intel Software and Services Group
How Data-Driven Approaches are Changing Your Data Management Strategies
Introducing data-driven strategies into your business model alters the way your organization manages and provides information to your customers, partners and employees. Gone are the days of “waterfall” implementation strategies from relational data to applications within a data center. Now, data-driven business models require agile implementation of applications based on information from all across an organization–on-premises, cloud, and mobile–and includes information from outside corporate walls from partners, third-party vendors, and customers. Data management strategies need to be ready to meet these challenges or your new and disruptive business models will fail at the most critical time: when your customers want to access it.
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
How Rendezvous Architecture Improves Evaluation in the Real World
In this addition of our machine learning logistics webinar series we build on the ideas of the key requirements for effective management of machine learning logistics presented in the Overview webinar and in Part I Workshop. Here we focus on model-to-model comparison & evaluation, use of decoy models and more. Listen here: http://info.mapr.com/machine-learning-workshop2.html?_ga=2.35695522.324200644.1511891424-416597139.1465233415
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
MapR has launched the MapR Data Science Refinery which leverages a scalable data science notebook with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.
Enabling Real-Time Business with Change Data CaptureMapR Technologies
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
Big data technologies are being applied to a wide variety of use cases. We will review tangible examples of machine learning, discuss an autonomous driving project and illustrate the role of MapR in next generation initiatives. More: http://info.mapr.com/WB_Machine-Learning-for-Chickens_Global_DG_17.11.02_RegistrationPage.html
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
Having heard the high-level rationale for the rendezvous architecture in the introduction to this series, we will now dig in deeper to talk about how and why the pieces fit together. In terms of components, we will cover why streams work, why they need to be persistent, performant and pervasive in a microservices design and how they provide isolation between components. From there, we will talk about some of the details of the implementation of a rendezvous architecture including discussion of when the architecture is applicable, key components of message content and how failures and upgrades are handled. We will touch on the monitoring requirements for a rendezvous system but will save the analysis of the recorded data for later. Listen to the webinar on demand: https://mapr.com/resources/webinars/machine-learning-workshop-1/
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
Join Ellen Friedman, co-author (with Ted Dunning) of a new short O’Reilly book Machine Learning Logistics: Model Management in the Real World, to look at what you can do to have effective model management, including the role of stream-first architecture, containers, a microservices approach and a DataOps style of work. Ellen will provide a basic explanation of a new architecture that not only leverages stream transport but also makes use of canary models and decoy models for accurate model evaluation and for efficient and rapid deployment of new models in production.
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
For this talk we will explore the power of streaming real time events in the context of the IoT and smart cities.
http://info.mapr.com/WB_Streaming-Real-Time-Events_Global_DG_17.08.02_RegistrationPage.html
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
Deploying storage with a forklift is so 1990s, right? Today’s applications and infrastructure demand systems and services that scale. Customers require performance and capacity to fit the use case and workloads, not the other way around. Architects need multi-temperature, multi-location, highly available, and compliance friendly platforms that grow with the generational shift in data growth and utility.
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a service. Though originally used within the telecommunications industry, it has become common practice for banks, ISPs, insurance firms, and other verticals. More: http://info.mapr.com/WB_PredictingChurn_Global_DG_17.06.15_RegistrationPage.html
The prediction process is data-driven and often uses advanced machine learning techniques. In this webinar, we'll look at customer data, do some preliminary analysis, and generate churn prediction models – all with Spark machine learning (ML) and a Zeppelin notebook.
Spark’s ML library goal is to make machine learning scalable and easy. Zeppelin with Spark provides a web-based notebook that enables interactive machine learning and visualization.
In this tutorial, we'll do the following:
Review classification and decision trees
Use Spark DataFrames with Spark ML pipelines
Predict customer churn with Apache Spark ML decision trees
Use Zeppelin to run Spark commands and visualize the results
An Introduction to the MapR Converged Data PlatformMapR Technologies
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
IT budgets are shrinking, and the move to next-generation technologies is upon us. The cloud is an option for nearly every company, but just because it is an option doesn’t mean it is always the right solution for every problem.
Most cloud providers would prefer that every customer be tightly coupled with their proprietary services and APIs to create lock-in with that cloud provider. The savvy customer will leverage the cloud as infrastructure and stay loosely bound to a cloud provider. This creates an opportunity for the customer to execute a multicloud strategy or even a hybrid on-premises and cloud solution.
Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations. Along the way, Jim discusses security, backups, event streaming, databases, replication, and snapshots across a variety of use cases that run most businesses today.
Is your organization at the analytics crossroads? Have you made strides collecting and sharing massive amounts of data from electronic health records, insurance claims, and health information exchanges but found these efforts made little impact on efficiency, patient outcomes, or costs?
Changes in how business is done combined with multiple technology drivers make geo-distributed data increasingly important for enterprises. These changes are causing serious disruption across a wide range of industries, including healthcare, manufacturing, automotive, telecommunications, and entertainment. Technical challenges arise with these disruptions, but the good news is there are now innovative solutions to address these problems. http://info.mapr.com/WB_Geo-distributed-Big-Data-and-Analytics_Global_DG_17.05.16_RegistrationPage.html
MapR announced a few new releases in 2017, and we want to go over those exciting new products and features that are available now. We’d like to invite our customers and partners to this webinar in which members of the MapR product team will share details about the latest updates.
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
SAP HANA is an increasingly popular platform for various analytical and transactional use cases with its in-memory architecture. If you’re an SAP customer you’ve experienced the benefits.
However, the underlying storage for SAP HANA is painfully expensive. This slows down your ability to grow your SAP HANA footprint and serve up more applications.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation because your data platform and storage choices are about to undergo a re-platforming that happens once in 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
2. Who
Am
I?
• Tomer
Shiran
• tshiran@maprtech.com
• Founding
Member
and
CommiGer,
Apache
Drill
• Director
of
Product
Management,
MapR
Technologies
3. Agenda
• Apache
Drill
overview
• Key
features
• Status
and
progress
• How
to
get
involved
4. Big
Data
Workloads
• ETL
• Data
mining
• Index
and
model
genera)on
• Clustering,
anomaly
detec)on
and
classifica)on
• Blob
store
• Lightweight
OLTP
on
large
datasets
• Web
crawling
• Stream
processing
• Interac)ve
analysis
5. Interac)ve
Queries
and
Hadoop
Exis)ng
Solu)ons
Compile
SQL
(HiveQL)
to
Export
MapReduce
results
to
MapReduce
RDBMS
and
query
RDBMS
Emerging
Technologies
Stinger/Tez Impala
Real-‐)me
PostgreSQL-‐based
PostgreSQL-‐based
Hive
performance
Real-‐)me
interac)ve
analysis
Hadoop
analy)cs
Hadoop
analy)cs
improvements
interac)ve
analysis
HAWQ Phoenix
Cascading Lingual
PostgreSQL-‐based
Compile
ANSI
SQL
to
Hadoop
analy)cs
SQL
layer
for
HBase
SQL
layer
for
HBase
MapReduce
6. Example
Problem
• Jane
works
as
an
Transac.on
analyst
at
an
e-‐ informa.on
commerce
company
• How
does
she
figure
User
out
good
targe)ng
profiles
segments
for
the
next
marke)ng
campaign?
• She
has
some
ideas
Access
and
lots
of
data
logs
7. Solving
the
Problem
with
Tradi)onal
Systems
• Use
an
RDBMS
– ETL
the
data
from
MongoDB
and
Hadoop
into
the
RDBMS
• MongoDB
data
must
be
flaGened,
schema)zed,
filtered
and
aggregated
• Hadoop
data
must
be
filtered
and
aggregated
– Query
the
data
using
any
SQL-‐based
tool
• Use
MapReduce
– ETL
the
data
from
Oracle
and
MongoDB
into
Hadoop
– Work
with
the
MapReduce
team
to
generate
the
desired
analyses
• Use
Hive
– ETL
the
data
from
Oracle
and
MongoDB
into
Hadoop
• MongoDB
data
must
be
flaGened
and
schema)zed
– But
HiveQL
is
limited,
queries
take
too
long
and
BI
tool
support
is
limited
8. WWGD
Distributed
Interac.ve
Batch
NoSQL
File
System
analysis
processing
GFS
BigTable
Dremel
MapReduce
Hadoop
HDFS
HBase
???
MapReduce
Build
Apache
Drill
to
provide
a
true
open
source
solu)on
to
interac)ve
analysis
of
Big
Data
9. Apache
Drill
Overview
• Interac)ve
analysis
of
Big
Data
using
standard
SQL
• Fast
– Low
latency
– Columnar
execu)on
• Inspired
by
Google
Dremel/BigQuery
Interac)ve
queries
– Complement
na)ve
interfaces
and
MapReduce/ Apache
Drill
Data
analyst
Hive/Pig
Repor)ng
100
ms-‐20
min
• Open
– Community
driven
open
source
project
– Under
Apache
Socware
Founda)on
• Modern
– Standard
ANSI
SQL:2003
(select/into)
Data
mining
– Nested/hierarchical
data
support
MapReduce
Modeling
Hive
– Schema
is
op)onal
Pig
Large
ETL
20
min-‐20
hr
– Supports
RDBMS,
Hadoop
and
NoSQL
10. How
Does
It
Work?
• Drillbits
run
on
each
node,
designed
to
maximize
data
locality
• Processing
is
done
outside
MapReduce
SELECT
*
FROM
paradigm
(YARN
is
supported)
oracle.transac)ons,
• Queries
can
be
fed
to
any
Drillbit
mongo.users,
hdfs.events
• Coordina)on,
query
planning,
op)miza)on,
LIMIT
1
scheduling,
and
execu)on
are
distributed
11. Key
Features
• Full
SQL
(ANSI
SQL:2003)
• Nested
data
• Schema
is
op)onal
• Flexible
and
extensible
architecture
12. Full
SQL
(ANSI
SQL:2003)
• Drill
supports
standard
ANSI
SQL:2003
– Correlated
subqueries,
analy)c
func)ons,
…
– SQL-‐like
is
not
enough
• Use
any
SQL-‐based
tool
with
Apache
Drill
– Tableau,
Microstrategy,
Excel,
SAP
Crystal
Reports,
Toad,
SQuirreL,
…
– Standard
ODBC
and
JDBC
drivers
Client
Tableau
Drillbit
MicroStrategy
Drill%ODBC% SQL%Query% Query%
Driver Drillbits
Driver Parser Planner Drill%Worker
Drill%Worker
Excel
SAP%Crystal%
Reports
13. Nested
Data
JSON
• Nested
data
is
becoming
prevalent
{
"name":
"Homer",
– JSON,
BSON,
XML,
Protocol
Buffers,
Avro,
…
"gender":
"Male",
"followers":
100
– The
data
source
may
or
may
not
be
aware
children:
[
• MongoDB
supports
nested
data
na)vely
{name:
"Bart"},
{name:
"Lisa”}
• A
single
HBase
value
could
be
a
JSON
document
]
(compound
nested
type)
}
– Google
Dremel’s
innova)on
was
efficient
columnar
storage
and
querying
of
nested
data
• FlaGening
nested
data
is
error-‐prone
and
ocen
Avro
enum
Gender
{
impossible
MALE,
FEMALE
– Think
about
repeated
and
op)onal
fields
at
every
}
level…
record
User
{
• Apache
Drill
supports
nested
data
string
name;
Gender
gender;
– Extensions
to
ANSI
SQL:2003
long
followers;
}
14. Schema
is
Op)onal
• Many
data
sources
do
not
have
rigid
schemas
– Schemas
change
rapidly
– Each
record
may
have
a
different
schema
• Sparse
and
wide
rows
in
HBase
and
Cassandra,
MongoDB
• Apache
Drill
supports
querying
against
unknown
schemas
– Query
any
HBase,
Cassandra
or
MongoDB
table
• User
can
define
the
schema
or
let
the
system
discover
it
automa)cally
– System
of
record
may
already
have
schema
informa)on
• Why
manage
it
in
a
separate
system?
– No
need
to
manage
schema
evolu)on
Row
Key
CF
contents
CF
anchor
"com.cnn.www"
contents:html
=
"<html>…"
anchor:my.look.ca
=
"CNN.com"
anchor:cnnsi.com
=
"CNN"
"com.foxnews.www"
contents:html
=
"<html>…"
anchor:en.wikipedia.org
=
"Fox
News"
…
…
…
15. Flexible
and
Extensible
Architecture
• Apache
Drill
is
designed
for
extensibility
– Well-‐documented
APIs
and
interfaces
• Data
sources
and
file
formats
– Implement
a
custom
scanner
to
support
a
new
data
source
or
file
format
• Query
languages
– SQL:2003
is
the
primary
language
– Implement
a
custom
Parser
to
support
a
Domain
Specific
Language
– UDFs
and
UDTFs
• Op)mizers
– Drill
will
have
a
cost-‐based
op)mizer
– Clear
surrounding
APIs
support
easy
op)mizer
explora)on
• Operators
– Custom
operators
can
be
implemented
• Special
operators
for
Mahout
(k-‐means)
being
designed
– Operator
push-‐down
to
data
source
(RDBMS)
16. What
About
Other
SQL-‐on-‐Hadoop
Systems?
• Strengths
– Code
is
more
mature
than
Apache
Drill
• Already
in
beta
(~2
quarters
ahead)
– Faster
than
Hive
on
some
queries
• Weaknesses
– Proprietary
or
semi-‐open
source
– Query
results
must
fit
in
memory
(no
spooling)
– Early
row
materializa)on
(no
columnar
execu)on)
– Some
are
SQL-‐like
(not
SQL)
– No
support
for
nested
data
– Rigid
schema
is
required
– Limited
flexibility
and
extensibility
– Only
support
Hadoop
and
HBase
(and
no
other
NoSQL
or
RDBMS)
17. Status:
In
Progress
• Heavy
ac)ve
development
– 6-‐7
companies
are
contribu)ng
• Available
– Logical
plan
syntax
and
interpreter
– Reference
execu)on
engine
• In
progress
– SQL
interpreter
– Storage
engine
implementa)ons
for
Accumulo,
Cassandra,
HBase
and
various
file
formats
• Significant
community
momentum
– Over
250
people
on
the
Drill
mailing
list
– Over
250
members
of
the
Bay
Area
Drill
User
Group
– Drill
meetups
across
the
US
and
Europe
– OpenDremel
team
joined
Apache
Drill
• An)cipated
schedule:
– Prototype:
Q1
– Alpha:
Q2
– Beta:
Q3
18. Why
Apache
Drill
Will
Be
Successful
Resources
Community
Architecture
• Contributors
have
strong
• Development
done
in
the
• Full
SQL
backgrounds
from
open
• New
data
support
companies
like
Oracle,
• Ac)ve
contributors
from
• Extensible
APIs
IBM
Netezza,
Informa)ca,
mul)ple
companies
• Full
Columnar
Execu)on
Clustrix
and
Pentaho
• Rapidly
growing
• Beyond
Hadoop
19. Interested
in
Apache
Drill?
• Many
op)ons
to
contribute
– Become
a
full
)me
Drill
engineer
@
MapR
• Email
tshiran@maprtech.com
– Join
the
Drill
mailing
list
and
start
contribu)ng
• JIRAs,
code,
unit
tests,
documenta)on,
…
– Shoot
me
an
email
and
we
can
discuss
• Email
tshiran@maprtech.com
21. Why
Not
Leverage
MapReduce?
• Scheduling
Model
– Coarse
resource
model
reduces
hardware
u)liza)on
– Acquisi)on
of
resources
typically
takes
100’s
of
millis
to
seconds
• Barriers
– Map
comple)on
required
before
shuffle/reduce
commencement
– All
maps
must
complete
before
reduce
can
start
– In
chained
jobs,
one
job
must
finish
en)rely
before
the
next
one
can
start
• Persistence
and
Recoverability
– Data
is
persisted
to
disk
between
each
barrier
– Serializa)on
and
deserializa)on
are
required
between
execu)on
phase