This document discusses using Red Hat JBoss Middleware and Hortonworks to enable a modern data architecture. It provides an overview of Red Hat and JBoss Middleware and describes three use cases: 1) combining data from Hadoop with traditional sources using data virtualization, 2) federating across geographically distributed Hadoop clusters with data security, and 3) creating virtual data marts for a Hadoop data lake.
Neustar is a fast-growing provider of enterprise services in telecommunications, online advertising, Internet infrastructure, and advanced technology. Neustar engaged Think Big Analytics to leverage Hadoop to expand its data analysis capacity. This session describes how Hadoop has expanded Neustar's data warehouse capacity, increased agility for data analysis, reduced costs, and enabled new data products. We look at the challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products that integrate multiple big data sets.
Hadoop Reporting and Analysis - Jaspersoft (Hortonworks)
Hadoop is deployed for a wide variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, and social media monitoring.
Introduction to Microsoft HDInsight and BI Tools (DataWorks Summit)
This document discusses Hortonworks Data Platform (HDP) for Windows. It includes an agenda for the presentation which covers an introduction to HDP for Windows, integrating HDP with Microsoft tools, and a demo. The document lists the speakers and provides information on Windows support for Hadoop components. It describes what is included in HDP for Windows, such as deployment choices and full interoperability across platforms. Integration with Microsoft tools like SQL Server, Excel, and Power BI is highlighted. A demo of using Excel to interact with HDP is promised.
Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol (HARMAN Services)
This document provides an introduction to Microsoft Azure HDInsight, including:
- An overview of HDInsight and how it is Microsoft's Hadoop distribution running in the cloud based on Hortonworks Data Platform.
- The architecture of HDInsight and how it is tightly integrated with Microsoft's technology stack.
- Examples of use cases for HDInsight like iterative data exploration, data warehousing on demand, and ETL automation.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storage systems, and approaches to analyzing it. In particular, it covers the MapReduce debates and hybrid systems that combine RDBMSs with MapReduce. It also explains various schema-free, non-relational data stores.
20100806 cloudera 10 hadoopable problems webinar (Cloudera, Inc.)
Jeff Hammerbacher introduced 10 common problems that are suitable for solving with Hadoop. These include modeling true risk, customer churn analysis, recommendation engines, ad targeting, point of sale transaction analysis, analyzing network data to predict failures, threat analysis, trade surveillance, search quality, and using Hadoop as a data sandbox. Many of these problems involve analyzing large and complex datasets from multiple sources to discover patterns and relationships.
Richard McDougall discusses trends in big data and frameworks for building big data applications. He outlines the growth of data, how big data is driving real-world benefits, and early adopter industries. McDougall also summarizes batch processing frameworks like Hadoop and Spark, graph processing frameworks like Pregel, and real-time processing frameworks like Storm. Finally, he discusses interactive processing frameworks such as Hive, Impala, and Shark and how to unify the big data platform using virtualization.
1) Big data is growing exponentially and new frameworks like Hadoop are needed to analyze large, unstructured datasets.
2) Hadoop uses distributed computing and storage across commodity servers to provide scalable and cost-effective analytics. It leverages local disks on each node for temporary data to improve performance.
3) Virtualizing Hadoop simplifies operations, enables mixed workloads, and provides high availability through features like vMotion and HA. It also allows for elastic scaling of compute and storage resources.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc... (Cloudera, Inc.)
This talk will cover what tools and techniques work and don’t work well for data scientists working on Hadoop today and how to leverage the lessons learned by the experts to increase your productivity as well as what to expect for the future of data science on Hadoop. We will leverage insights derived from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo.
This document discusses real-time big data applications and provides a reference architecture for search, discovery, and analytics. It describes combining analytical and operational workloads using a unified data model and operational database. Examples are given of organizations using this approach for real-time search, analytics and continuous adaptation of large and diverse datasets.
Supporting Financial Services with a More Flexible Approach to Big Data (WANdisco Plc)
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme... (VMworld)
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document provides an overview of big data analytics using Hadoop. It discusses how Hadoop allows for distributed processing of large datasets across computer clusters. The key components of Hadoop discussed are HDFS for storage, and MapReduce for parallel processing. HDFS provides a distributed, fault-tolerant file system where data is replicated across multiple nodes. MapReduce allows users to write parallel jobs that process large amounts of data in parallel on a Hadoop cluster. Examples of how companies use Hadoop for applications like customer analytics and log file analysis are also provided.
Big Data comes from a variety of sources, as human activities online generate vast amounts of data every day through intentional, accidental, and unknown means. This includes activity on social media, sensors, logs, and more. Content delivery networks (CDNs) can help distribute big data by caching content on servers located closer to users. While pushing content to CDNs offloads work from origin servers and improves performance, it also segments users and requires replication strategies to maintain consistency. Techniques include pre-computing static content from dynamic sources, pushing searches and other functions to CDNs, and experimenting with different cache models. Overall, CDNs can be an effective way to distribute big data, but they also introduce more complexity and dependence on the CDN.
Architecting Virtualized Infrastructure for Big Data (Richard McDougall)
This document discusses architecting virtualized infrastructure for big data. It notes that data is growing exponentially and that the value of data now exceeds hardware costs. It advocates using virtualization to simplify and optimize big data infrastructure, enabling flexible provisioning of workloads like Hadoop, SQL, and NoSQL clusters on a unified analytics cloud platform. This platform leverages both shared and local storage to optimize performance while reducing costs.
IBM Big Data: the marriage of Hadoop and data warehousing (DataWorks Summit)
This document discusses IBM's Big Data platform and the marriage of Hadoop and data warehousing. It covers how Big Data is driving new use cases across enterprises due to the 3Vs of volume, velocity and variety. It also discusses how Hadoop and data warehousing complement each other by providing massively parallel processing for analytics on all types of data at scale. The emergence of the Hadoop data warehouse is examined as the next generation Big Data platform that can provide timely insights from both structured and unstructured data.
Thousands of unsecured Hadoop clusters have been targets of attacks where criminals have deleted databases and files. According to reports, over 5,000 Hadoop installations were accessible on port 50070 without authentication, allowing attackers to destroy data nodes and snapshots containing terabytes of data within seconds. A study found nearly 4,500 servers with the Hadoop Distributed File System exposed over 5 petabytes of data. Many of these unsecured systems have likely already been compromised by attackers destroying data.
Infrastructure Considerations for Analytical Workloads (Cognizant)
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
The document discusses big data and MapReduce frameworks like Hadoop. It provides an overview of MapReduce and how it allows distributed processing of large datasets using simple map and reduce functions. The document also covers several common design patterns for MapReduce jobs, including filtering, sorting, joins, and computing statistics.
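As a concrete instance of the filtering pattern mentioned there, here is a minimal Hadoop mapper sketch in Java. The class name and the "ERROR" predicate are invented for illustration; such a job would typically run with zero reduce tasks so that matching records pass straight to output.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filtering pattern: emit only the records that satisfy a predicate.
// Hypothetical example: keep log lines containing "ERROR".
public class ErrorLineFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (line.toString().contains("ERROR")) {
            // Pass the matching record through unchanged; no aggregation,
            // so the job can set the number of reduce tasks to zero.
            context.write(line, NullWritable.get());
        }
    }
}
```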
The document discusses the rise of Big Data as a Service (BDaaS) and how recent technological advancements have enabled its emergence. It provides a brief history of Hadoop and how improvements in networking, storage, virtualization and containers have addressed earlier limitations. It defines BDaaS and describes the public cloud and on-premises deployment models. Finally, it highlights how BlueData's software platform can deliver an integrated BDaaS solution both on-premises and across multiple public clouds including AWS.
Evolution of Big Data at Intel - Crawl, Walk and Run Approach (DataWorks Summit)
Intel's big data journey began in 2011 with an evaluation of Hadoop. Since then, Intel has expanded its use of Hadoop and Cloudera across multiple environments. Intel's 3-year roadmap focuses on evolving its Hadoop platform to support more advanced analytics, real-time capabilities, and integrating with traditional BI tools. Key strategies include designing for scalability, following an iterative approach to understand data, and leveraging open source technologies.
VMware Serengeti - Based on Infochimps Ironfan (Jim Kaskade)
This document discusses virtualizing Hadoop for the enterprise. It begins with discussing trends driving changes in enterprise IT like cloud, mobile apps, and big data. It then discusses how Hadoop can address big, fast, and flexible data needs. The rest of the document discusses how virtualizing Hadoop through solutions like Project Serengeti can provide enterprises with elasticity, high availability, and operational simplicity for their Hadoop implementations. It also discusses how virtualization allows enterprises to integrate Hadoop with other workloads and data platforms.
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment. It allows for the ability to store and use all types of data at scale using commodity hardware.
Red Hat's document discusses using JBoss Data Virtualization to gain better insights from big data. It describes challenges with existing data integration approaches as data sources grow in size, type and location. Red Hat's big data strategy is to reduce the information gap by making all data easily consumable for analytics. JBoss Data Virtualization software virtually unifies data across sources and exposes it to applications through standard interfaces. The demonstration shows integrating social media sentiment data from Hadoop with sales data from MySQL to analyze movie ticket and merchandise sales.
Big data insights with Red Hat JBoss Data Virtualization (Kenneth Peeples)
You’re hearing a lot about big data these days. And big data and the technologies that store and process it, like Hadoop, aren’t just new data silos. You might be looking to integrate big data with existing enterprise information systems to gain better understanding of your business. You want to take informed action.
During this session, we’ll demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data. You’ll learn how Red Hat JBoss Data Virtualization:
Can help you integrate your existing and growing data infrastructure.
Integrates big data with your existing enterprise data infrastructure.
Lets non-technical users access big data result sets.
We’ll also provide typical use cases and examples, and a demonstration of integrating Hadoop sentiment analysis with sales data.
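As a hedged illustration of that demonstration (not the actual demo code), the sketch below issues one federated ANSI SQL query through a JBoss Data Virtualization server using the JDBC driver of Teiid, its upstream project. The VDB name, host, credentials, and schema/table names are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SentimentVsSalesSketch {
    public static void main(String[] args) throws Exception {
        // Teiid-style JDBC URL; "MoviesVDB", host, port, and credentials are
        // placeholders. The Teiid JDBC driver jar must be on the classpath
        // for DriverManager to locate it.
        String url = "jdbc:teiid:MoviesVDB@mm://dv-host:31000";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             Statement stmt = conn.createStatement();
             // One ANSI SQL statement against the virtual schema; the server
             // pushes one branch down to Hive (as HiveQL) and the other to MySQL.
             ResultSet rs = stmt.executeQuery(
                 "SELECT s.movie_title, s.avg_sentiment, t.ticket_sales "
               + "FROM hive_src.sentiment s "
               + "JOIN mysql_src.sales t ON s.movie_title = t.movie_title")) {
            while (rs.next()) {
                System.out.printf("%s sentiment=%.2f sales=%d%n",
                        rs.getString(1), rs.getDouble(2), rs.getLong(3));
            }
        }
    }
}
```

The point of the sketch is that the client sees a single relational source; the data virtualization server is responsible for translating and pushing each branch down to the underlying systems before joining the results.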
Hadoop Distributed File System (HDFS) presentation 27-5-2015 (Abdul Nasir)
Hadoop is a rapidly growing ecosystem of components, based on Google's MapReduce algorithm and file system work, for implementing MapReduce [3] algorithms in a scalable fashion on distributed commodity hardware. Hadoop enables users to store and process large volumes of data and analyze it in ways not previously possible with SQL-based approaches or less scalable solutions. Remarkable improvements in conventional compute and storage resources help make Hadoop clusters feasible for most organizations. This paper begins with a discussion of the evolution of Big Data [1][7][9] and its future based on Gartner's Hype Cycle. We explain how the Hadoop Distributed File System (HDFS) works and illustrate its architecture, then discuss Hadoop's MapReduce paradigm for distributing a task across multiple nodes, with sample data sets, along with how MapReduce and HDFS work when put together. The paper ends with a discussion of sample Big Data Hadoop use cases that show how enterprises can gain a competitive benefit by being early adopters of big data analytics. HDFS is the core component of the Apache Hadoop project; in HDFS, computation is carried out in the nodes where the relevant data is stored, and Hadoop also implements a parallel computational paradigm named MapReduce. In this paper, we measure the performance of read and write operations in HDFS for both small and large files, using a Hadoop cluster with five nodes. The results indicate that HDFS performs well for files larger than the default block size and poorly for files smaller than the default block size.
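For orientation, here is a minimal write-then-read sketch against the standard org.apache.hadoop.fs client API, the same path the paper measures. The NameNode address and file path are placeholders, and the small-file comment paraphrases the paper's block-size finding.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");
            // Write: the client streams data to a pipeline of DataNodes.
            // Files smaller than the block size still cost a full NameNode
            // metadata entry, one reason many small files perform poorly.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            // Read the file back and copy it to stdout.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```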
VMworld 2013
Chris Greer, FedEx
Richard McDougall, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Beyond Mission Critical: Virtualizing Big Data and Hadoop (Chiou-Nan Chen)
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
This document discusses using Red Hat JBoss Data Virtualization to gain better insights from big data. It describes how data challenges are getting bigger with the growth of big data, cloud, and mobile. Data virtualization software can virtually unify fragmented data across sources and make it available to applications as a single data source. The demo scenario shows how JBoss Data Virtualization is used to mashup sentiment analysis data from Hive with sales data from MySQL to determine if sentiment is a predictor of sales. A live demo then demonstrates integrating these different data sources through a JBoss Data Virtualization virtual data model.
Hortonworks and Red Hat Webinar - Part 2 (Hortonworks)
Learn more about creating reference architectures that optimize the delivery of the Hortonworks Data Platform. You will hear more about Hive and JBoss Data Virtualization security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Integration intervention: Get your apps and data up to speed (Kenneth Peeples)
SOA has been the de facto methodology for enterprise application and process integration, because loosely coupled components and composite applications are more agile and efficient. The perfect solution? Not quite.
The data’s always been the problem. The most efficient and agile applications and services can be dragged down by the point-to-point data connections of a traditional data integration stack. Virtualized data services can eliminate the friction and get your applications up to speed.
In this webinar we'll show you how to (replay at http://www.redhat.com/en/about/events/integration-intervention-get-your-apps-and-data-speed):
-Quickly and easily create a virtual data services layer to plug data into your SOA infrastructure for an agile and efficient solution
-Derive more business value from your services.
1. The document discusses security considerations for deploying big data as a service (BDaaS) across multiple tenants and applications. It focuses on maintaining a single user identity to prevent data duplication and enforce access policies consistently.
2. It describes using Apache Ranger to centrally define and enforce policies across Hadoop services like HDFS, HBase, Hive. Ranger integrates with LDAP/AD for authentication.
3. The key challenge is propagating user identities from the application layer to the data layer. This can be done by connecting HDFS directly via Kerberos or using a "super-user" that impersonates other users when accessing HDFS.
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su... (Abhiraj Butala)
The talk covers limitations of current Hadoop eco-system components in handling security (Authentication, Authorization, Auditing) in multi-tenant, multi-application environments. Then it proposes how we can use Apache Ranger and HDFS super-user connections to enforce correct HDFS authorization policies and achieve the required auditing.
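A minimal sketch of the "super-user" impersonation path mentioned above, using Hadoop's UserGroupInformation API. The Kerberos principal, keytab path, and user name are placeholders, and the cluster must additionally whitelist the service account via the hadoop.proxyuser.* settings for impersonation to be permitted.

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // The platform's service account logs in once with its own keytab.
        UserGroupInformation.loginUserFromKeytab(
                "bdaas-service@EXAMPLE.COM", "/etc/security/keytabs/bdaas.keytab");

        // Impersonate the end user so HDFS authorization (e.g. Ranger
        // policies) and audit logs see "alice", not the service account.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                "alice", UserGroupInformation.getLoginUser());

        proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
            try (FileSystem fs = FileSystem.get(conf)) {
                for (FileStatus st : fs.listStatus(new Path("/user/alice"))) {
                    System.out.println(st.getPath());
                }
            }
            return null;
        });
    }
}
```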
Virtualized Big Data Platform at VMware Corp IT @ VMworld 2015 (Rajit Saha)
At the VMware Corporate IT Data Solution and Delivery Team, we have built an enterprise advanced data analytics platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2, and Alpine Data Labs.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations, save our results, and present them via a BI tool.
Analysis of historical movie data by BHADRA (Bhadra Gowdra)
A recommendation system understands a person's taste and automatically finds new, desirable content for them based on patterns among their likes and ratings of different items. In this paper, we propose a recommendation system, built on the Hadoop framework, for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service).
Overview of Big data, Hadoop and Microsoft BI - version1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of big data & hadoop version 1 - Tony Nguyen (Thanh Nguyen)
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices, and technologists today. However, many people still aren't sure how Big Data and existing data warehouses can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how it can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
Kenneth Peeples, a JBoss technology evangelist, presented on better business results through open source integration. The presentation included an overview of open source integration with data virtualization, a preview of JBoss middleware integration offerings, and integration examples to help attendees get started. Specifically, it provided information on Red Hat's data virtualization, messaging, integration/ESB, and service design products and how they can help organizations innovate faster through open hybrid cloud environments. It also presented sample use cases and implementations including big data integration with data virtualization and a travel triage application.
This document discusses Hadoop and its relationship to Microsoft technologies. It provides an overview of what Big Data is, how Hadoop fits into the Windows and Azure environments, and how to program against Hadoop in Microsoft environments. It describes Hadoop capabilities like Extract-Load-Transform and distributed computing. It also discusses how HDFS works on Azure storage and support for Hadoop in .NET, JavaScript, HiveQL, and Polybase. The document aims to show Microsoft's vision of making Hadoop better on Windows and Azure by integrating with technologies like Active Directory, System Center, and SQL Server. It provides links to get started with Hadoop on-premises and on Windows Azure.
Similar to Red Hat - Presentation at Hortonworks Booth - Strata 2014
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level (Hortonworks)
The HDF 3.3 release delivers several exciting enhancements and new features. But the most noteworthy of them is the addition of support for Kafka 2.0 and Kafka Streams.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-3-taking-stream-processing-next-level/
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy (Hortonworks)
Forrester forecasts* that direct spending on the Internet of Things (IoT) will exceed $400 Billion by 2023. From manufacturing and utilities, to oil & gas and transportation, IoT improves visibility, reduces downtime, and creates opportunities for entirely new business models.
But successful IoT implementations require far more than simply connecting sensors to a network. The data generated by these devices must be collected, aggregated, cleaned, processed, interpreted, understood, and used. Data-driven decisions and actions must be taken, without which an IoT implementation is bound to fail.
https://hortonworks.com/webinar/iot-predictions-2019-beyond-data-heart-iot-strategy/
Getting the Most Out of Your Data in the Cloud with Cloudbreak (Hortonworks)
Cloudbreak, a part of Hortonworks Data Platform (HDP), simplifies the provisioning and cluster management within any cloud environment to help your business toward its path to a hybrid cloud architecture.
https://hortonworks.com/webinar/getting-data-cloud-cloudbreak-live-demo/
Johns Hopkins - Using Hadoop to Secure Access Log Events (Hortonworks)
In this webinar, we talk with experts from Johns Hopkins as they share techniques and lessons learned in real-world Apache Hadoop implementation.
https://hortonworks.com/webinar/johns-hopkins-using-hadoop-securely-access-log-events/
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys (Hortonworks)
Cybersecurity today is a big data problem. There's a ton of data landing on you faster than you can load it, let alone search it. In order to make sense of it, we need to act on data-in-motion, using both machine learning and the most advanced pattern recognition system on the planet: your SOC analysts. Advanced visualization makes your analysts more efficient, helping them find the hidden gems, or bombs, in masses of logs and packets.
https://hortonworks.com/webinar/catch-hacker-real-time-live-visuals-bots-bad-guys/
We have introduced several new features as well as delivered some significant updates to keep the platform tightly integrated and compatible with HDP 3.0.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-2-release-raises-bar-operational-efficiency/
Curing Kafka Blindness with Hortonworks Streams Messaging Manager (Hortonworks)
With the growth of Apache Kafka adoption in all major streaming initiatives across large organizations, the operational and visibility challenges associated with Kafka are on the rise as well. Kafka users want better visibility in understanding what is going on in the clusters as well as within the stream flows across producers, topics, brokers, and consumers.
With no tools in the market that readily address the challenges of the Kafka Ops teams, the development teams, and the security/governance teams, Hortonworks Streams Messaging Manager is a game-changer.
https://hortonworks.com/webinar/curing-kafka-blindness-hortonworks-streams-messaging-manager/
Interpretation Tool for Genomic Sequencing Data in Clinical Environments (Hortonworks)
The healthcare industry—with its huge volumes of big data—is ripe for the application of analytics and machine learning. In this webinar, Hortonworks and Quanam present a tool that uses machine learning and natural language processing in the clinical classification of genomic variants to help identify mutations and determine clinical significance.
Watch the webinar: https://hortonworks.com/webinar/interpretation-tool-genomic-sequencing-data-clinical-environments/
IBM+Hortonworks = Transformation of the Big Data Landscape (Hortonworks)
Last year IBM and Hortonworks jointly announced a strategic and deep partnership. Join us as we take a close look at the partnership accomplishments and the conjoined road ahead with industry-leading analytics offers.
View the webinar here: https://hortonworks.com/webinar/ibmhortonworks-transformation-big-data-landscape/
The document provides an overview of Apache Druid, an open-source distributed real-time analytics database. It discusses Druid's architecture including segments, indexing, and nodes like brokers, historians and coordinators. It also covers integrating Druid with Hortonworks Data Platform for unified querying and visualization of streaming and historical data.
Accelerating Data Science and Real Time Analytics at Scale (Hortonworks)
Gaining business advantages from big data is moving beyond just the efficient storage and deep analytics on diverse data sources to using AI methods and analytics on streaming data to catch insights and take action at the edge of the network.
https://hortonworks.com/webinar/accelerating-data-science-real-time-analytics-scale/
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA (Hortonworks)
Thanks to sensors and the Internet of Things, industrial processes now generate a sea of data. But are you plumbing its depths to find the insight it contains, or are you just drowning in it? Now, Hortonworks and Seeq team to bring advanced analytics and machine learning to time-series data from manufacturing and industrial processes.
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ... (Hortonworks)
Trimble Transportation Enterprise is a leading provider of enterprise software to over 2,000 transportation and logistics companies. They have designed an architecture that leverages Hortonworks Big Data solutions and Machine Learning models to power up multiple Blockchains, which improves operational efficiency, cuts down costs and enables building strategic partnerships.
https://hortonworks.com/webinar/blockchain-with-machine-learning-powered-by-big-data-trimble-transportation-enterprise/
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense (Hortonworks)
For years, the healthcare industry has had problems of data scarcity and latency. Clearsense solved the problem by building an open-source Hortonworks Data Platform (HDP) solution while providing decades worth of clinical expertise. Clearsense is delivering smart, real-time streaming data, to its healthcare customers enabling mission-critical data to feed clinical decisions.
https://hortonworks.com/webinar/delivering-smart-real-time-streaming-data-healthcare-customers-clearsense/
Making Enterprise Big Data Small with Ease (Hortonworks)
Every division in an organization builds its own database to keep track of its business. When the organization becomes big, those individual databases grow as well, and the data in each may become siloed, with no view of the data in the others.
https://hortonworks.com/webinar/making-enterprise-big-data-small-ease/
Driving Digital Transformation Through Global Data Management (Hortonworks)
Using your data smarter and faster than your peers could be the difference between dominating your market and merely surviving. Organizations are investing in IoT, big data, and data science to drive better customer experience and create new products, yet these projects often stall in the ideation phase due to a lack of global data management processes and technologies. Your new data architecture may be taking shape around you, but your goal of globally managing, governing, and securing your data across a hybrid, multi-cloud landscape can remain elusive. Learn how industry leaders are developing their global data management strategy to drive innovation and ROI.
Presented at Gartner Data and Analytics Summit
Speaker:
Dinesh Chandrasekhar
Director of Product Marketing, Hortonworks
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features (Hortonworks)
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today and thousands of use cases across IIOT, Healthcare and Manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A... (Hortonworks)
Join the Hortonworks product team as they introduce HDF 3.1 and the core components for a modern data architecture to support stream processing and analytics.
You will learn about the three main themes that HDF addresses:
Developer productivity
Operational efficiency
Platform interoperability
https://hortonworks.com/webinar/series-hdf-3-1-redefining-data-motion-modern-data-architectures/
Unlock Value from Big Data with Apache NiFi and Streaming CDC (Hortonworks)
The document discusses Apache NiFi and streaming change data capture (CDC) with Attunity Replicate. It provides an overview of NiFi's capabilities for dataflow management and visualization. It then demonstrates how Attunity Replicate can be used for real-time CDC to capture changes from source databases and deliver them to NiFi for further processing, enabling use cases across multiple industries. Examples of source systems include SAP, Oracle, SQL Server, and file data, with targets including Hadoop, data warehouses, and cloud data stores.
Manyata Tech Park Bangalore_ Infrastructure, Facilities and Morenarinav14
Located in the bustling city of Bangalore, Manyata Tech Park stands as one of India’s largest and most prominent tech parks, playing a pivotal role in shaping the city’s reputation as the Silicon Valley of India. Established to cater to the burgeoning IT and technology sectors
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...The Third Creative Media
"Navigating Invideo: A Comprehensive Guide" is an essential resource for anyone looking to master Invideo, an AI-powered video creation tool. This guide provides step-by-step instructions, helpful tips, and comparisons with other AI video creators. Whether you're a beginner or an experienced video editor, you'll find valuable insights to enhance your video projects and bring your creative ideas to life.
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Alluxio Webinar | 10x Faster Trino Queries on Your Data PlatformAlluxio, Inc.
Alluxio Webinar
June. 18, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jianjian Xie (Staff Software Engineer, Alluxio)
As Trino users increasingly rely on cloud object storage for retrieving data, speed and cloud cost have become major challenges. The separation of compute and storage creates latency challenges when querying datasets; scanning data between storage and compute tiers becomes I/O bound. On the other hand, cloud API costs related to GET/LIST operations and cross-region data transfer add up quickly.
The newly introduced Trino file system cache by Alluxio aims to overcome the above challenges. In this session, Jianjian will dive into Trino data caching strategies, the latest test results, and discuss the multi-level caching architecture. This architecture makes Trino 10x faster for data lakes of any scale, from GB to EB.
What you will learn:
- Challenges relating to the speed and costs of running Trino in the cloud
- The new Trino file system cache feature overview, including the latest development status and test results
- A multi-level cache framework for maximized speed, including Trino file system cache and Alluxio distributed cache
- Real-world cases, including a large online payment firm and a top ridesharing company
- The future roadmap of Trino file system cache and Trino-Alluxio integration
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
Engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
Photoshop Tutorial for Beginners (2024 Edition)alowpalsadig
Photoshop Tutorial for Beginners (2024 Edition)
Explore the evolution of programming and software development and design in 2024. Discover emerging trends shaping the future of coding in our insightful analysis."
Here's an overview:Introduction: The Evolution of Programming and Software DevelopmentThe Rise of Artificial Intelligence and Machine Learning in CodingAdopting Low-Code and No-Code PlatformsQuantum Computing: Entering the Software Development MainstreamIntegration of DevOps with Machine Learning: MLOpsAdvancements in Cybersecurity PracticesThe Growth of Edge ComputingEmerging Programming Languages and FrameworksSoftware Development Ethics and AI RegulationSustainability in Software EngineeringThe Future Workforce: Remote and Distributed TeamsConclusion: Adapting to the Changing Software Development LandscapeIntroduction: The Evolution of Programming and Software Development
Photoshop Tutorial for Beginners (2024 Edition)Explore the evolution of programming and software development and design in 2024. Discover emerging trends shaping the future of coding in our insightful analysis."Here's an overview:Introduction: The Evolution of Programming and Software DevelopmentThe Rise of Artificial Intelligence and Machine Learning in CodingAdopting Low-Code and No-Code PlatformsQuantum Computing: Entering the Software Development MainstreamIntegration of DevOps with Machine Learning: MLOpsAdvancements in Cybersecurity PracticesThe Growth of Edge ComputingEmerging Programming Languages and FrameworksSoftware Development Ethics and AI RegulationSustainability in Software EngineeringThe Future Workforce: Remote and Distributed TeamsConclusion: Adapting to the Changing Software Development LandscapeIntroduction: The Evolution of Programming and Software Development
The importance of developing and designing programming in 2024
Programming design and development represents a vital step in keeping pace with technological advancements and meeting ever-changing market needs. This course is intended for anyone who wants to understand the fundamental importance of software development and design, whether you are a beginner or a professional seeking to update your knowledge.
Course objectives:
1. **Learn about the basics of software development:
- Understanding software development processes and tools.
- Identify the role of programmers and designers in software projects.
2. Understanding the software design process:
- Learn about the principles of good software design.
- Discussing common design patterns such as Object-Oriented Design.
3. The importance of user experience (UX) in modern software:
- Explore how user experience can improve software acceptance and usability.
- Tools and techniques to analyze and improve user experience.
4. Increase efficiency and productivity through modern development tools:
- Access to the latest programming tools and languages used in the industry.
- Study live examples of applications
A neural network is a machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work together to identify phenomena, weigh options and arrive at conclusions.
Liberarsi dai framework con i Web Component.pptxMassimo Artizzu
In Italian
Presentazione sulle feature e l'utilizzo dei Web Component nell sviluppo di pagine e applicazioni web. Racconto delle ragioni storiche dell'avvento dei Web Component. Evidenziazione dei vantaggi e delle sfide poste, indicazione delle best practices, con particolare accento sulla possibilità di usare web component per facilitare la migrazione delle proprie applicazioni verso nuovi stack tecnologici.
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Regarding mobile operating systems, two major players dominate our thoughts: Android and iPhone. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
Using Query Store in Azure PostgreSQL to Understand Query PerformanceGrant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
Malibou Pitch Deck For Its €3M Seed Roundsjcobrien
French start-up Malibou raised a €3 million Seed Round to develop its payroll and human resources
management platform for VSEs and SMEs. The financing round was led by investors Breega, Y Combinator, and FCVC.
The Rising Future of CPaaS in the Middle East 2024Yara Milbes
Explore "The Rising Future of CPaaS in the Middle East in 2024" with this comprehensive PPT presentation. Discover how Communication Platforms as a Service (CPaaS) is transforming communication across various sectors in the Middle East.
What to do when you have a perfect model for your software but you are constrained by an imperfect business model?
This talk explores the challenges of bringing modelling rigour to the business and strategy levels, and talking to your non-technical counterparts in the process.
Enhanced Screen Flows UI/UX using SLDS with Tom KittPeter Caitens
Join us for an engaging session led by Flow Champion, Tom Kitt. This session will dive into a technique of enhancing the user interfaces and user experiences within Screen Flows using the Salesforce Lightning Design System (SLDS). This technique uses Native functionality, with No Apex Code, No Custom Components and No Managed Packages required.
Transforming Product Development using OnePlan To Boost Efficiency and Innova...OnePlan Solutions
Ready to overcome challenges and drive innovation in your organization? Join us in our upcoming webinar where we discuss how to combat resource limitations, scope creep, and the difficulties of aligning your projects with strategic goals. Discover how OnePlan can revolutionize your product development processes, helping your team to innovate faster, manage resources more effectively, and deliver exceptional results.
The Power of Visual Regression Testing_ Why It Is Critical for Enterprise App...kalichargn70th171
Visual testing plays a vital role in ensuring that software products meet the aesthetic requirements specified by clients in functional and non-functional specifications. In today's highly competitive digital landscape, users expect a seamless and visually appealing online experience. Visual testing, also known as automated UI testing or visual regression testing, verifies the accuracy of the visual elements that users interact with.
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...kalichargn70th171
In today's fiercely competitive mobile app market, the role of the QA team is pivotal for continuous improvement and sustained success. Effective testing strategies are essential to navigate the challenges confidently and precisely. Ensuring the perfection of mobile apps before they reach end-users requires thoughtful decisions in the testing plan.
Red Hat - Presentation at Hortonworks Booth - Strata 2014
1. Discover Red Hat and Hortonworks for the Modern Data Architecture
Kimberly Palko, Product Manager, Red Hat
2. Agenda
● Red Hat and JBoss Middleware overview
● Combining data in Hadoop with traditional data sources
● Federating two geographically distributed Hadoop clusters
● Virtual data marts for a Hadoop data lake
3. Red Hat & JBoss Middleware Overview
4. Engineering Collaboration Benefits
● Integration with JBoss Data Virtualization: enables agile Hadoop big data integration with existing enterprise assets and maximizes universal data utilization to enable self-service analytics
● Integration with the Red Hat JBoss Middleware product family: enables millions of JBoss developers to quickly build applications with Hadoop
● Integration with Red Hat Storage: enables Hadoop to use Red Hat Storage as a secure, resilient storage pool for data applications
● Integration with Red Hat Enterprise Linux OpenStack Platform: simplifies automated deployment of Hadoop on OpenStack
● Integration with Red Hat Enterprise Linux and OpenJDK: develop and deploy Apache Hadoop as an integrated component across multiple deployment scenarios
5. Big Data Integration: Turn Data into Actionable Information
Speed of iteration leads to success.
[Architecture diagram: semi/unstructured data (social, logs), structured data (DW, OLAP, OLTP), and streaming data (events, IoT) are ingested, enriched, integrated, and analyzed through Hadoop and NoSQL; data integration and data services with JBoss Data Virtualization; in-memory data management with JBoss Data Grid; BI analytics (diagnostic, descriptive, predictive, prescriptive); SOA applications; event processing and messaging with JBoss BRMS and JBoss A-MQ; all running on Red Hat Enterprise Linux and Red Hat Storage.]
6. Data Challenges Getting Bigger…
[Diagram: Hadoop ecosystem components including HDFS, MapReduce, Hive, HBase, NoSQL stores, Storm, and Spark.]
7. Make Big Data Accessible for Everyone
8. Data Supply and Integration Solution
Data virtualization sits in front of multiple data sources and allows them to be treated as a single source, delivering the desired data, in the required form, at the right time, to any application and/or user.
THINK: VIRTUAL MACHINE FOR DATA
9. Easy Access to Big Data
● The reporting tool accesses the data virtualization server via a rich SQL dialect
● The data virtualization server translates the rich SQL dialect to HiveQL
● Hive translates HiveQL to MapReduce
● MapReduce runs the job on the big data
[Diagram: Analytical Reporting Tool → Data Virtualization Server → Hive → MapReduce → HDFS (Hadoop big data).]
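To make this flow concrete, here is a minimal sketch (all schema, table, and column names are hypothetical) of the kind of SQL a reporting tool might issue against a virtual table, with a comment showing roughly how the server could push it down as HiveQL:

    -- Ordinary SQL issued by the reporting tool against a virtual view
    -- "Sales.weblogs" that the data virtualization server maps to Hive:
    SELECT region,
           COUNT(*) AS visits,
           AVG(session_seconds) AS avg_session
    FROM Sales.weblogs
    WHERE visit_date >= DATE '2014-01-01'
    GROUP BY region
    ORDER BY visits DESC;

    -- The server translates this into roughly equivalent HiveQL, e.g.:
    --   SELECT region, COUNT(*), AVG(session_seconds)
    --   FROM weblogs WHERE visit_date >= '2014-01-01' GROUP BY region;
    -- applying any operations the source cannot handle (such as the
    -- final sort) itself.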
10. Different Users, Different Views of Big Data
● Logical tables with different forms of aggregation
● Logical tables containing extra derived data
● Logical tables with filtered data
● All reports/users share the same specifications
[Diagram: multiple logical tables defined in the data virtualization server over Hive, MapReduce, and HDFS.]
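As a sketch of what these logical tables can look like, the following Teiid-style view definitions (all names hypothetical) show one aggregated view, one view with extra derived data, and one filtered view over the same Hive-backed source:

    -- Aggregated logical table over a hypothetical source table Hive.clicks:
    CREATE VIEW DailyClicks AS
    SELECT click_date, page, COUNT(*) AS clicks
    FROM Hive.clicks
    GROUP BY click_date, page;

    -- Logical table containing extra derived data:
    CREATE VIEW ClicksEnriched AS
    SELECT c.*,
           CASE WHEN session_seconds > 300 THEN 'engaged' ELSE 'bounce' END AS visit_type
    FROM Hive.clicks c;

    -- Logical table with filtered data; all three share one underlying specification:
    CREATE VIEW ClicksEU AS
    SELECT * FROM Hive.clicks WHERE region = 'EU';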
11. USE CASE 1: COMBINING DATA FROM HADOOP WITH TRADITIONAL SOURCES, USING JBOSS DATA VIRTUALIZATION
12. Integration of Big Data with “Small Data”
● Integrating small data with big data is easy
● Integration specifications can be shared or developed for individual reports, as sketched below
[Diagram: the data virtualization server joins Hive, MapReduce, and HDFS with a relational database and an application server.]
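A minimal sketch of such an integration, assuming a Hive-backed source model and a relational source model registered in the same virtual database (all model, table, and column names are hypothetical); the server federates the join across the two systems:

    -- Join clickstream data in Hadoop with customer master data in an RDBMS.
    SELECT cust.customer_name,
           cust.segment,
           COUNT(*) AS page_views
    FROM HiveModel.clickstream AS clicks
    JOIN OracleModel.customers AS cust
      ON clicks.customer_id = cust.customer_id
    GROUP BY cust.customer_name, cust.segment;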
13. Caching the Big Data
● Caches to speed up interactive reporting
● Caches to create a consistent view of big data
● Different caches for different reports
[Diagram: cached views in the data virtualization server over Hive, MapReduce, and HDFS.]
14. USE CASE 2: GEOGRAPHICALLY DISTRIBUTED HADOOP CLUSTERS WITH DATA VIRTUALIZATION - SECURING DATA BY USER ROLE
15. Role-Based Access Control
● Roles: define roles based on the organization hierarchy
● Users: external authentication via Kerberos, LDAP, etc.
● VDB: assign users and groups to a virtual database
16. Authentication
● Kerberos: from the client to the virtual database
● Login modules: LDAP (MS Active Directory, OpenLDAP, etc.), any JAAS-based security domain
● REST and web services: WS-UsernameToken, HTTP Basic authentication
● SAML: SAML authentication for web client applications
18. Row and Column Masking
● Row-based masking, e.g. keyed off a geographic marker
● Column masking to a constant, null, or a SQL expression
Example: change all but the last 4 digits of a credit card number to stars (substring is 1-based, so the last four characters start at length - 3):
concat('****', substring(column, length(column) - 3))
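One way to express such a mask is directly in a virtual view. The sketch below assumes a Teiid-style hasRole() function and hypothetical table, column, and role names; users in the privileged role see the full number, everyone else sees a masked one:

    CREATE VIEW PaymentsMasked AS
    SELECT payment_id,
           amount,
           CASE WHEN hasRole('data', 'PCI_Admin')
                THEN card_number
                ELSE concat('****', substring(card_number, length(card_number) - 3))
           END AS card_number
    FROM Finance.payments;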
19. Summary of Security Capabilities
● Authentication
– Kerberos, LDAP, WS-UsernameToken, HTTP Basic, SAML
● Authorization
– Virtual data views, role-based access control
● Administration
– Centralized management of VDB privileges
● Audit
– Centralized audit logging and dashboard
● Protection
– Row and column masking
– SSL encryption (ODBC and JDBC)
21. Use Case 2: Federating across Geographically Distributed Hadoop Clusters
Problem: Geographically distributed Hadoop clusters contain sensitive data, like patient records or customer identification, that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.
Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns while federating across Hadoop clusters.
Data can be accessed by multiple tools and methods already in-house.
[Diagram: Connect, Compose, Consume layers of JBoss Data Virtualization federating Hive on a Hadoop cluster in one geographic region with Hive on a Hadoop cluster in a second geographic region.]
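A minimal sketch of the row-level rule, again assuming a Teiid-style hasRole() function and hypothetical role and source names; IT sees every row, while regional users see only their own region's rows:

    CREATE VIEW PatientsSecured AS
    SELECT *
    FROM (SELECT * FROM RegionEast.patients
          UNION ALL
          SELECT * FROM RegionWest.patients) AS p
    WHERE hasRole('data', 'IT_Global')
       OR (hasRole('data', 'East_Users') AND p.region = 'EAST')
       OR (hasRole('data', 'West_Users') AND p.region = 'WEST');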
22. Use Case 2 - Architecture
[Diagram: business analytics, custom applications, and packaged applications consume a virtual data mart layered over the data system.]
23. Use Case 2 - Resources
● GUIDE
How-to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase2
Tutorial: available soon
● VIDEOS
http://vimeo.com/user16928011/hortonworksusecase2short
● SOURCE
https://github.com/DataVirtualizationByExample/HortonworksUseCase2
24. USE CASE 3: VIRTUAL DATA MARTS FOR HADOOP DATA LAKE - WITH JBOSS DATA VIRTUALIZATION
25. Data for the entire organization in a Hadoop Data Lake
Problem: How does IT control access and give business users just the data they need?
- Does every line of business have access to everyone’s data?
- How do business users get access to the data they need in a simple (even self-service) way?
[Diagram: a Hadoop data lake holding HR employee files, marketing clickstream data, finance expense reports, server logs, sales transactions, customer accounts, and Twitter sentiment data.]
26. Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts on top of a Hadoop cluster.
- Lines of business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake; nothing is copied or moved
[Diagram: Marketing, Finance, and IT virtual data marts over the Hadoop data lake containing marketing clickstream data, customer accounts, Twitter sentiment data, server logs, HR employee files, sales transactions, and finance expense reports.]
27. Optional hierarchical data architectures with virtual data marts
Can be combined with security features like user-role access and row and column masking, as sketched below.
[Diagram: a department base virtual database (VDB) with Team 1 and Team 2 VDBs layered on top, each exposing its own views (View1, View2).]
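A minimal sketch of the layering, with all names hypothetical: the team view is defined against the department base VDB rather than against the raw sources, so the base VDB's security rules still apply underneath:

    -- In the department base VDB: a shared, already-secured view.
    CREATE VIEW DeptSales AS
    SELECT region, product, amount, sale_date
    FROM Lake.sales_transactions;

    -- In a team VDB layered on top ("DeptBase" is the hypothetical
    -- name under which the base VDB is imported):
    CREATE VIEW Team1View AS
    SELECT region, SUM(amount) AS revenue
    FROM DeptBase.DeptSales
    GROUP BY region;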
28. Virtual Data Marts for Operational Data
Problem: All the legacy and archived data is in the Hadoop data lake. We want to access the most recent, up-to-the-minute operational data often and quickly.
[Diagram: the Hadoop data lake holds historical data: HR employee files, marketing clickstream data, finance expense reports, server logs, sales transactions, customer accounts, and Twitter sentiment data.]
29. Caching for Faster Performance: Materialized Views
● The same cached view serves multiple queries
● Refreshed automatically or manually
● The cache repository can be any supported data source
[Diagram: Query 1 and Query 2 against a virtual database (VDB) are served by a cached or materialized copy of View 1.]
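A minimal sketch of how a view can be marked for materialization in Teiid-style DDL; the MATERIALIZED option is the standard flag, while the TTL extension property and all table and column names here are assumptions for illustration:

    -- Ask the server to cache the view's result set; the TTL property
    -- (milliseconds) is assumed here to request a refresh roughly hourly.
    CREATE VIEW SalesSummary (
        region string,
        revenue bigdecimal
    ) OPTIONS (MATERIALIZED 'TRUE', "teiid_rel:MATVIEW_TTL" 3600000)
    AS SELECT region, SUM(amount) FROM Lake.sales_transactions GROUP BY region;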
30. Virtual Operational Data Store
Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from multiple diverse data sources so that it can be quickly queried.
- Use HDP for older data
- Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB
[Diagram: an operational VDB with up-to-the-minute data sits alongside a materialized view of the historical data in the Hadoop data lake (HR employee files, marketing clickstream data, finance expense reports, server logs, sales transactions, customer accounts, Twitter sentiment data), with periodic transfer from the data sources.]
32. Use Case 3 - Overview
Objective:
– Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data
Problem:
– Data lakes hold large volumes of consolidated clickstream, product, and customer data that need to be constrained for multi-departmental use.
Solution:
– Leverage HDP to mash up clickstream analysis data with product and customer data
– Leverage JBoss Data Virtualization to provide virtual data marts for the Marketing and Product teams
33. Use Case 3 - Architecture
[Diagram: business analytics, custom applications, and packaged applications sit over a virtual data mart; underneath, HDP 2.1 (governance & integration, security, operations, data access, data management) draws on existing sources (CRM, ERP, clickstream, logs) and emerging sources (sensor, sentiment, geo, unstructured).]
34. Use Case 3 - Resources
● GUIDE
How-to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Tutorial: available soon
● VIDEOS
http://vimeo.com/user16928011/hwxuc3configuration
http://vimeo.com/user16928011/hwxuc3run
http://vimeo.com/user16928011/hwxuc3overview
● SOURCE
https://github.com/DataVirtualizationByExample/HortonworksUseCase3
36. Use Case 1: Combine Data from Hadoop with Traditional Data Sources
Problem: Data from new data sources like social media, clickstream, and sensors needs to be combined with data from traditional sources to get the full value.
Solution: Leverage JBoss Data Virtualization to mash up new data in Hadoop with data in traditional data sources, without moving or copying any data, and access it through a variety of BI tools and SOA technologies.
Data can be accessed by multiple tools and methods already in-house.
[Diagram: Connect, Compose, Consume layers of JBoss Data Virtualization over Source 1 (Hive/Hadoop, holding data from new sources like social media, clickstream, and sensor data) and Source 2 (traditional relational databases in the enterprise).]
37. Use Case 1 - Architecture
[Diagram: business analytics, custom applications, and packaged applications consume a virtual data mart federating traditional repositories (RDBMS, EDW, MPP) with the data system.]
39. Use Case 1 - Resources
http://hortonworks.com/hadoop-tutorial/evolving-data-stratagic-asset-using-hdp-red-hat-jboss-data-virtualization/
40. Benefits of Data Virtualization on Big Data
● Enterprise democratization of big data
● Any reporting or analytical tool can be used
● Easy access to big data
● Seamless integration of big data and small data
● Sharing of integration specifications
● Collaborative development on big data
● Fine-grained security of big data
● Speedy delivery of reports on big data