During the second half of 2016, IBM built a state-of-the-art Hadoop cluster with the aim of running massive-scale workloads. The amount of data available to derive insights continues to grow exponentially in this increasingly connected era, resulting in larger and larger data lakes year after year. SQL remains one of the most commonly used languages for performing such analysis, but how do today’s SQL-over-Hadoop engines stack up to real BIG data? To find out, we decided to run a derivative of the popular TPC-DS benchmark using a 100 TB dataset, which stresses both the performance and SQL support of data warehousing solutions! Over the course of the project, we encountered a number of challenges such as poor query execution plans, uneven distribution of work, out-of-memory errors, and more. Join this session to learn how we tackled such challenges and the type of tuning that was required across the various layers in the Hadoop stack (including HDFS, YARN, and Spark) to run SQL-on-Hadoop engines such as Spark SQL 2.0 and IBM Big SQL at scale!
Speaker
Simon Harris, Cognitive Analytics, IBM Research
The annual review session by the AMIS team on their findings, interpretations and opinions regarding news, trends, announcements and roadmaps around Oracle's product portfolio. This presentation discusses architecture trends, container technology, disruptive movements such as IoT, Blockchain, Intelligent Bots and Machine Learning, Modern User Experience, Enterprise Integration, Autonomous Systems in general and Autonomous Database in particular, Security, Cloud, Networking, Java, High PaaS & Low PaaS, DevOps, Microservices, Hybrid Cloud. This Oracle OpenWorld - more than any in recent history - rocked the foundations of the Oracle platform and opened up some real new roads ahead. This presentation leads you through the most relevant announcements and new directions.
A presentation on the upcoming 18c database, which incorporates the best of Oracle's technologies, shaping an autonomous database.
Enabling the Software Defined Data Center for Hybrid IT - NetApp
Recently, NetApp held a Cloud Breakfast for customers of our High Touch Customer Program. This was a combined presentation from OBS, VMware and NetApp.
Presenters:
Jim Sangster, Senior Director, Solutions Marketing, NetApp - "Cloud for the Hybrid Data Center"
John Gilmartin, Vice President, Cloud Infrastructure Products, VMware - "Next Generation of IT"
Axel Haentjens, Vice President, Marketing and International, Orange Cloud for Business - "NetApp Epic Story OBS"
Tim Waldron, Manager, Cloud Solutions, NetApp EMEA - "Cloud Services – An EMEA Perspective"
The Future of Data Warehousing, Data Science and Machine Learning - ModusOptimum
Watch the on-demand recording here:
https://event.on24.com/wcc/r/1632072/803744C924E8BFD688BD117C6B4B949B
Evolution of Big Data and the Role of Analytics | Hybrid Data Management
IBM: Driving the future Hybrid Data Warehouse with the IBM Integrated Analytics System.
Oracle Database Consolidation with FlexPod on Cisco UCS - NetApp
Cisco and Oracle as technology front-runners provide you with the tools you need to optimize your Oracle environments! John McAbel, Senior Product Manager - Oracle Solutions on UCS at Cisco Systems, explains how NetApp and Cisco are providing a flexible infrastructure that helps prepare organizations for today and for future business growth and change.
Oracle GoldenGate Cloud Service Overview - Jinyu Wang
The new PaaS solution in Oracle Public Cloud extends the real-time data replication from on-premises to cloud, and leads the innovation of real-time data movement with the powerful data streaming capability for enterprise solutions.
Cloud Innovation Day - Commonwealth of PA v11.3 - Eric Rice
Enhance and accelerate your path to digital innovation and transformation with IBM Cloud. Develop a roadmap to get started with cloud and incorporate best practices from other organizations just like yours.
Enterprise Data Warehouse Optimization: 7 Keys to Success - Hortonworks
You have a legacy system that no longer meets the demands of your current data needs, and replacing it isn’t an option. But don’t panic: modernizing your traditional enterprise data warehouse is easier than you may think.
IBM Cloud Pak for Data is a single unified platform that simplifies the collection, organization and analysis of data. Enterprises can turn data into insights through an integrated cloud-native architecture. IBM Cloud Pak for Data is extensible and easily customized to unique client data and AI landscapes through an integrated catalog of IBM, open source and third-party microservice add-ons.
Reinventing Oracle Systems in a Cloudy World (Sangam20, December 2020) - Lucas Jellema
The cloud is changing many things. Even the decision to not (yet) adopt the cloud is one to make explicitly. Now is the time for any organization to reconsider its IT landscape. For each system, we should make a conscious ruling on its roadmap. The 6R model suggests six ways to move a system forward. This session uses the 6R model and applies it specifically to Oracle technology-based systems: what are the options and considerations for Oracle Database, Oracle Fusion Middleware, custom applications and other red components? What future should we consider and how do we choose? The paths chosen by several Oracle-heavy users are presented to illustrate these options and the decision-making process. Oracle Cloud Infrastructure and Autonomous Database play a role, as do Azure IaaS and Azure Managed Database as well as on-premises systems. Latency, recovery, scalability, licenses, automation, lock-in, skills and resources all make their appearance.
Big SQL: Powerful SQL Optimization - Re-Imagined for open source - DataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. At this session, learn how IBM, working with the Apache community, has unlocked the value of their SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing best performance, concurrency and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider and HBase. We share the results of this project which has enabled running all 99 TPC-DS queries at world record breaking 100TB scale factor.
DRM Webinar Series, PART 3: Will DRM Integrate With Our Applications? - US-Analytics
In the third part of the series, we'll debunk myths around integrating DRM:
“It can’t automate or integrate with my non-Oracle products like SAP, Salesforce, Workday, or ServiceNow.”
“DRM doesn’t support a SaaS-based cloud architecture.”
“It doesn’t have delivered support for maintaining Oracle EPM products, like Essbase, Planning, HFM, and PBCS."
Deliver Secure SQL Access for Enterprise APIs - August 29 2017 - Nishanth Kadiyala
This is a webinar we ran on August 29, 2017; more than 700 users registered for it. In this webinar, Dipak Patel and Dennis Bennett talk about how companies can build SQL access to their enterprise APIs.
Abstract:
Companies build numerous internal applications and complex APIs for enterprise data access. These APIs are often based on protocols such as REST or SOAP, with payloads in XML or JSON, and engineered for application developers. Today, however, enterprise data teams are trying to access this data for analytics, which requires standard query capabilities and the ability to surface metadata. As enterprises adopt new analytical and data management tools, a SQL access layer for this data becomes imperative. Many such enterprises from the Financial Services, Healthcare and Software industries are relying on our OpenAccess SDK to build a custom ODBC, JDBC, ADO.NET or OLE DB layer on top of their internal APIs and hosted multi-tenant databases.
Watch this webinar to learn:
1. Use cases for providing SQL access to your enterprise data
2. How organizations provide SQL access to their APIs
3. A demo using DataDirect OpenAccess SDK to provide SQL access for a REST API
4. Pitfalls and best practices for building a SQL access layer
Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes) - DataWorks Summit
Most organizations today implement different data stores to support business operations. As a result, data ends up stored across a multitude of often heterogeneous systems, like RDBMS, NoSQL, data warehouses, data marts, Hadoop, etc., with limited interaction and/or interoperability between them. The end result is often a vast ecosystem of data stores with different "temperature" data, some level of duplication, and no effective way of bringing it all together for business analytics. With such disparate data, how can an organization exploit the wealth of information? This opens up the need for proven techniques to quickly and easily deliver the data to the people who need it. In this session, you'll see how to modernize your enterprise by making data accessible with enterprise capabilities like querying using SQL, granular security for data access, and maintaining high query performance and high concurrency.
Spark working with a Cloud IDE: Notebook/Shiny Apps - Data Con LA
Abstract:-
The Problem: Energy inefficiency within public/private buildings in the City of New York.
The Goal: Take meter (sensor) data and solve the inefficiencies through better insights.
The Solution: Visualization and reporting through the Shiny app to gain knowledge of past and present usage patterns. In addition to those patterns, compare and gain insights/predictions on energy usage.
Spark's DataFrames and RDDs will be used in concert with the pandas library to clean and model/prepare data for the R Shiny app. The aim of this meetup discussion is to show the capabilities of Spark while using DSX and the RStudio/Shiny app to create visualizations/reporting that give insights to the end user.
There are a few techniques that we will present in this notebook for both modeling and ML: linear regression, K-Means clustering for identifying inefficient buildings, and (statistical) classification modeling, followed by a confusion matrix (error matrix).
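As a hedged illustration of the Spark side of this workflow (column names, file paths, and the choice of k are hypothetical, not taken from the talk), a minimal PySpark K-Means sketch might look like the following.

```python
# Minimal PySpark sketch (hypothetical columns and paths) of the clustering step
# described above: read meter readings, aggregate per building, and run K-Means
# to flag buildings whose usage profile looks inefficient.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("nyc-energy-clustering").getOrCreate()

# Hypothetical CSV of meter (sensor) readings: building_id, kwh, square_feet
readings = spark.read.csv("meter_readings.csv", header=True, inferSchema=True)

# Aggregate usage per building and normalize by floor area.
per_building = (readings.groupBy("building_id")
                .agg(F.sum("kwh").alias("total_kwh"),
                     F.first("square_feet").alias("square_feet"))
                .withColumn("kwh_per_sqft", F.col("total_kwh") / F.col("square_feet")))

# Assemble features and cluster; buildings in the high-usage cluster become
# candidates for an "inefficient" label in the Shiny app.
features = VectorAssembler(inputCols=["total_kwh", "kwh_per_sqft"],
                           outputCol="features").transform(per_building)
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)   # adds a 'prediction' column
clustered.select("building_id", "kwh_per_sqft", "prediction").show(10)
```

The clustered DataFrame can then be converted with toPandas() and handed to the R/Shiny reporting layer described above.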
Bio:-
Thomas Liakos has been an Open Source Systems Engineer for 11 years and has 8 years of experience in cloud and hybrid environments. Prior to IBM, Thomas was a Sr. Systems Architect at Gem.co and a DevOps/Systems Engineer in Cloud Operations at CrowdStrike. Thomas has expertise in Spark, Python, systems and configuration management, architecture, data warehousing, and data engineering.
Better Total Value of Ownership (TVO) for Complex Analytic Workflows with the... - ModusOptimum
Customers are looking for ways to streamline analytic decisioning: quicker deployments, faster time to value, lower risk of failure, and higher revenues/profits. The IBM & Hortonworks solution delivers on these customer needs.
https://event.on24.com/eventRegistration/EventLobbyServlet?target=reg20.jsp&eventid=1789452&sessionid=1&eventid=1789452&sessionid=1&mode=preview&key=E0F94DE1191C59223B6522A075023215
A complete cognitive approach comprises three components: a method, an ecosystem, and a platform. In this session we will see how to put this approach into practice with the help of Watson Data Platform, which helps data scientists and business analytics experts put data to work from a cognitive perspective. This, in turn, can drive business growth and change. We will focus on analyzing data from social media to gauge how the administration is perceived by students, parents, the press, bloggers, and others.
At the heart of the solution are a set of services designed for each business function (developers, data scientists, data engineers, communication/marketing) and the capacity to learn that is inherent in cognitive technology, which complete the architecture and help to "compose" new business solutions.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
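For a flavor of the hands-on portion, a minimal scikit-learn example of the train-and-evaluate loop (using a bundled toy dataset rather than the workshop's own datasets) could look like this.

```python
# A short scikit-learn example in the spirit of the hands-on labs: train a
# classifier on a bundled dataset and evaluate it, as you would inside CDSW.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```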
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables are also a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
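As a complement to the Spring Boot example mentioned above, here is a hedged Python sketch of querying such a Phoenix table through the Phoenix Query Server; the host, table, and column names are hypothetical, not from the original flow.

```python
# Hypothetical sketch of querying a Phoenix table (as populated by the NiFi
# flow above) through the Phoenix Query Server using the phoenixdb package.
# Host, table, and column names are illustrative placeholders.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cursor = conn.cursor()

# Recent incidents for a given district, e.g. served to a web app or REST endpoint.
cursor.execute(
    "SELECT incident_id, incident_type, occurred_at "
    "FROM PHILLY_CRIME WHERE district = ? "
    "ORDER BY occurred_at DESC LIMIT 20",
    ("22",))
for incident_id, incident_type, occurred_at in cursor.fetchall():
    print(incident_id, incident_type, occurred_at)
```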
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables have to be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs, where we are now handling greater than 500 billion writes a day in our current ingestion systems. It needs to have strong consistency and provide high throughput for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to the point where we are now handling greater than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound more on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
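To make the global-index idea concrete, here is a hedged Python sketch of the concept (not Uber's implementation) using the happybase HBase client; the Thrift host, table name, column family, and key format are all hypothetical.

```python
# Sketch of the global-index idea described above using happybase (a Python
# HBase client). This illustrates the concept, not Uber's code: the index maps
# a record key to the HDFS file that currently holds it, so an incoming change
# can be routed as an update instead of being treated as a fresh insert.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
index = connection.table("trips_global_index")          # hypothetical table

def route_record(record_key: bytes, default_file: bytes) -> bytes:
    """Return the HDFS file an incoming record should be written to."""
    row = index.row(record_key, columns=[b"loc:hdfs_file"])
    if row:
        return row[b"loc:hdfs_file"]        # known key: update its current file
    # first time we see this key: register where it will be written
    index.put(record_key, {b"loc:hdfs_file": default_file})
    return default_file

print(route_record(b"trip#12345", b"hdfs://warehouse/trips/part-0001.parquet"))
```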
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
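To make the optimizer discussion concrete, here is a small hedged sketch of asking Presto for a query plan from Python via PyHive; the coordinator host, catalog, and tables are hypothetical stand-ins.

```python
# A small sketch (hypothetical host and tables) of inspecting the plan Presto's
# cost-based optimizer chose for a query, using the PyHive Presto client.
from pyhive import presto

conn = presto.connect(host="presto-coordinator", port=8080,
                      catalog="hive", schema="default")
cursor = conn.cursor()

# EXPLAIN shows the distributed plan, including join order and join
# distribution type, which the CBO derives from available statistics.
cursor.execute("""
    EXPLAIN
    SELECT c.nationkey, count(*) AS orders
    FROM orders o JOIN customer c ON o.custkey = c.custkey
    GROUP BY c.nationkey
""")
for (line,) in cursor.fetchall():
    print(line)
```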
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
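A minimal example of the "few lines of code" tracking pattern described above, using MLflow's Python API with a small scikit-learn model, might look like this.

```python
# A minimal MLflow Tracking example: log parameters, a metric, and a
# deployable model artifact from a training script.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)

    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")   # packaged for later deployment
```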
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data in on-premises as well as cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
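As an illustration of central policy management, the sketch below creates a Ranger policy through the admin server's public REST API from Python; the service name, resources, users, and credentials are hypothetical, and the payload is a minimal subset of the fields a real policy would carry.

```python
# Hedged sketch of creating a Ranger policy programmatically through the
# public v2 REST API (service name, users, resources and credentials are
# hypothetical placeholders).
import requests

policy = {
    "service": "cm_hive",                       # hypothetical Ranger service
    "name": "analysts_read_sales",
    "resources": {
        "database": {"values": ["sales_db"]},
        "table":    {"values": ["transactions"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "users": ["analyst1"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    "http://ranger-admin:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin-password"),
)
resp.raise_for_status()
print("created policy id:", resp.json().get("id"))
```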
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
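As a concrete, hedged illustration of the object-detection building block discussed above (not code from the talk), the snippet below runs a pretrained torchvision detector over a single, hypothetical camera frame.

```python
# Illustrative sketch of the object-detection step behind shelf monitoring:
# a pretrained torchvision detector applied to one camera frame.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

frame = Image.open("shelf_camera_frame.jpg").convert("RGB")  # hypothetical frame
with torch.no_grad():
    detections = model([to_tensor(frame)])[0]

# Keep confident detections; in a store these would feed stock counts,
# shelf-organization checks, or "customer needs assistance" alerts.
for box, label, score in zip(detections["boxes"],
                             detections["labels"],
                             detections["scores"]):
    if score > 0.8:
        print(int(label), [round(v, 1) for v in box.tolist()], float(score))
```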
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark - DataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
*TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC)
Some of these challenges might sound familiar
Modernizing your existing EDW without long and costly migration efforts
Querying across multiple types of data management systems
Lack of ability to query across traditional data warehouses and cloud-based data systems
Lack of skill set to migrate data from RDBMS to Hadoop/Hive
Optimizing and integrating external data sources with existing data sources
Offloading and porting workloads from Oracle, Db2 & Netezza
Difficulty offloading workloads from EDW platforms
Performance
Slows down once too many users access the system
Interactive query performance is unacceptable
Operational efficiency around older data warehouse environments
Requires tools to help with ease of use and automation to manage workloads and schedule jobs
4 things to remember
Compatible with Oracle, Db2, Netezza
Modernizing EDW Workloads on Hadoop
App portability
Federates data behind a single SQL engine
Uses connectors to Teradata, Oracle, Db2, Netezza
Addresses the skills gap needed to migrate technologies
People hate migrations and rewriting code; it interrupts business processes and takes resources away from other work
There is a skills gap to do this; customers want an engine that is 100% compatible
With Big SQL, you can go into your accounts with Netezza, Oracle, and Db2; they can move part of their RDBMS data warehouse to Hadoop without having to change any code
Ask your customer: do you want to optimize your Oracle, Netezza, and Db2 workloads by moving them to Hive?
Sick of Oracle or Netezza? We have a solution to help you get them out.
Finally, it delivers high performance for complex SQL workloads
Big SQL is just an alternate execution engine that uses the same Hive storage model and integrates with the Hive metastore. In fact, Big SQL won’t work without the Hive metastore.
Instead of MapReduce, Big SQL uses a native C/C++ MPP engine. Applications can choose to connect to Hive, or Big SQL – they both co-exist on the Hadoop platform. In the following slides, we’ll cover the benefits of using Big SQL’s execution engine over MapReduce.
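A hedged sketch of that co-existence from the client side is shown below: the same metastore-backed table is queried once through HiveServer2 and once through Big SQL's head node. The host names, port, and credentials are hypothetical placeholders, and the Python drivers shown (PyHive and ibm_db_dbi) are just one possible client pairing.

```python
# Hedged sketch of the co-existence point above: the same metastore table can
# be read through HiveServer2 or through Big SQL's head node. Hosts, the port
# and credentials below are hypothetical placeholders.
from pyhive import hive
import ibm_db_dbi

query = "SELECT COUNT(*) FROM sales.store_sales"   # hypothetical table

# Via Hive (MapReduce/Tez execution)
hive_cur = hive.connect(host="hive-server2", port=10000,
                        username="analyst").cursor()
hive_cur.execute(query)
print("hive:", hive_cur.fetchone())

# Via Big SQL (native MPP C/C++ runtime), using the Db2 client driver
bigsql_conn = ibm_db_dbi.connect(
    "DATABASE=BIGSQL;HOSTNAME=bigsql-head;PORT=32051;PROTOCOL=TCPIP;"
    "UID=analyst;PWD=secret;", "", "")
bigsql_cur = bigsql_conn.cursor()
bigsql_cur.execute(query)
print("big sql:", bigsql_cur.fetchone())
```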
Federation consists of the following: a federated server, a federated database, data sources, and clients, which can be users and applications accessing both the local database and databases from the data sources. Federation is known for its strengths (a minimal setup sketch follows the list below):
- Transparent: Correlate data from local tables and remote data sources, as if all the data is stored locally in the federated database
- Extensible: Update data in relational data sources, as if the data is stored in the federated database
- Autonomous: Move data to and from relational data sources without interruptions
- High Function: Take advantage of the data source processing strengths, by sending requests to the data sources for processing
- High Performance: Compensate for SQL limitations at the data source by processing parts of a distributed request at the federated server
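A minimal sketch of what this setup can look like is shown below, using Db2-style federation DDL of the kind Big SQL surfaces; the server, credentials, nickname, and table names are hypothetical, and the connection details are placeholders.

```python
# Hedged sketch of the federation flow described above, using Db2-style
# federation DDL as surfaced in Big SQL. Server, schema, credential and table
# names are hypothetical placeholders.
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=BIGSQL;HOSTNAME=bigsql-head;PORT=32051;PROTOCOL=TCPIP;"
    "UID=fedadmin;PWD=secret;", "", "")
cursor = conn.cursor()

setup = [
    # Register the remote Oracle source with the federated server
    """CREATE SERVER ora_sales TYPE ORACLE VERSION 12 WRAPPER NET8
       OPTIONS (NODE 'ora_node')""",
    # Map the current Big SQL user to credentials on the remote source
    """CREATE USER MAPPING FOR USER SERVER ora_sales
       OPTIONS (REMOTE_AUTHID 'sales_ro', REMOTE_PASSWORD 'secret')""",
    # Expose a remote table under a local nickname
    """CREATE NICKNAME ora_orders FOR ora_sales."SALES"."ORDERS" """,
]
for ddl in setup:
    cursor.execute(ddl)

# Correlate remote and local data as if everything were local (transparency)
cursor.execute("""
    SELECT o.order_id, s.ss_net_paid
    FROM ora_orders o
    JOIN store_sales s ON o.order_id = s.ss_ticket_number
    FETCH FIRST 10 ROWS ONLY
""")
print(cursor.fetchall())
```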
Below is a list of data sources supported for Big SQL. For the latest information, visit
https://www-304.ibm.com/support/entdocview.wss?uid=swg27044495
Data Source           | Supported Versions     | Notes
DB2 for z/OS®         | 8.x, 9.x, 10.x         |
DB2® for LUW          | 9.7, 9.8, 10.1, 10.5   |
Oracle                | 11g, 11gR1, 11gR2, 12c |
Teradata              | 12, 13, 14, 15         | Not supported on POWER systems.
Netezza               | 4.6, 5.0, 6.0, 7.2     | Not supported on POWER systems.
Informix              | 11.5                   |
Microsoft SQL Server  | 2012, 2014             |
ODBC                  | 3.0 or later           |
Big SQL now comes pre-bundled with DataDirect drivers from Progress, which eliminates the need to download drivers and enables easy setup
Spark connectors enhance Big SQL's connectivity to other data sources and also operationalize machine learning models
Big SQL is ANSI SQL compliant and can also run Oracle SQL, DB2 SQL and Netezza SQL.
Big SQL tables can be created for data residing in HDFS, Hive, HBase, Object Store and WebHDFS.
Big SQL is aware of remote indexes and table statistics to optimize federated queries.
Big SQL provides:
a unified view of all your tables, with federated query support to external databases
data stored optimally as Hive or HBase tables, or read with Spark – optimized for the expected workload
secured under a single security model (including row/column security across all)
with the ability to join across all datasets using standard ANSI SQL across all types of tables (as sketched below)
with Oracle, NZ, DB2 extensions if you prefer
using a single database connection and driver
No other SQL engine for Hadoop even comes close.
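A sketch of what that cross-source capability looks like in practice (all table and column names are hypothetical): a Parquet-backed Big SQL table, an HBase-backed table, and a federated nickname joined in one ANSI SQL statement over a single connection:
SELECT o.order_id, c.customer_name, p.last_click_ts
FROM ora_orders o                  -- federated nickname over a remote RDBMS
JOIN customers_hive c              -- Big SQL table stored as Parquet in HDFS
  ON o.customer_id = c.customer_id
JOIN profile_hbase p               -- Big SQL table backed by HBase
  ON c.customer_id = p.customer_id
WHERE o.order_date >= '2017-01-01';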
For Albert, from Branch A, only rows from his branch are listed because he has access to records with same branch name as his.
Bonnie, from Branch B, can only view records from her branch and does not have access to view salary.
But Cindy, a security administrator, can view records from all branches and also view all columns.
When customers take the time to review the security capabilities of other SQL Engines on Hadoop, they realize a lot of things they are used to are missing. Big SQL supports all of the important features.
Separation of Duties – an important feature if you really want to operationalize the environment. Most enterprise customers don’t like having a database “super user” who can manage the environment and also see all data (like root access for the SQL engine). Big SQL can define Database Administrator roles that allow administration of the environment without permitting access to the data, for example.
Big SQL also provides row-level and column-level security using row permissions (CREATE PERMISSION) or column masks (CREATE MASK) on tables.
Note that both Big SQL ROLES (same concept as DB2 or Oracle roles) and OS Groups (Local and LDAP) are supported.
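A minimal sketch of those statements, using hypothetical table, column, group, and role names (the syntax is the Db2-style row and column access control that Big SQL inherits):
-- row permission: users in the BRANCH_A group see only Branch A rows; the security admin role sees all
CREATE PERMISSION branch_rows ON employee
  FOR ROWS WHERE (VERIFY_GROUP_FOR_USER(SESSION_USER, 'BRANCH_A') = 1 AND branch = 'A')
              OR VERIFY_ROLE_FOR_USER(SESSION_USER, 'SEC_ADMIN') = 1
  ENFORCED FOR ALL ACCESS
  ENABLE;
-- column mask: only the security admin role sees the real salary value
CREATE MASK salary_mask ON employee FOR COLUMN salary
  RETURN CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'SEC_ADMIN') = 1
              THEN salary ELSE NULL END
  ENABLE;
-- row and column access control must then be activated on the table
ALTER TABLE employee ACTIVATE ROW ACCESS CONTROL ACTIVATE COLUMN ACCESS CONTROL;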
*TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC)
Four broad types of queries in the set of 99. Reporting queries, Ad-hoc queries, Iterative OLAP queries, Data mining queries. Minimum of 4 concurrent users running all 99 queries to publish a valid result for an official TPC-DS benchmark.
Results presented here are not an official TPC-DS benchmark result. But running TPC-DS queries has become the de-facto benchmark for SQL over Hadoop engines.
Data Prep:
Generate 100TB of raw text (CSV) using the tool provided in the TPC-DS toolkit
Load the text into Parquet tables (compression enabled, using Snappy)
Summary of cluster tuning…
Tables partitioned on:
_SALES tables partitioned on *_SOLD_DATE_SK
_RETURNS tables partitioned on *_RETURN_DATE_SK
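Illustrative Hive-style DDL for that partitioning scheme (column list abbreviated; the real table definitions come from the TPC-DS toolkit):
CREATE TABLE store_sales (
  ss_item_sk      BIGINT,
  ss_customer_sk  BIGINT,
  ss_quantity     INT,
  ss_net_paid     DECIMAL(7,2)
  -- ...remaining TPC-DS columns omitted
)
PARTITIONED BY (ss_sold_date_sk BIGINT)   -- *_SALES tables partitioned on *_SOLD_DATE_SK
STORED AS PARQUET;                        -- Snappy compression enabled at load time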
Spark SQL made impressive strides in version 2.0, being able to run all 99 TPC-DS queries out of the box (with the allowed minor query modifications) at 1GB and 1TB. However, these tests show that as the volume of data grows beyond 1TB, Spark 2.1 struggles to complete all 99 queries.
Spark SQL was able to complete the following 83 queries at 100TB: 1,2,3,4,5,7,8,9,10,11,12,13,17,18,19,21,22,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,55,56,57,58,59,60,61,62,63,65,66,67,68,69,70,71,72,73,76,78,79,80,81,82,83,85,86,87,88,89,90,91,92,93,96,97,99
For an apples-to-apples comparison, the 83 queries which Spark SQL could successfully complete (listed above) were executed in a single-stream run on both Big SQL and Spark SQL.
The same 83 queries were then executed in 4 concurrent streams on both Big SQL and Spark SQL.
First – let’s look at efficiency. Big SQL actually used less average CPU compared to Spark SQL.
User CPU time is time spent on the processor running your program's code (or code in libraries); System CPU time is the time spent running code in the operating system kernel on behalf of your program. Another way to look at system CPU time is it’s the time the processor worked on operating system's functions connected to that specific program (e.g. forking a process / process management)
In general, System CPU is like wasted CPU cycles – so you can see that Big SQL is roughly 3X more efficient in this regard.
Charts shows average memory consumption across each node in the cluster whilst the 4-stream workload was running.
Charts show average and max i/o throughput per node during the 4-stream test.
Results of the comparison summarized in a single slide.
There is still no substitute for product maturity, and the strength of Big SQL comes from its heritage. The core of Big SQL is built around IBM’s Relational Database technology which has been running in production systems for 25 plus years. As such, Big SQL has a hardened, mature, and very efficient SQL optimizer and runtime – this is the key to its world class scalability and performance. It’s this lineage which raises Big SQL above other vendors in the SQL over Hadoop space.
Main points:
This slide summarizes the results of the Big SQL and Hive evaluation.
Big SQL 5.0.1 completed the Hadoop-DS 6-stream concurrency test 2.3 times faster than Hive 2.1, both engines using 85 queries.
Big SQL can run all 99 queries in the Hadoop-DS/TPC-DS workload, but for this exercise we compared Big SQL and Hive using a subset with 85 queries. 14 queries were dropped from the query set because we were unable to come up with compliant and working versions of these queries for Hive. By “compliant”, we mean any changes required to get the query to run adhere to the “minor query modifications” allowed by the TPC-DS specification. By “working”, we mean the query completed successfully.
Scale factor was 10 TB. The data was loaded into the ORC file format with Zlib compression.
We installed HDP 2.6.1 which included Hive 2.1. We installed Big SQL 5.0.1 on top of that. We ran Hive 2.1 with LLAP on TEZ.
For both Big SQL and Hive, the slowest query was query 67. With 6 streams, the workload consisted of 85 x 6 = 510 total queries. For 80% of the queries (out of 510 total queries), Big SQL was faster.
Big SQL used 1.5 times fewer CPU cycles to complete the workload. Big SQL consumed a larger percentage of the available CPU during the runs, but since it finished sooner, it completed the workload using fewer total CPU cycles.
This evaluation was done on a 17-Node cluster (Lenovo x3650 M4’s) with 1 management node and 16 compute nodes. The numbers for cores, memory, and disk space shown here are the totals across the 16 compute nodes. The network was 10 gigabit ethernet.
Both the Hive and Big SQL tables were partitioned.
There is still no substitute for product maturity, and the strength of Big SQL comes from its heritage. The core of Big SQL is built around IBM’s Relational Database technology which has been running in production systems for 25 plus years. As such, Big SQL has a hardened, mature, and very efficient SQL optimizer and runtime – this is the key to its world class scalability and performance. It’s this lineage which raises Big SQL above other vendors in the SQL over Hadoop space.
These proof points will be demonstrated during the course of this presentation.
The great thing about Hadoop is that there are so many great open source engines available. Rather than trying to re-invent storage options, Big SQL exploits what is there…. HBase is another engine that can provide rapid, scalable, lookups with update and delete support.
NM = YARN Node Manager
This is a high level architectural view of how Big SQL integrates with YARN through Slider. The Slider Client is a general purpose client for applications like Big SQL to interoperate with YARN. The client defines a set of APIs that are to be implemented by the application (Big SQL).
So the red cube/package represents a slider package included with Big SQL that implements the slider APIs. When we start up Big SQL, the slider client negotiates resource allocation with YARN and the Big SQL workers are now started in containers.
[click]
If we want to free up some resources back to YARN (not much demand for Big SQL at the moment/time of day), we use slider to stop a subset (or all) Big SQL workers. The memory formerly allocated to Big SQL workers are now available to other workloads in Hadoop.
Instead of running one (potentially very large memory) worker per host, elastic boost enables the ability to run multiple smaller Big SQL workers per host.
We can start and stop workers fully independently of each other. Therefore, while the same maximum amount of memory/CPU is consumed when all workers are on, we get much finer granularity of resource usage when scaling capacity up and down with more workers.
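As a rough sketch of how that elasticity is driven (the application and component names below are assumptions, not Big SQL's actual ones), the Slider flex command grows or shrinks the worker set:
# shrink the Big SQL worker set to free resources back to YARN
slider flex bigsql --component bigsql_worker 8
# later, grow it back when demand returns
slider flex bigsql --component bigsql_worker 16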
Cluster was designed for Spark SQL; hence it has lots of memory, fast SSDs and a high bandwidth network.
Almost half of the failed Spark SQL queries can be classified as complex – those which do deep analytics – indicating that Spark SQL struggles most with complex queries at larger scale factors.
At 100TB, some queries returned huge amounts of data (millions of rows, with no FETCH FIRST ... ONLY clause), producing result sets of around 40GB.
Spread Spark's local directories across the data disks, comma-delimited.
Lots of tasks across lots of stages create a huge event queue, so its capacity must be increased.
0.8 for memory.fraction: the fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached-data eviction occur; raising the value from the default 0.6 to 0.8 helps reduce spills.
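A sketch of what these tunings might look like in spark-defaults.conf (the directory paths and the event-queue size are illustrative assumptions, not the exact values used in the tests):
# spread shuffle/spill files across the data disks, comma-delimited
spark.local.dir /data1/spark,/data2/spark,/data3/spark
# enlarge the listener-bus event queue so it can absorb events from many tasks in many stages
spark.scheduler.listenerbus.eventqueue.size 100000
# give more of the heap to execution/storage to reduce spilling (default is 0.6)
spark.memory.fraction 0.8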
Chart shows the settings tuned for Big SQL.
Those default values for the properties highlighted in green will be changed in v4.3 – meaning these 5 values will not require further tuning.
INSTANCE_MEMORY and DB2_CPU_BINDING are configured to allow Big SQL to use 97% of the memory on each node (the other 3% being used for the OS and other Hadoop components, such as the HDFS Namenode), and 94% of the CPUs on each node. 94% was used as it allowed 2 cores to be reserved for the OS and the HDFS Namenode.
DFT_DEGREE was dialled down because parallelism is achieved by defining multiple Big SQL workers per node – so the need for SMP parallelism is reduced.
Actual settings for sort space are in 4k pages:
SORTHEAP 1146880 ; SHEAPTHRES_SHR 18350080
And for Bufferpools in 32k pages:
BP 491520
MLN_RANDOMIZED scheduler allocation algorithm was specifically designed for environments with multiple Big SQL workers per node.
Table shows TPC-DS based benchmarks published by SQL over Hadoop vendors during 2016.
https://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/
http://blog.cloudera.com/blog/2016/08/bi-and-sql-analytics-with-apache-impala-incubating-in-cdh-5-8-3x-faster-on-secure-clusters/
https://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/
http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
http://radiantadvisors.com/whitepapers/sqlonhadoopbenchmark2016/
https://issuu.com/radiantadvisors/docs/radiantadvisors_sql-on-hadoop_bench
*1 Radiant Advisors benchmark sponsored by Teradata
*2 Only one of the 24 queries tested references a fact table other than store_sales. So can this really be claimed as a 15TB benchmark, when the vast majority of queries reference only approximately half of the data set?
HDB/HAWQ are not in the chart because they published no new benchmarks in 2016. Vertica claim to run 98 of the queries in a presentation from last August at the HPE Big Data conference; data volume, release levels, and configuration were not specified, so it is not included in the table.