Slides from the May 2018 St. Louis Big Data Innovation, Data Engineering, and Analytics (Big Data IDEA) user group meeting. The presentation focused on Data Modeling in Hive.
Ranger provides centralized authorization policies for Hadoop resources. A new Ranger Hive Metastore security agent was introduced to address gaps in authorizing Hive CLI and synchronizing policies with object changes. It enforces authorization for Hive CLI, synchronizes access control lists when objects are altered via DDL in HiveServer2, and provides consistent authorization between Hive and HDFS resources. The agent has been implemented through custom classes extending Hive authorization hooks. A demo showed the new capabilities working with Ranger, Hive, and Hadoop.
[Bespin Global Partner Session] Building a data analytics environment based on distributed data integration (Data Lake) - Bespin Global 장익... - Amazon Web Services Korea
Although it varies by company, most enterprises today already have a data analytics environment in place and analyze their data on top of it. Even so, business users complain that the data they want to analyze is missing or that changing business requirements are not reflected, while the IT operations teams that provide the analytics environment struggle to deliver it in a timely manner as business requirements change. As a solution, this session introduces a case study of building an AWS Cloud-based analytics environment that can centrally manage the structured and unstructured files that exist as databases in operational systems or are created manually on business users' PCs, and that allows the infrastructure to be scaled and changed more flexibly.
Replay link: https://youtu.be/YvYfNZHMJkI
The evolution of the big data platform @ Netflix (OSCON 2015) - Eva Tse
The document summarizes the evolution of Netflix's big data platform to meet the challenges of their growing scale. Key points include:
- Netflix now has over 65 million members in over 50 countries and supports over 1000 devices. They stream over 10 billion hours of content per quarter.
- Their traditional business intelligence stack could no longer meet the demands of scale. They transitioned to using AWS services like S3 for storage and open source tools like Kafka, Cassandra, and Parquet to enable real-time analytics and machine learning on their massive data volumes.
- Netflix has adopted an open source-first strategy and contributes back to the community as their own tools evolve to meet processing needs and achieve the necessary scale.
S3 Versioning allows multiple versions of objects to be stored in a single S3 bucket. When versioning is enabled for a bucket, S3 automatically assigns a unique version ID to each object upload or change. This prevents accidental overwriting and enables restoration of previous versions. Versioning is configured at the bucket level and version IDs are opaque strings assigned by S3 to distinguish object versions.
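For readers who want to try this behavior directly, here is a minimal sketch using boto3; the bucket name and object key are placeholders invented for the example, not anything from the original material.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # placeholder bucket name

# Versioning is configured at the bucket level.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Each write to the same key now receives its own opaque version ID,
# so the earlier object is never overwritten in place.
v1 = s3.put_object(Bucket=bucket, Key="report.csv", Body=b"first draft")
v2 = s3.put_object(Bucket=bucket, Key="report.csv", Body=b"second draft")
print(v1["VersionId"], v2["VersionId"])

# Reading with an explicit VersionId restores the earlier contents.
old = s3.get_object(Bucket=bucket, Key="report.csv", VersionId=v1["VersionId"])
print(old["Body"].read())
```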
Modern Machine Learning (ML) represents everything as vectors, from documents, to videos, to user behavior. This representation makes it possible to accurately search, retrieve, rank, and classify different items by similarity and relevance.
Running real-time applications that rely on large numbers of such high dimensional vectors requires a new kind of data infrastructure. In this talk we will discuss the need for such infrastructure, the algorithmic and engineering challenges in working with vector data at scale, and open problems we still have no adequate solutions for.
Time permitting, I will introduce Pinecone as a solution to some of these challenges.
Some Iceberg Basics for Beginners (CDP).pdf - Michael Kogan
The document describes the recommended Iceberg workflow, which includes 8 steps (a rough sketch of a few of them follows the list):
1) Create Iceberg tables from existing datasets or sample datasets
2) Batch insert data to prepare for time travel scenarios
3) Create security policies for fine-grained access control
4) Build BI queries for reporting
5) Build visualizations from query results
6) Perform time travel queries to audit changes
7) Optimize partition schemas to improve query performance
8) Manage and expire snapshots for table maintenance
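As a hedged illustration of steps 1, 2, 6 and 8, the sketch below uses Spark SQL with an Iceberg catalog; the catalog, database and table names are assumptions made for the example, not taken from the deck.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "demo".
spark = SparkSession.builder.appName("iceberg-basics").getOrCreate()

# Step 1: create an Iceberg table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo.db.flights (
    flight_date DATE, carrier STRING, dep_delay INT
  ) USING iceberg
""")

# Step 2: batch inserts, so there are snapshots to travel between.
spark.sql("INSERT INTO demo.db.flights VALUES (DATE'2023-01-01', 'AA', 5)")
spark.sql("INSERT INTO demo.db.flights VALUES (DATE'2023-01-02', 'UA', -3)")

# Step 6: time travel to the first snapshot to audit what changed.
first = spark.sql(
    "SELECT snapshot_id FROM demo.db.flights.snapshots ORDER BY committed_at"
).first().snapshot_id
spark.sql(f"SELECT * FROM demo.db.flights VERSION AS OF {first}").show()

# Step 8: expire old snapshots as routine table maintenance.
spark.sql("""
  CALL demo.system.expire_snapshots(
    table => 'db.flights',
    older_than => TIMESTAMP '2024-01-01 00:00:00')
""")
```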
Bridge to Cloud: Using Apache Kafka to Migrate to GCP - confluent
Watch this talk here: https://www.confluent.io/online-talks/bridge-to-cloud-apache-kafka-migrate-gcp
Most companies start their cloud journey with a new use case or a new application. Sometimes these applications can run independently in the cloud, but oftentimes they need data from the on-premises datacenter. Existing applications will migrate slowly, but will need a strategy and the technology to enable a multi-year migration.
In this session, we will share how companies around the world are using Confluent Cloud, a fully managed Apache Kafka® service, to migrate to Google Cloud Platform. By implementing a central-pipeline architecture using Apache Kafka to sync on-prem and cloud deployments, companies can accelerate migration times and reduce costs.
Register now to learn:
-How to take the first step in migrating to GCP
-How to reliably sync your on-premises applications using a persistent bridge to the cloud
-How Confluent Cloud can make this daunting task simple, reliable and performant
In modern applications built with a microservices architecture, it is important to use not only relational databases but also databases suited to the characteristics of each microservice. Because open-source databases keep evolving and growing more alike, choosing the right database for your service is still a difficult task. In this session we review the open-source databases suited to various workloads and introduce the AWS managed database services that map to them.
Why ODS? The Role Of The ODS In Today’s BI World And How Oracle Technology H... - C. Scyphers
The document discusses operational data stores (ODS) and their role in business intelligence. An ODS bridges the gap between operational systems optimized for transactions (OLTP) and analytical systems optimized for reporting (OLAP). It provides current, integrated data to support near real-time analysis needs. The document outlines four classes of ODS based on how frequently they are updated from source systems, ranging from real-time (Class I) to periodically loading aggregated data from a data warehouse (Class IV). ODS examples for a credit card company are also provided.
Microsoft Security - New Capabilities In Microsoft 365 E5 Plans - David J Rosenthal
Cyberspace is the new battlefield:
We’re seeing attacks on civilians and organizations from nation states. Attacks are no longer just against governments or enterprise systems directly. We’re seeing attacks against private property—the mobile devices we carry around every day, the laptops on our desks—and public infrastructure. What started a decade-and-a-half ago as a sense that there were some teenagers in the basement hacking away has moved far beyond that. It has morphed into sophisticated international organized crime and, worse, sophisticated nation state attacks.
Personnel and resources are limited:
According to an annual survey of 620 IT professionals across North America and Western Europe from ESG, 51% of respondents claim their organization has a problematic shortage of cybersecurity skills—up from 23% in 2014. The security landscape is getting more complicated and the stakes are rising, but many enterprises don’t have the resources they need to meet their security needs.
Virtually anything can be corrupted:
The number of connected devices in 2018 is predicted to top 11 billion – not including computers and phones. As we connect virtually everything, anything can be disrupted. Everything from the cloud to the edge needs to be considered and protected.
Build a Real-time Streaming Data Visualization System with Amazon Kinesis Ana... - Amazon Web Services
Amazon Kinesis Analytics allows users to analyze streaming data using standard SQL queries. It connects to streaming data sources like Kinesis Streams or Kinesis Firehose and allows users to write SQL code to process the data in real-time. The processed data can then be delivered to multiple destinations like S3, Redshift, or additional streams. Common uses of Kinesis Analytics include generating time series analytics, creating real-time alarms and notifications, and feeding real-time dashboards. An example was provided of a real-time dashboard that aggregates streaming user data into counts by OS, quadrant, etc and outputs the results to DynamoDB every second for display on a dashboard.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
AWS May Webinar Series - Getting Started: Storage with Amazon S3 and Amazon G... - Amazon Web Services
If you are interested to know more about AWS Chicago Summit, please use the following to register: http://amzn.to/1RooPPL
Amazon S3 and Amazon Glacier provide developers and IT teams with secure, durable, highly-scalable object storage with no minimum fees or setup costs. In this webcast, we will provide an introduction to each service, dive deep into key features of Amazon S3 and Amazon Glacier, and explore different use cases that these services optimize.
Learning Objectives: • Business value of Amazon S3 and Amazon Glacier • Leveraging S3 for web applications, media delivery, big data analytics and backup • Leveraging Amazon Glacier to build cost effective archives • Understand the life cycle management of AWS' storage services
- Amazon Kinesis is a real-time data streaming platform that allows for processing of streaming data in the AWS cloud. It includes Kinesis Streams, Kinesis Firehose, and Kinesis Analytics.
- Kinesis Streams allows users to build custom applications to process or analyze streaming data. It is a high-throughput, low-latency service for real-time data processing over large, distributed data streams.
- Key concepts of Kinesis Streams include shards for partitioning streaming data, producers for ingesting data, data records as the unit of stored data, and consumers for reading and processing streaming data.
Amazon S3 hosts trillions of objects and is used for storing a wide range of data, from system backups to digital media. In this presentation from the Amazon S3 Masterclass webinar we explain the features of Amazon S3, from static website hosting through server-side encryption to Amazon Glacier integration. This webinar dives deep into the feature set of Amazon S3 to give a rounded overview of its capabilities, looking at common use cases, APIs and best practice.
See a recording of this video here on YouTube: http://youtu.be/VC0k-noNwOU
Check out future webinars in the Masterclass series here: http://aws.amazon.com/campaigns/emea/masterclass/
View the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
Amazon Kinesis is a managed service for real-time processing of streaming big data at any scale. It allows users to create streams to ingest and process large amounts of data in real-time. Kinesis provides high durability, performance, and elasticity through features like automatic shard management and the ability to seamlessly scale streams. It also offers integration with other AWS services like S3, Redshift, and DynamoDB for storage and analytics. The document discusses various aspects of Kinesis including how to ingest and consume data, best practices, and advantages over self-managed solutions.
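To make the producer/consumer ideas above concrete, here is a minimal boto3 sketch; the stream name, region and payload are placeholders, and the stream is assumed to already exist.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream-demo"  # placeholder; assumed to exist already

# Producer: write a data record; the partition key maps the record to a shard.
kinesis.put_record(
    StreamName=stream,
    Data=json.dumps({"user": "u123", "event": "play"}).encode("utf-8"),
    PartitionKey="u123",
)

# Consumer: read from the start of the first shard.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```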
This presentation provides an overview of Amazon Elastic Block Store (EBS) and key performance concepts. EBS provides persistent block level storage volumes for use with EC2 instances. It discusses the different volume types (standard and provisioned IOPS), factors that impact performance like block size and queue depth, and best practices for architecting storage for performance and availability. The presentation also provides examples of how enterprises and applications use EBS and guidelines for minimum, production and large-scale usage.
How to Ingest 16 Billion Records Per Day into your Hadoop Environment - DataWorks Summit
In modern society, mobile networks have become one of the most important infrastructure components. The availability of a mobile network has become essential in areas like health care and machine-to-machine communication.
In 2016, Telefónica Germany began the Customer Experience Management (CEM) project to derive KPIs from the mobile network describing participants' experience while using Telefónica's mobile network. These KPIs help to plan and create a better mobile network where improvements are indicated.
Telefónica is using the Hortonworks HDF solution to ingest the 16 billion records a day generated by CEM. To get the best out of HDF's capabilities, some customizations have been made:
1.) Custom processors have been written to comply with data privacy rules.
2.) NiFi is running in Docker containers within a Kubernetes cluster to increase the reliability of the ingestion system.
Finally, the data is presented in Hive tables and Kafka topics to be processed further. In this talk, we will present the CEM use case and how it is technically implemented as stated in (1) and (2). The most interesting part for the audience should be the experience we have gained using HDF in a Docker/Kubernetes environment, since this solution is not yet officially supported.
Review of how AWS EC2 storage options have evolved, and making the right selection for your workload. Covering Amazon Elastic Block Store (EBS) and Amazon Elastic File System (EFS).
Kafka Streams State Stores Being Persistent - confluent
This document discusses Kafka Streams state stores. It provides examples of using different types of windowing (tumbling, hopping, sliding, session) with state stores. It also covers configuring state store logging, caching, and retention policies. The document demonstrates how to define windowed state stores in Kafka Streams applications and discusses concepts like grace periods.
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3. Level 200
In this session, we will show you how easy it is to start querying your data stored in Amazon S3, with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
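The flow described above (define a schema over data already in S3, then query it interactively) can be scripted with boto3 roughly as follows; the database, table, bucket paths and result location are placeholder values used only for illustration.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")
RESULTS = "s3://my-athena-results/"  # placeholder query-result location

def run_query(sql: str):
    """Submit a query to Athena and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
        time.sleep(1)

# Create the schema over data already in S3 (nothing is loaded into Athena).
run_query("""
  CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (ip string, ts string, url string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
  LOCATION 's3://my-log-bucket/logs/'
""")

# Run an interactive query against it with standard SQL.
for row in run_query("SELECT url, count(*) AS hits FROM access_logs GROUP BY url LIMIT 10"):
    print(row)
```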
The document provides an overview of the IBM FileNet P8 platform, including its core components of Content Engine, Process Engine, and Workplace/Workplace XT. It describes the key functions of the Content Engine such as storing and managing objects and content, full-text indexing, storage options, and APIs. It also outlines the main roles and configurations of the Process Engine for workflow management. Finally, it discusses the user interfaces of Workplace/Workplace XT and some additional expansion products that integrate with FileNet P8.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Optimizing Dell PowerEdge Configurations for Hadoop - Mike Pittaro
This document discusses optimizing Dell PowerEdge server configurations for Hadoop deployments. It recommends tested server configurations like the PowerEdge R720 and R720XD that provide balanced compute and storage. It also recommends a reference architecture using these servers along with networking best practices and validated software configurations from Cloudera to provide a proven, optimized big data platform.
The document discusses Seagate's plans to integrate hard disk drives (HDDs) with flash storage, systems, services, and consumer devices to deliver unique hybrid solutions for customers. It notes Seagate's annual revenue, employees, manufacturing plants, and design centers. It also discusses Seagate exploring the use of big data analytics and Hadoop across various potential use cases and outlines Seagate's high-level plans for Hadoop implementation.
This document discusses big data and the Apache Hadoop framework. It defines big data as large, complex datasets that are difficult to process using traditional tools. Hadoop is an open-source framework for distributed storage and processing of big data across commodity hardware. It has two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing. HDFS stores data across clusters of machines with redundancy, while MapReduce splits tasks across processors and handles shuffling and sorting of data. Hadoop allows cost-effective processing of large, diverse datasets and has become a standard for big data.
Hadoop and the Data Warehouse: When to Use Which - DataWorks Summit
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Hadoop and SQL: Delivering Analytics Across the Organization - Seeling Cheung
This document summarizes a presentation given by Nicholas Berg of Seagate and Adriana Zubiri of IBM on delivering analytics across organizations using Hadoop and SQL. Some key points discussed include Seagate's plans to use Hadoop to enable deeper analysis of factory and field data, the evolving Hadoop landscape and rise of SQL, and a performance comparison showing IBM's Big SQL outperforming Spark SQL, especially at scale. The document provides an overview of Seagate and IBM's strategies and experiences with Hadoop.
The Future of Analytics, Data Integration and BI on Big Data Platforms - Mark Rittman
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
PyData: The Next Generation | Data Day Texas 2015 - Cloudera, Inc.
This document discusses the past, present, and future of Python for big data analytics. It provides background on the rise of Python as a data analysis tool through projects like NumPy, pandas, and scikit-learn. However, as big data systems like Hadoop became popular, Python was not initially well-suited for problems at that scale. Recent projects like PySpark, Blaze, and Spartan aim to bring Python to big data, but challenges remain around data formats, distributed computing interfaces, and competing with Scala. The document calls for continued investment in high performance Python tools for big data to ensure its relevance in coming years.
The document provides an overview of big data and Hadoop, discussing what big data is, current trends and challenges, approaches to solving big data problems including distributed computing, NoSQL, and Hadoop, and introduces HDFS and the MapReduce framework in Hadoop for distributed storage and processing of large datasets.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using the HDS Hadoop reference architecture. For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Hadoop and the Data Warehouse: Point/Counter Point - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnlaysis.com for more information.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... - DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data has also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems.
This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R.
Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.
1. We provide database administration and management services for Oracle, MySQL, and SQL Server databases.
2. Big Data solutions need to address storing large volumes of varied data and extracting value from it quickly through processing and visualization.
3. Hadoop is commonly used to store and process large amounts of unstructured and semi-structured data in parallel across many servers.
Slides from the August 2021 St. Louis Big Data IDEA meeting from Sam Portillo. The presentation covers AWS EMR including comparisons to other similar projects and lessons learned. A recording is available in the comments for the meeting.
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes; a short sketch of a few of these follows.
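A minimal PySpark sketch of the batch-write, append and time-travel behavior, assuming a Spark session configured with the Delta Lake extensions; the table path is a placeholder.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()
path = "s3a://my-lake/events"  # placeholder table location

# Each ACID write becomes a new version of the table.
spark.range(0, 100).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)
spark.range(100, 200).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table as of version 0, before the append.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 100

# The same table can also feed a streaming read, the "Lakehouse" pattern
# where batch and streaming analytics share one copy of the data.
stream = spark.readStream.format("delta").load(path)
```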
Great Expectations is an open-source Python library that helps validate, document, and profile data to maintain quality. It allows users to define expectations about data that are used to validate new data and generate documentation. Key features include automated data profiling, predefined and custom validation rules, and scalability. It is used by companies like Vimeo and Heineken in their data pipelines. While helpful for testing data, it is not intended as a data cleaning or versioning tool. A demo shows how to initialize a project, validate sample taxi data, and view results.
Automate your data flows with Apache NiFi - Adam Doyle
Apache Nifi is an open source dataflow platform that automates the flow of data between systems. It uses a flow-based programming model where data is routed through configurable "processors". Nifi was donated to the Apache Foundation by the NSA in 2014 and has over 285 processors to interact with data in various formats. It provides an easy to use UI and allows users to string together processors to move and transform data within "flowfiles" through the system in a secure manner while capturing detailed provenance data.
Apache Iceberg Presentation for the St. Louis Big Data IDEA - Adam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with Hive and Spark.
Slides from the January 2021 St. Louis Big Data IDEA meeting by Tim Bytnar regarding using Docker containers for a localized Hadoop development cluster.
The document discusses Cloudera's enterprise data cloud platform. It notes that data management is spread across multiple cloud and on-premises environments. The platform aims to provide an integrated data lifecycle that is easier to use, manage and secure across various business use cases. Key components include environments, data lakes, data hub clusters, analytic experiences, and a central control plane for management. The platform offers both traditional and container-based consumption options to provide flexibility across cloud, private cloud and on-premises deployment.
Operationalizing Data Science - St. Louis Big Data IDEA - Adam Doyle
The document provides an overview of the key steps for operationalizing data science projects:
1) Identify the business goal and refine it into a question that can be answered with data science.
2) Acquire and explore relevant data from internal and external sources.
3) Cleanse, shape, and enrich the data for modeling.
4) Create models and features, test them, and check with subject matter experts.
5) Evaluate models and deploy the best one with ongoing monitoring, optimization, and explanation of results.
Slides from the December 2019 St. Louis Big Data IDEA meetup group. Jon Leek discussed how the St. Louis Regional Data Alliance ingests, stores, and reports on their data.
Tailoring machine learning practices to support prescriptive analytics - Adam Doyle
Slides from the November St. Louis Big Data IDEA. Anthony Melson talked about how to engineer machine learning practices to better support prescriptive analytics.
Synthesis of analytical methods for data-driven decision-making - Adam Doyle
This document summarizes Dr. Haitao Li's presentation on synthesizing analytical methods for data-driven decision making. It discusses the three pillars of analytics - descriptive, predictive, and prescriptive. Various data-driven decision support paradigms are presented, including using descriptive/predictive analytics to determine optimization model inputs, sensitivity analysis, integrated simulation-optimization, and stochastic programming. An application example of a project scheduling and resource allocation tool for complex construction projects is provided, with details on its optimization model and software architecture.
Data Engineering and the Data Science Lifecycle - Adam Doyle
Everyone wants to be a data scientist. Data modeling is the hottest thing since Tickle Me Elmo. But data scientists don’t work alone. They rely on data engineers to help with data acquisition and data shaping before their model can be developed. They rely on data engineers to deploy their model into production. Once the model is in production, the data engineer’s job isn’t done. The model must be monitored to make sure that it retains its predictive power. And when the model slips, the data engineer and the data scientist need to work together to correct it through retraining or remodeling.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while remaining performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
State of Artificial Intelligence Report 2023 - kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
2. Agenda
• What’s the Big Data Innovation, Data Engineering, Analytics Group?
• Data Modelling in Hadoop
• Questions
5. Which led to some questions
• What is the future of the Hadoop ecosystem?
• What is the dividing line between Spark and Hadoop?
• What are the big players doing?
• How does the push to cloud technologies affect Hadoop usage?
• How does Streaming come into play?
6. And then our answer
• Hadoop is here to stay, but it will make the most strides as a machine learning platform.
• Spark can perform many of the same tasks that elements of the Hadoop ecosystem can, but it is missing some existing features out of the box.
• Cloudera, Hortonworks, and MapR are positioning themselves as data processing platforms with roots in Hadoop, but other aspirations. For example, Cloudera is positioning itself as a machine learning platform.
• The push to cloud means that the distributed filesystem of HDFS may be less important to cloud-based deployments. But Hadoop ecosystem projects are adapting to be able to work with cloud sources.
• The Hadoop ecosystem projects have proven patterns for ingesting streaming data and turning it into information.
7. Introducing …
• We’re now going to be the St. Louis Big Data Innovation, Data Engineering, and Analytics Group
• Or, more simply put: St. Louis Big Data IDEA
8. So what is the STL Big Data IDEA interested in?
• Local Companies
• Big Data
– Hadoop
– Cloud deployments
– Cloud-native technologies
– Spark
– Kafka
• Innovation
– New Big Data projects
– New Big Data services
– New Big Data applications
• Data Engineering
– Streaming data
– Batch data analysis
– Machine Learning Pipelines
– Data Governance
– ETL @ Scale
• Analytics
– Visualization
– Machine Learning
– Reporting
– Forecasting
So What is the STL Big Data IDEA interested in?
8
9. Confidential and Proprietary to Daugherty Business Solutions
• Scott Shaw has been with Hortonworks for four years.
• He is the author of four books including Practical Hive and Internet of Things and Data
Analytics Handbook.
• Scott will be helping our group find speakers
in the open source community.
Please help me welcome Scott to the group
in his new role
Introducing our New Board Member
9
10. Confidential and Proprietary to Daugherty Business Solutions 10
Agenda
• The Schema-on-Read Promise
• File formats and Compression formats
• Schema Design – Data Layout
• Indexes, Partitioning and Bucketing
• Join Performance
• Hadoop SQL Boost – Tez, Cost Based Optimizations & LLAP
• Summary
11. Confidential and Proprietary to Daugherty Business Solutions
Introducing our Speakers
Adam Doyle
• Co-Organizer, St. Louis Big Data IDEA
• Big Data Community Lead, Daugherty
Business Solutions
Drew Marco
• Board Member & Secretary, TDWI
• Data and Analytics Line of Service
Leader, Daugherty Business Solutions
11
12. Confidential and Proprietary to Daugherty Business Solutions 12
Schema On Read
Schema on Write
• Schemas are typically purpose-built and hard to change
• Generally loses the raw/atomic data as a source
• Requires considerable modeling/implementation effort before being able to work with the data
• If a certain type of data can't be confined in the schema, you can't effectively store or use it (if you can store it at all)
Schema on Read
• Slower results
• Preserves the raw/atomic data as a source
• Flexibility to add, remove, and modify columns
• Data may be riddled with missing or invalid data and duplicates
• Suited for data exploration; not recommended for repetitive querying and high performance
Real-world use of Hadoop / Hive that requires high-performing queries on large data sets demands up-front planning and data modeling.
13. Confidential and Proprietary to Daugherty Business Solutions 13
Schema Design – Data Layout
Normalization
• Pros
– Reduces data redundancy
– Decreases risk of inconsistent datasets
• Cons
– Requires re-organization of source data
– Less efficient storage
Denormalization
• Pros
– Minimizes disk seeks (i.e. FK relations)
– Storage in large, contiguous disk drive segments
• Cons
– Data duplication
– Increased risk of inconsistent data
– Often requires reorganizing the data (slower writes)
"The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data."
Source: Programming Hive by Dean Wampler, Jason Rutherglen, Edward Capriolo (O'Reilly Media)
14. Confidential and Proprietary to Daugherty Business Solutions 14
Introducing Our Use Case
The tables come from the MySQL "employees" sample database:
• Departments: dept_no, name
• Dept_emp: dept_no, emp_no, from_date, to_date
• Employees: emp_no, birth_date, first_name, last_name, gender, hire_date
• Dept_manager: dept_no, emp_no, from_date, to_date
• Titles: emp_no, title, from_date, to_date
• Salaries: emp_no, salary, from_date, to_date
Source: https://dev.mysql.com/doc/employee/en/
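As a rough, illustrative sketch (not from the slides) of how this dataset could be landed in Hive in a schema-on-read style, an external text table can be declared over a raw CSV export. The HDFS location, delimiter, and database name below are assumptions:

-- Hypothetical staging table over a CSV dump of the MySQL employees sample data.
CREATE DATABASE IF NOT EXISTS employees_stg;

CREATE EXTERNAL TABLE IF NOT EXISTS employees_stg.employees (
  emp_no     INT,
  birth_date DATE,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/employees/employees';  -- assumed HDFS path

The other five tables (departments, dept_emp, dept_manager, titles, salaries) would be declared the same way over their own directories; the later sketches in this deck assume these staging tables exist.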
15. Confidential and Proprietary to Daugherty Business Solutions 15
Data Storage Decisions
• Hadoop is a file system – there is no standard data storage format in Hadoop
• Optimal storage of data is determined by how the data will be processed
• Typical input data is in JSON, XML or CSV
Major Considerations: File Formats and Compression
16. Confidential and Proprietary to Daugherty Business Solutions 16
Parquet
• Faster access to data
• Efficient columnar compression
• Effective for select queries
17. Confidential and Proprietary to Daugherty Business Solutions 17
ORCFile
• High performance: splittable, columnar storage file
• Efficient reads: data is broken into large "stripes" for efficient reading
• Fast filtering: built-in index, min/max values, and metadata for fast filtering of blocks; Bloom filters if desired
• Efficient compression: decomposes complex row types into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive warehouse
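A minimal sketch of materializing the staged data in the columnar formats covered on slides 16 and 17; the table names and the ZLIB choice are illustrative, and the staging table is the one sketched under the use case:

-- ORC copy (ZLIB is the usual default ORC codec; SNAPPY is the common alternative).
CREATE TABLE employees_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB')
AS SELECT * FROM employees_stg.employees;

-- Parquet copy for comparison.
CREATE TABLE employees_parquet
STORED AS PARQUET
AS SELECT * FROM employees_stg.employees;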
18. Confidential and Proprietary to Daugherty Business Solutions 18
Avro
• JSON based schema
• Cross-language file format for Hadoop
• Schema evolution was primary goal – Good for Select * queries
• Schema segregated from data
• Row major format
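A hedged sketch of the Avro equivalent. STORED AS AVRO (Hive 0.14+) derives the Avro schema from the column list; older versions need the schema supplied via avro.schema.url or avro.schema.literal, and very old AvroSerDe releases lack a DATE type, in which case dates are kept as strings:

-- Row-major Avro copy of the staged employees table (illustrative name).
CREATE TABLE employees_avro
STORED AS AVRO
AS SELECT * FROM employees_stg.employees;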
19. Confidential and Proprietary to Daugherty Business Solutions
Comparison of file formats
Query 1: select count(*) from employees e join salaries s on s.emp_no = e.emp_no join titles t on t.emp_no = e.emp_no;
Query 2: select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp de on de.dept_no = d.dept_no join employees e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

              Text      Avro      ORC       Parquet
Query 1       42.696    48.934    25.846    26.081
Query 2       59.536    63.08     27.954    26.073
Table size    124M      134M      16.7M     30.5M
19
20. Confidential and Proprietary to Daugherty Business Solutions 20
Compression
• Not just for storage (data-at-rest) but also critical for disk/network I/O (data-in-motion)
• Splittability of the compression codec is an important consideration
Snappy
• High speed with reasonable compression
• Not splittable – only used with a container format such as Avro
LZO
• Optimized for speed as opposed to size
• Splittable, but requires additional indexing
• Not shipped with Hadoop
Gzip
• Optimized for size
• Write performance is about half that of Snappy
• Read performance as good as Snappy
• Smaller blocks = better performance
bzip2
• Optimized for size (9% better compared to Gzip)
• Splittable
• Poor performance; primary use is archival on Hadoop
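A hedged sketch of how a codec is chosen in practice; the property names below are the commonly used ones, but defaults vary by Hive version and distribution, so verify locally:

-- ORC: the codec is a table property (NONE, ZLIB, or SNAPPY).
CREATE TABLE salaries_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM employees_stg.salaries;   -- assumes a salaries staging table like the earlier sketch

-- Parquet: session-level property picked up by INSERT/CTAS writes.
SET parquet.compression=SNAPPY;

-- Text or SequenceFile output: enable job output compression explicitly.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;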
21. Confidential and Proprietary to Daugherty Business Solutions 21
Partitioning & Bucketing
• Partitioning is useful for chronological columns that don’t have a very high number of
possible values
• Bucketing is most useful for tables that are “most often” joined together on the same
key
• Skews are useful when one or two column values dominate the table
22. Confidential and Proprietary to Daugherty Business Solutions 22
Partitioning
• Without partitioning, every query reads the entire table even when processing a subset of the data (full-table scan)
• Partitioning breaks up data horizontally by column value sets
• When partitioning, you use one or more "virtual" columns to break up the data
• Virtual columns cause directories to be created in HDFS
• Static partitioning versus dynamic partitioning
• Partitioning makes queries go fast
• Partitioning works particularly well when querying on the "virtual" column
• If queries use a variety of columns, it can be hard to decide which columns to partition by
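A hedged sketch of a partitioned version of the employees table; choosing hire_year as the partition ("virtual") column is purely illustrative and not taken from the slides:

CREATE TABLE employees_part (
  emp_no     INT,
  birth_date DATE,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  DATE
)
PARTITIONED BY (hire_year INT)   -- becomes a directory level in HDFS
STORED AS ORC;

-- Dynamic partitioning: Hive derives the partition value from the last column selected.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE employees_part PARTITION (hire_year)
SELECT emp_no, birth_date, first_name, last_name, gender, hire_date,
       year(hire_date) AS hire_year
FROM employees_stg.employees;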
23. Confidential and Proprietary to Daugherty Business Solutions 23
Bucketing
• Used to strike a balance between the number and size of files within a partition
• Breaks up data by hashed key sets into a fixed number of buckets
• When bucketing, you specify the number of buckets
• Works particularly well when a lot of queries contain joins
• Especially when the two data sets are bucketed on the join key
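A hedged sketch of bucketed tables on the join key (emp_no); the names match the dept_emp_buck and emp_buck tables used in the comparison on the next slide, but the bucket count and column types are assumptions:

SET hive.enforce.bucketing=true;   -- required on Hive 1.x; bucketing is always enforced from Hive 2.x

CREATE TABLE emp_buck (
  emp_no     INT,
  birth_date DATE,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  DATE
)
CLUSTERED BY (emp_no) INTO 8 BUCKETS   -- bucket count is illustrative
STORED AS ORC;

CREATE TABLE dept_emp_buck (
  dept_no   STRING,
  emp_no    INT,
  from_date DATE,
  to_date   DATE
)
CLUSTERED BY (emp_no) INTO 8 BUCKETS
STORED AS ORC;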
24. Confidential and Proprietary to Daugherty Business Solutions 24
Comparison
Query: select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

              Text      Partitioned   Bucketed
Query time    59.536    59.652        55.196
25. Confidential and Proprietary to Daugherty Business Solutions 25
Join Performance
Map-Side Joins
• Star schemas (e.g. dimension tables)
• Good when the table is small enough to fit in RAM
26. Confidential and Proprietary to Daugherty Business Solutions 26
Reduce Side Joins
Default Hive Join
Works with data of any size
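A hedged sketch of the settings that let Hive convert a reduce-side join into a broadcast (map-side) join automatically when the smaller input fits in memory; the 256 MB threshold follows the recommendation in the speaker notes at the end of this deck:

SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456;   -- bytes (256 MB); the 10 MB default is usually too low

-- Alternatively, the /*+ MAPJOIN(alias) */ hint used in the comparison on the
-- next slide forces a map-side join for a single query.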
27. Confidential and Proprietary to Daugherty Business Solutions
Comparison
Query: select /*+ MAPJOIN(d) */ d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

              Map-Side   Reduce
Query time    58.227     59.652
27
29. Confidential and Proprietary to Daugherty Business Solutions
• Hive uses a Cost-Based Optimizer to optimize the cost of running a query.
• Calcite applies optimizations like query rewrite, join reordering, join elimination, and
deriving implied predicates.
• Calcite will prune away inefficient plans in order to produce and select the cheapest
query plans.
• Needs to be enabled:
Set hive.cbo.enable=true;
Set hive.stats.autogather=true;
29
CBO – Cost Based Optimization
CBO Process Overview
1. Parse and validate query
2. Generate possible execution plans
3. For each logically equivalent plan,
assign a cost
4. Select the plan with the lowest
cost
Optimization Factors
• Join optimization
• Table size
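A hedged sketch of enabling the CBO and gathering the statistics it depends on; the first two SET commands come from the slide, while the remaining knobs are the commonly used companions (names vary slightly across Hive versions):

SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Statistics can also be (re)built explicitly, e.g. for the ORC table sketched earlier:
ANALYZE TABLE employees_orc COMPUTE STATISTICS;
ANALYZE TABLE employees_orc COMPUTE STATISTICS FOR COLUMNS;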
30. Confidential and Proprietary to Daugherty Business Solutions
• Consists of a long-lived daemon and a tightly integrated DAG framework.
• Handles
– Pre-fetching
– Some Query Processing
– Fine-grained column-level Access Control
30
LLAP
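A hedged sketch of pointing a session at LLAP once the daemons are running; these property names are from Hive 2.x on Tez and may differ between versions and distributions:

SET hive.execution.engine=tez;
SET hive.execution.mode=llap;        -- run query fragments inside the LLAP daemons
SET hive.llap.execution.mode=all;    -- none | map | all | only | auto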
32. Confidential and Proprietary to Daugherty Business Solutions
Daugherty Overview
Combining world-class capabilities with a local practice model. Long-term consultant employees with deep business acumen & leadership abilities. Providing more experienced consultants & leading methods/techniques/tools to:
• Accelerate results & productivity
• Provide greater team continuity
• A more sustainable/cost-effective price point
BY THE NUMBERS
• Over 1,000 employees, from management consultants to developers
• Engagements with over 75 Fortune 500 industry leaders over the past five years
• 88% of our clients are long-term, repeat/referral relationships of 10+ years
• Demonstrated 31-year track record of delivering mission-critical initiatives enabled by emerging technologies
• 9 business units
• Offices in Atlanta, Chicago, Dallas, Denver, Minneapolis, New York, and Saint Louis (HQ), plus a Development Center and a Support & Hardware Center
COLLABORATIVE – Co-staffed teams, project services, resource pools, collaborative managed services
PRAGMATIC – Pragmatic, co-staffed approach well suited to building internal competency while getting key project initiatives completed
ALTERNATIVE – Strong alternative to the global consultancies
FLEXIBLE – Flexible engagement model
32
33. Confidential and Proprietary to Daugherty Business Solutions 33
Data & Analytics - What we bring to the table
• Over 40% of Daugherty's 1,000 consultants are focused on Information Management Solutions.
• Bringing the latest thought leadership in next-generation, unified architectures that integrate structured and unstructured data ("Big Data") and applied advanced analytics into cohesive solutions.
• Strong capabilities across both existing and emerging technologies while maintaining a technology-neutral approach.
• Leveraging the latest visual design concepts to deliver interactive and user-friendly applications that drive adoption and satisfaction with solutions.
• Leader in the effective application of Agile techniques applied to Data Engineering development and business analytics, with full data-lifecycle methods & techniques from business definition through development and ongoing support.
• Building and supporting mission-critical platforms for many Fortune 500 companies in multi-year engagements, using a flexible support model including Collaborative Managed Services.
DATA & ANALYTICS
• Data & Analytics Strategy & Roadmap
• Building Analytic Solutions
• Analytics Competency Development
• Big Data / Next Gen Architecture
• Business Analytics and Insights
Methods / Tools / Techniques
• 12-domain EIM Blueprint/Roadmap framework that manages technical complexity, accelerates initiatives, and focuses on delivering the greatest business analytics impact quickly
• Highly accurate BI dimensional estimator that provides predictability in investments and time to market
• Analytic strategy framework that aligns people, process, and technology components to deliver business value
• Analytic governance reference model that mitigates risk and provides guardrails for self-service adoption
• Business value models to calculate the value and ROI of investments in Data & Analytics initiatives
• Reference architecture for a modern data & analytics platform
• Dashboard design best practices that transform complex business KPIs into a rich, immersive design
• Bi-modal Data-as-a-Service operating model that integrates Agile development with a service-oriented organization design
PROGRAM & PROJECT MANAGEMENT
• Program & Project Planning
• Program & Project Management
• Business Case Development
• PMO Optimization
• M&A Integration
33
Editor's Notes
An updated version of ORC was released in HDP 2.6.3 with better support for vectorization.
Although compression can greatly optimize processing performance, not all compression codecs supported on Hadoop are splittable. Since the MapReduce framework splits data for input to multiple tasks, having a non-splittable compression codec provides an impediment to efficient processing. If files cannot be split, that means the entire file needs to be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides. For this reason, splittability is a major consideration in choosing a compression codec, as well as file format. We’ll discuss the various compression codecs available for Hadoop, and some considerations in choosing between them.
Snappy
Snappy is a compression codec developed at Google for high compression speeds with reasonable compression. Although Snappy doesn’t offer the best compression sizes, it does provide a good trade-off between speed and size. Processing performance with Snappy can be significantly better than other compression formats. An important thing to note is that Snappy is intended to be used with a container format like SequenceFiles or Avro, since it’s not inherently splittable.
LZO
LZO is similar to Snappy in that it’s optimized for speed as opposed to size. Unlike Snappy, LZO compressed files are splittable, but this requires an additional indexing step. This makes LZO a good choice for things like plain text files that are not being stored as part of a container format. It should also be noted that LZO’s license prevents it from being distributed with Hadoop, and requires a separate install, unlike Snappy, which can be distributed with Hadoop.
Gzip
Gzip provides very good compression performance (on average, about 2.5 times the compression that'd be offered by Snappy), but its write speed performance is not as good as Snappy's (on average, about half of that offered by Snappy). Gzip usually performs almost as well as Snappy in terms of read performance. Gzip is also not splittable, so it should be used with a container format. It should be noted that one reason Gzip is sometimes slower than Snappy for processing is that, due to Gzip-compressed files taking up fewer blocks, fewer tasks are required for processing the same data. For this reason, when using Gzip, using smaller blocks can lead to better performance.
bzip2
bzip2 provides excellent compression performance, but can be significantly slower than other compression codecs such as Snappy in terms of processing performance. For this reason, it's not an ideal codec for Hadoop storage, unless the primary need is for reducing the storage footprint. Unlike Snappy and Gzip, bzip2 is inherently splittable. In the examples we have seen, bzip2 will normally compress around 9% better than Gzip in terms of storage space. However, this extra compression comes with a significant read/write performance cost. This performance difference will vary with different machines, but in general it's about 10x slower than Gzip. An example of such a use case is where Hadoop is being used mainly for active archival purposes.
Multi-layer partitioning is possible but often not efficient – the number of partitions becomes too large and will overwhelm the metastore.
• Limit the number of partitions; fewer may be better – 1000 partitions will often perform better than 10000
• Hadoop likes big files – avoid creating partitions with mostly small files
• Only use partitioning when:
– Data is very large and there are lots of table scans
– Data is queried against a particular column frequently
– Column data has low cardinality
The map-side join can only be achieved if it is possible to join the records by key during the read of the input files, i.e., before the map phase. Additionally, for this to work the input files need to be sorted by the same join key, and both inputs need to have the same number of partitions.
Meeting these strict constraints is commonly hard to achieve. The most likely scenario for a map-side join is when both input tables were created by (different) MapReduce jobs using the same number of reducers and the same (join) key.
• Set hive.auto.convert.join=true – Hive then automatically uses a broadcast join, if possible
– Small tables are held in memory by all nodes
• Used for star-schema type joins common in data warehousing use cases
• hive.auto.convert.join.noconditionaltask.size determines the data size for automatic conversion to a broadcast join:
– The default of 10 MB is too low (check your default)
– Recommended: 256 MB for a 4 GB container