Study after study shows that data scientists spend 50 to 90 percent of their time gathering and preparing data. In many large organizations the problem is exacerbated by data being stored on a variety of systems with different structures and architectures. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in place using standard ANSI SQL, without having to define complex schemata or rebuild their entire data infrastructure. In this talk I will introduce the audience to Apache Drill, including some hands-on exercises, and present a case study of how Drill can be used to query a variety of data sources. The presentation will cover:
* How to explore and merge datasets in different formats
* Using Drill to interact with other platforms such as Python
* Exploring data stored on different machines
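As a taste of what this looks like in practice, here is a minimal sketch of the kind of query Drill makes possible; the file names and columns are hypothetical, and the CSV reader is assumed to be configured to extract the header row:

-- Join a CSV file of orders to a JSON file of customers, in place, with no ETL
SELECT c.name, SUM(CAST(o.total AS FLOAT)) AS revenue
FROM dfs.`/data/orders.csv` o
JOIN dfs.`/data/customers.json` c
  ON o.customer_id = c.id
GROUP BY c.name
ORDER BY revenue DESC;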
Data Exploration with Apache Drill: Day 1 (Charles Givre)
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course materials for this presentation are available at https://github.com/cgivre/data-exploration-with-apache-drill
Data Exploration with Apache Drill: Day 2 (Charles Givre)
Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course materials for this presentation are available at https://github.com/cgivre/data-exploration-with-apache-drill
The Extract-Transform-Load (ETL) process is one of the most time-consuming tasks facing anyone who wishes to analyze data. Imagine if you could quickly, easily, and scalably merge and query data without having to spend hours in data prep. Well, you don't have to imagine it: you can, with Apache Drill. In this hands-on, interactive presentation Mr. Givre will show you how to unleash the power of Apache Drill and explore your data without any ETL process.
This document discusses Apache Drill, an open source SQL query engine for analyzing data in non-relational data stores and file formats such as JSON, CSV, and the various Hadoop formats. It provides an overview of Drill's key features, such as its ability to query diverse data sources through a simple SQL interface without requiring schemas, its SQL-on-everything model, its high performance through columnar storage and execution, and its ability to scale from a single machine to large clusters. The document also demonstrates how to install Drill, configure data sources, and run queries against sample Yelp data to analyze reviews, users, and businesses.
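For instance, the kind of query run against the Yelp business file looks roughly like this; the file path is hypothetical, but the state and stars fields do exist in the public Yelp dataset:

-- Businesses and average review score per state, straight off the raw JSON
SELECT state, COUNT(*) AS businesses, AVG(stars) AS avg_stars
FROM dfs.`/data/yelp/yelp_academic_dataset_business.json`
GROUP BY state
ORDER BY businesses DESC
LIMIT 10;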
Drilling Cyber Security Data With Apache Drill (Charles Givre)
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
Join our experts Neeraja Rentachintala, Sr. Director of Product Management and Aman Sinha, Lead Software Engineer and host Sameer Nori in a discussion about putting Apache Drill into production.
The open source project Apache Drill gives you SQL-on-Hadoop, but with some big differences. The biggest is that Drill extends ANSI SQL from a strongly typed language into one that also supports late binding, without losing performance. This allows Drill to process complex structured data like JSON in addition to relational data. By dynamically generating a schema at read time that matches the data types and structures observed in the data, Drill gives you both self-service agility and speed.
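A sketch of what that late binding looks like in practice; the file and its fields are hypothetical, but note that no table definition or schema is declared anywhere:

-- Drill infers types and structure at read time, including nested objects and arrays
SELECT t.name, t.address.city AS city, FLATTEN(t.orders) AS order_rec
FROM dfs.`/data/customers.json` t
WHERE t.address.state = 'MD';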
Drill also introduces a view-based security model that uses file system permissions to control access to data at an extremely fine-grained level, making secure access easy to manage. These extensions have huge practical impact when it comes to writing real applications.
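As a sketch of the view-based model (workspace and file names hypothetical): a view exposes only selected columns, and because Drill persists the view as a file, ordinary file system permissions on that file decide who may use it:

-- Analysts query the view; only the view file needs to be readable to them
CREATE VIEW dfs.views.`customers_public` AS
SELECT name, city, state
FROM dfs.secure.`customers.json`;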
In these slides, Tugdual Grall, Technical Evangelist at MapR, gives several practical examples of how Drill makes it easy to analyze data, using SQL in your Java application with a simple JDBC driver.
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C... (The Hive)
SQL is one of the most widely used languages to access, analyze, and manipulate structured data. As Hadoop gains traction within enterprise data architectures across industries, the need for SQL for both structured and loosely structured data on Hadoop is growing rapidly. Apache Drill started off with the audacious goal of delivering consistent, millisecond ANSI SQL query capability across a wide range of data formats. At a high level, this translates to two key requirements: schema flexibility and performance. This session will delve into the architectural details of delivering these two requirements and will share with the audience the nuances and pitfalls we ran into while developing Apache Drill.
This document provides an overview of Apache Drill and how it enables ad-hoc querying and analysis of structured and unstructured data stored in Hadoop. Some key points:
1) Apache Drill allows for schema-free SQL queries against data in HDFS, HBase, Hive and other data sources, empowering self-service data exploration and "zero-day" analytics.
2) Drill's queries can handle complex, nested data through features like automatic schema discovery, repeated value support, and SQL extensions.
3) Examples show how Drill provides a familiar SQL interface and tooling to analyze JSON, text and other file formats to gain insights from large volumes of real-time data.
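As an illustration of those points, a single Drill statement can span two of the sources named above; the table and file names here are hypothetical:

-- One query across a Hive table and raw JSON sitting in HDFS
SELECT o.order_id, o.amount, c.segment
FROM hive.`orders` o
JOIN dfs.`/logs/customers.json` c
  ON o.cust_id = c.id
WHERE o.amount > 100;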
Introduction to Apache Drill - interactive query and analysis at scale (MapR Technologies)
This document introduces Apache Drill, an open source interactive analysis engine for big data. It was inspired by Google's Dremel and supports standard SQL queries over various data sources like Hadoop and NoSQL databases. Drill provides low-latency interactive queries at scale through its distributed, schema-optional architecture and support for nested data formats. The talk outlines Drill's capabilities and status as a community-driven project under active development.
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. No prior Hadoop experience is needed.
Thursday, May 8th, 02:00pm-02:50pm
Analyzing Real-World Data with Apache Drill (tshiran)
This document provides an overview of Apache Drill, an open source SQL query engine for analysis of both structured and unstructured data. It discusses how Drill allows for schema-free querying of data stored in Hadoop, NoSQL databases and other data sources using SQL. The document outlines some key features of Drill, such as its flexible data model, ability to discover schemas on the fly, and distributed execution architecture. It also presents examples of using Drill to analyze real-world data from sources like HDFS, MongoDB and more.
Apache Drill (http://incubator.apache.org/drill/) is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel technology. It is designed to scale to thousands of servers and to process petabytes of data in seconds. Since its inception in mid-2012, Apache Drill has gained widespread interest in the community, attracting hundreds of interested individuals and companies. In the talk we discuss how Apache Drill enables ad-hoc interactive query at scale, walk through typical use cases, and delve into Drill's architecture, data flow, and query languages, as well as the data sources it supports.
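In that spirit, a minimal sketch of querying MongoDB directly through Drill's mongo storage plugin; the collection mirrors the restaurants sample used later in this deck, and the cuisine filter is illustrative:

-- Query a MongoDB collection as if it were a SQL table
SELECT name, borough, cuisine
FROM mongo.test.`restaurants`
WHERE cuisine = 'Italian'
LIMIT 10;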
Hive is a data warehouse system built on top of Hadoop that allows users to query large datasets using SQL. It is used at Facebook to manage over 15TB of new data added daily across a 300+ node Hadoop cluster. Key features include using SQL for queries, extensibility through custom functions and file formats, and optimizations for performance like predicate pushdown and partition pruning.
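A quick sketch of the partition pruning mentioned above, in Hive SQL with a hypothetical table: because dt is a partition column, the filter lets Hive skip every other partition's files entirely:

-- dt is a partition column, so only the 2015-03-01 partition is scanned
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE dt = '2015-03-01'
GROUP BY page;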
Cassandra and Spark, closing the gap between no sql and analytics codemotio... (Duyhai Doan)
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
Sasi, cassandra on full text search ride (Duyhai Doan)
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
Apache Drill is a new Apache Incubator project. Its goal is to provide a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel technology, it aims to process trillions of records in seconds. We will cover the goals of Apache Drill, its use cases, and how it relates to Hadoop, MongoDB and other large-scale distributed systems. We'll also talk about details of the architecture, points of extensibility, data flow and our first query languages (DrQL and SQL).
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016 (Duyhai Doan)
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
Fast track to getting started with DSE Max @ ING (Duyhai Doan)
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
The document discusses the MapR Big Data platform and Apache Drill. It provides an overview of MapR's M7 which makes HBase enterprise-grade by eliminating compactions and enabling a unified namespace. It also describes Apache Drill, an interactive query engine inspired by Google's Dremel that supports ad-hoc queries across different data sources at scale through its logical and physical query planning. The document demonstrates simple queries and provides details on contributing to and using Apache Drill.
Spark Cassandra Connector: Past, Present, and Future (Russell Spitzer)
The Spark Cassandra Connector allows integration between Spark and Cassandra for distributed analytics. Previously, integrating Hadoop and Cassandra required complex code and configuration. The connector maps Cassandra data distributed across nodes based on token ranges to Spark partitions, enabling analytics on large Cassandra datasets using Spark's APIs. This provides an easier method for tasks like generating reports, analytics, and ETL compared to previous options.
Spark cassandra integration, theory and practice (Duyhai Doan)
This document discusses Spark and Cassandra integration. It begins with an introduction to Spark, describing it as a general data processing framework that is faster than Hadoop. It then discusses the Cassandra database and its data distribution using token ranges. The document provides examples of using the Spark/Cassandra connector for reading and writing data between Spark and Cassandra, including techniques for ensuring data locality. It discusses best practices for cluster deployment and handling failures while maintaining data locality. Finally, it presents some use cases for using Spark/Cassandra including data cleaning, schema migration, and analytics.
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2shXBpj
This CloudxLab Apache Spark - Loading & Saving data tutorial helps you to understand Loading & Saving data in Apache Spark in detail. Below are the topics covered in this tutorial:
1) Common Data Sources
2) Common Supported File Formats
3) Handling Text Files using Scala
4) Loading CSV
5) SequenceFiles
6) Object Files
7) Hadoop Input and Output Format - Old and New API
8) Protocol Buffers
9) File Compression
10) Handling LZO
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
The document discusses how the Spark Cassandra Connector works. It explains that the connector uses information about how data is partitioned in Cassandra nodes to generate Spark partitions that correspond to the token ranges in Cassandra. This allows data to be read from Cassandra in parallel across the Spark partitions. The connector also supports automatically pushing down filter predicates to the Cassandra database to reduce the amount of data read.
Lightning fast analytics with Spark and Cassandra (Rustam Aliyev)
Spark is an open-source cluster computing framework that provides a fast and general engine for large-scale data processing. It is up to 100x faster than Hadoop for certain applications. The Cassandra Spark driver allows accessing Cassandra tables as resilient distributed datasets (RDDs) in Spark, enabling analytics like joins, aggregations, and machine learning on Cassandra data. It maps Cassandra data types to Scala types and rows to case classes. This allows querying, transforming, and saving data to and from Cassandra using Spark's APIs and its optimizations for performance and fault tolerance.
Cost-based query optimization in Apache Hive (Julian Hyde)
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework.
Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.
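One way to see CBO at work, assuming a Hive 0.13+ session: hive.cbo.enable is the real switch, while the tables below are purely illustrative; toggling it and rerunning EXPLAIN shows how the chosen plan changes:

-- Enable cost-based optimization, then inspect the plan it produces
SET hive.cbo.enable=true;
EXPLAIN
SELECT c.state, SUM(o.amount)
FROM orders o JOIN customers c ON o.cust_id = c.id
GROUP BY c.state;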
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive and other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10,000 nodes.
Merlin: The Ultimate Data Science Environment (Charles Givre)
Merlin is a virtual computing environment developed by data scientists for data scientists. Merlin is free and open source, and contains a suite of all the best open source data science tools including data visualization tools, programming languages, big data tools, databases, notebooks, IDEs, and much more. The goal of Merlin is to let data scientists do data science work, not system administration.
What Does Your Smart Car Know About You? Strata London 2016 (Charles Givre)
In the last few years, auto makers and technology companies have introduced a variety of devices to connect cars to the Internet and use this connectivity to gather data about the vehicles’ activity, but these connected cars gather a considerable amount of data about their owners’ activities beyond what one might expect. In aggregate and combined with other datasets, this data represents a significant degradation of personal privacy as well as a potential security risk. As auto insurers and local governments start to require this data collection, consumers should be aware of the security risks as well as the potential privacy invasions associated with this unique type of data collection.
In a follow-up to his 2015 session at Strata + Hadoop World NYC, Charles Givre examines data gathered from sensors in automobiles. Charles focuses on what kinds of data cars are gathering and asks critical questions about whether the benefits this data provides outweigh the risks and cost to personal privacy—the inevitable result of this data collection.
Ted Dunning presents information on Drill and Spark SQL. Drill is a query engine that operates on batches of rows in a pipelined and optimistic manner, while Spark SQL provides SQL capabilities on top of Spark's RDD abstraction. The document discusses the key differences in their approaches to optimization, execution, and security. It also explores opportunities for unification by allowing Drill and Spark to work together on the same data.
Strata NYC 2015 What does your smart device know about you? (Charles Givre)
Devices that make up the Internet of Things (IoT) collect a monumental amount of data about their owners. In most cases, the data they gather benefits the owner of the device and performs some useful purpose for them. However, when viewed in aggregate, the data gathered can reveal an enormous amount of information about the devices’ owner that can be very invasive if this information were to fall into the wrong hands.
Over the course of several months, Charles Givre did an experiment in which he collected data from several IoT devices including a Nest Thermostat, the Automatic Car dongle, the Wink hub, and a few others in order to determine what could be learned about the owner of the devices. Givre approached this experiment like a law enforcement or intelligence investigation, beginning with a bit of seed knowledge about the target, and built a profile about the target using the data that was available via these devices’ APIs and the data they transmit over the internet.
This presentation is not about how to bypass the devices’ security features, hack them, or how to mess with people by randomly turning off their A/C; but rather focuses on the consequences of IoT devices collecting and storing data.
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle new formats like JSON, Parquet, and ORC, as well as the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by handling schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
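One concrete consequence of treating all of these formats uniformly is that converting between them becomes a single statement. A sketch using Drill's CTAS, with a hypothetical CSV source; store.format and the writable dfs.tmp workspace are standard Drill features:

-- Convert a CSV file to Parquet in one shot
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`sales_parquet` AS
SELECT * FROM dfs.`/data/sales.csv`;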
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the open source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and to process petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
Drill into Drill – How Providing Flexibility and Performance is Possible (MapR Technologies)
Learn how Drill achieves high performance along with flexibility and ease of use. Topics include: first-read planning and statistics; flexible code generation depending on workload; code optimization and planning techniques; dynamic schema subsets; advanced memory use and moving between Java and C; and making static typing appear dynamic through any-time and multi-phase planning.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
This document discusses the 2011 budget planning and execution of the Ministry of Transportation, specifically concerning the Directorate General of Sea Transportation. Key points include the publication of the 2011 Operational Activity Guidelines, the management of the 2011 budget, and the briefing activities held for financial management officials. It also notes that a number of activities have already been contracted out.
A decree of the Minister of Transportation regulates the organization and working procedures of Sea and Coast Guard Bases, covering the duties, functions, organizational structure, and working procedures of the leadership and staff of sea and coast guard bases.
This document discusses the monitoring of illegal drug trafficking in Indonesian ports and waters. It explains the applicable regulations, the agencies involved, and the conditions needed to improve the monitoring of illegal drug trafficking in Indonesian ports and waters.
The 2002 organizational structure of the Tanjung Priok Main Port Administration Office consists of an Administrative Division and several divisions overseeing related sections that carry out the supervision and regulation of shipping traffic in the port.
The analysis of unstructured content and the growing demand for indicators that quickly summarize events as they unfold are two major trends that involve not only technical knowledge but also business knowledge that can add value to the companies, institutions, and people who need it.
Apache Storm is one of the paradigms born with the real-time era in mind. We will describe a business case that presents the challenge of capturing information and delivering actionable knowledge as quickly as possible. We will cover business, technology, and philosophical matters related to information.
The document discusses the International Ship and Port Facility Security (ISPS) Code. The ISPS Code was established as an international framework for cooperation between governments, agencies, local administrations, shipping and port industries to detect security threats and take preventative measures against security incidents. It sets out responsibilities for all involved parties at national and international levels to enhance maritime security. The goals are to ensure effective information collection and sharing related to security, provide a security assessment methodology, and ensure adequate and proportional security measures are in place.
The condition of a ship that meets the requirements for ship safety, prevention of water pollution from ships, manning, load lines, loading, crew welfare and passenger health, the ship's legal status, safety management and pollution prevention on board, and ship security management for sailing in certain waters.
Companion slides for Stormpath CTO and Co-Founder Les Hazlewood's Elegant REST Design Webinar. This presentation covers all the RESTful best practices learned building the Stormpath APIs. Whether you’re writing your first API, or just need to figure out that last piece of the puzzle, this is a great opportunity to learn more.
Stormpath is a User Management API that reduces development time with instant-on, scalable user infrastructure. Stormpath's intuitive API and expert support make it easy for developers to authenticate, manage and secure users and roles in any application.
Workshop: EmberJS - In Depth
- Ember Data - Adapters & Serializers
- Routing and Navigation
- Templates
- Services
- Components
- Integration with 3rd party libraries
Presented by engineers Mario García and Marc Torrent
This document provides an overview of DataMapper, an object-relational mapper (ORM) library for Ruby applications. It summarizes DataMapper's main features such as associations, migrations, database adapters, naming conventions, validations, custom types and stores. The document also provides examples of how to use DataMapper with different databases, import/export data, and validate models for specific contexts.
ManageIQ currently runs on Ruby on Rails 3. Aaron "tenderlove" Patterson presents his effort to migrate to RoR 4, which entails some changes in the code to take advantage of the latest advances in RoR.
For more on ManageIQ, see http://manageiq.org/
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z... (Data Con LA)
Data transformation has traditionally required expertise in specialized data platforms and typically been restricted to the domain of IT. A domain specific language (DSL) separates the user’s intent from a specific implementation, while maintaining expressivity. A user interface can be used to produce these expressions, in the form of suggestions, without requiring the user to manually write code. This higher level interaction, aided by transformation previews and suggestion ranking allows domain experts such as data scientists and business analysts to wrangle data while leveraging the optimal processing framework for the data at hand.
Kief Morris - Infrastructure is terrible (Thoughtworks)
Why is nearly every infrastructure project I've run across a big ball of mud? We're still in the early days of infrastructure as code tooling, so we're struggling with messy glue code, configuration files, and weird custom scripts and tools. What can you do on your project to cope with the current state of tooling? And what should we, as an industry, do to level up?
AMS adapters in Rails 5 are important for building a wonderful RESTful API because they provide hypermedia controls and links in the API response. The JSON API adapter in particular generates responses that follow the JSON API specification by including links, relationships, pagination metadata, and embedded resources. Adapters allow Rails to render different representation formats like JSON API, customize the API structure, and improve the developer experience of the API through conventions like hypermedia controls and links. However, there is still work to be done on adapters, like improving deserialization, documentation, and better supporting JSON API conventions fully.
Get Soaked - An In Depth Look At PHP Streams (Davey Shafik)
The document provides an in-depth overview of PHP streams, including stream basics, input/output, filters, wrappers, transports, contexts, and examples of using streams for HTTP requests, FTP operations, and a hypothetical Twitter client. It explains how streams allow modification of data in-flight and provides built-in and custom wrappers for various protocols.
Full-Text Search Explained - Philipp Krenn - Codemotion Rome 2017 (Codemotion)
Today’s applications are expected to provide powerful full-text search. But how does that work in general, and how do I implement it on my site or in my application? Actually, this is not as hard as it sounds at first. This talk covers:
* How full-text search works in general and what the differences to databases are.
* How the score or quality of a search result is calculated.
* How to implement this with Elasticsearch.
Attendees will learn how to add common search patterns to their applications without breaking a sweat.
Architecture | Busy Java Developers Guide to NoSQL | Ted Neward (JAX London)
2011-11-02 | 03:45 PM - 04:35 PM |
The NoSQL movement has stormed onto the development scene, and it’s left a few developers scratching their heads, trying to figure out when to use a NoSQL database instead of a regular database, much less which NoSQL database to use. In this session, we’ll examine the NoSQL ecosystem, look at the major players, how they compare and contrast, and what sort of architectural implications they have for software systems in general.
The workshop will present how to combine tools to quickly query, transform and model data using command line tools. The goal is to show that command line tools are efficient at handling reasonable sizes of data and can accelerate the data science process. We will show that in many instances, command line processing ends up being much faster than ‘big-data’ solutions. The content of the workshop is derived from the book of the same name (http://datascienceatthecommandline.com/). In addition, we will cover vowpal-wabbit (https://github.com/JohnLangford/vowpal_wabbit) as a versatile command line tool for modeling large datasets.
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) (BigDataEverywhere)
Jayesh Thakrar, Senior Systems Engineer, Conversant
The venerable HBase shell is often regarded as a simple utility to perform basic DDL and maintenance activities. However, it is in fact a powerful, interactive programming environment, primarily due to the JRuby engine under the covers. In this presentation, I'll describe its JRuby heritage and show some of the things that can be done with irb (the interactive Ruby shell), as well as show how to exploit JRuby and Java integration via concrete working examples. In addition, I will demonstrate how the shell can be used in Hadoop streaming to quickly perform complex and large-volume batch jobs.
Google apps script database abstraction exposed version (Bruce McPherson)
This document describes a database abstraction library for Google Apps Script that provides a consistent API for NoSQL databases. It allows code to be reused across different database backends by handling queries, authentication, caching, and more transparently. The library exposes the capabilities through a JSON REST API that can be accessed from other programming languages. It also includes a VBA client library that translates VBA calls into requests to the JSON API, allowing VBA code to access databases in the same way as Google Apps Script.
Building a friendly .NET SDK to connect to Space (Maarten Balliauw)
Space is a team tool that integrates chats, meetings, git hosting, automation, and more. It has an HTTP API to integrate third party apps and workflows, but it's massive! And slightly opinionated.
In this session, we will see how we built the .NET SDK for Space, and how we make that massive API more digestible. We will see how we used code generation, and incrementally made the API feel more like a real .NET SDK.
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of columnar storage and Dremel.
Extending spark ML for custom models now with python! (Holden Karau)
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in High Performance Spark and the Sparkling ML project.
Java Web Programming [5/9] : EL, JSTL and Custom Tags (IMC Institute)
This document provides an overview of Expression Language (EL), JSTL (JSP Standard Tag Library) 1.1, and custom tags. It describes EL expressions, implicit objects, and attributes. It explains core JSTL tags for looping, conditionals, and URL manipulation. It also discusses using JSTL tags to connect to a database and format output. Finally, it outlines how to create a custom tag library with a TLD file, tag handler class, and JSP file to implement a simple tag.
A Drupalcon Chicago presentation for coders/developers about web application security in the Drupal system. Covering Cross Site Scripting and Cross Site Request Forgeries.
Distributed Queries in IDS: New features (Keshav Murthy)
Learn about the latest functionality relating to distributed queries delivered in IBM Informix® Dynamic Server (IDS) 11 and 11.5. This talk will provide an overview of distributed queries, then jump into a deep dive on the latest functions and how you can benefit from implementing distributed queries in your solutions.
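For readers unfamiliar with IDS distributed queries, the basic shape is a cross-server table reference of the form database@server:table. A hedged sketch with hypothetical database and server names:

-- Join a local table to one on a remote IDS instance
SELECT c.customer_num, o.order_num
FROM customer c, stores_demo@remote_ids:orders o
WHERE c.customer_num = o.customer_num;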
These slides give a full overview of reading different types of data in R. Thanks to Coursera, the material is quite comprehensive, and I made some additional changes too.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai... (Kaxil Naik)
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Build applications with generative AI on Google Cloud (Márton Kodok)
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models offers developers building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, and they come in different versions. We are going to cover how to use them via API to:
- execute prompts in text and chat
- cover multimodal use cases with image prompts
- finetune and distill to improve knowledge domains
- run function calls with foundation models to optimize them for specific tasks
At the end of the session, developers will understand how to innovate with generative AI and develop apps that follow current generative AI industry trends.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Learn SQL from basic queries to advanced queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Querying Across Silos
SELECT t1.Borough,
       t1.markets,
       t2.rests,
       CAST( t1.markets AS FLOAT ) / CAST( t2.rests AS FLOAT ) AS ratio
FROM (
  SELECT Borough, COUNT( `Farmers Markets Name` ) AS markets
  FROM `farmers_markets.csv`
  GROUP BY Borough
) t1
JOIN (
  SELECT borough, COUNT( name ) AS rests
  FROM mongo.test.`restaurants`
  GROUP BY borough
) t2
ON t1.Borough = t2.borough
ORDER BY ratio DESC;
Querying Drill
Plugin  Description
cp      Queries files on the Java classpath
dfs     File system; can connect to remote file systems such as Hadoop
hbase   Connects to HBase
hive    Integrates Drill with the Apache Hive metastore
kudu    Provides a connection to Apache Kudu
mongo   Connects to MongoDB
RDBMS   Provides a connection to relational databases such as MySQL, Postgres, Oracle, and others
S3      Provides a connection to an S3 bucket
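Each plugin name becomes the first part of the table path in a query's FROM clause. A few illustrative sketches (the dfs path and the mysql plugin name are assumptions; configure them to match your environment):

SELECT * FROM cp.`employee.json` LIMIT 5;        -- sample file that ships on Drill's classpath
SELECT * FROM dfs.`/data/logs/app.json`;         -- a file on the local or distributed file system
SELECT * FROM mongo.test.`restaurants` LIMIT 5;  -- MongoDB <database>.<collection>
SELECT * FROM mysql.stats.`batting` LIMIT 5;     -- an RDBMS plugin configured under the name mysql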
In Class Exercise: Create a Workspace
In this exercise we are going to create a workspace called 'drillworkshop', which we will use for future exercises.
1. First, download all the files from https://github.com/cgivre/drillworkshop
and put them in a folder of your choice on your computer. Remember the
complete file path.
2. Open the Drill Web UI and go to Storage->dfs->update
3. Paste the following into the 'workspaces' section and click Update.
"drillworkshop": {
"location": “<path to your files>",
"writable": true,
"defaultInputFormat": null
}
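Once the configuration is saved, files in that folder can be queried with the dfs.drillworkshop prefix, for example:

SELECT *
FROM dfs.drillworkshop.`baltimore_salaries_2015.csv`
LIMIT 5;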
Querying Drill
SELECT columns[0] AS name,
columns[1] AS JobTitle,
columns[2] AS AgencyID,
columns[3] AS Agency,
columns[4] AS HireDate,
columns[5] AS AnnualSalary,
columns[6] AS GrossPay
FROM dfs.drillworkshop.`baltimore_salaries_2015.csv`
LIMIT 10
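The columns array is an ordinary expression, so it can also appear in WHERE or ORDER BY clauses. A quick sketch (the filter value is hypothetical):

SELECT columns[0] AS name,
       columns[6] AS GrossPay
FROM dfs.drillworkshop.`baltimore_salaries_2015.csv`
WHERE columns[1] = 'POLICE OFFICER'
LIMIT 10;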
Querying Drill
File Extension  File Type
.psv            Pipe-separated values
.csv            Comma-separated values
.csvh           Comma-separated values with a header row
.tsv            Tab-separated values
.json           JavaScript Object Notation (JSON) files
.avro           Avro files (experimental)
.seq            Sequence files
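Because .csvh files have their header row extracted automatically, columns can be addressed by name rather than through the columns array. A sketch, assuming the salary file has been saved with a .csvh extension:

SELECT name, JobTitle, AnnualSalary
FROM dfs.drillworkshop.`baltimore_salaries_2015.csvh`
LIMIT 5;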
Querying Drill
Option         Description
comment        Character that marks a comment line
escape         The escape character
delimiter      The character used to delimit fields
quote          The character used to enclose fields
skipFirstLine  Whether to skip the first line of the file (true/false)
extractHeader  Whether to read column names from the CSV file's header row
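These options normally live in the storage plugin's JSON configuration, but they can also be applied per query with Drill's table-function syntax. A sketch (filename assumed from the workshop materials):

SELECT *
FROM table(dfs.drillworkshop.`baltimore_salaries_2015.csv`
           (type => 'text', fieldDelimiter => ',', extractHeader => true))
LIMIT 5;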
Aggregate Functions
Function                      Argument Type              Return Type
AVG( expression )             integer or floating point  floating point
COUNT( * )                                               BIGINT
COUNT( [DISTINCT] <expression> )  any                    BIGINT
MIN/MAX( <expression> )       any numeric or date        same as argument
SUM( <expression> )           any numeric or interval    same as argument
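As a quick sketch of these functions against the workshop data (column name taken from the earlier query):

SELECT COUNT(*) AS total_rows,
       COUNT( DISTINCT JobTitle ) AS distinct_titles
FROM dfs.drillworkshop.`*.csvh`;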
Querying Drill
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index -1. Error: Missing function implementation: [castINT(BIT-OPTIONAL)]. Full expression: --UNKNOWN EXPRESSION--..
Fragment 0:0 [Error Id: af88883b-f10a-4ea5-821d-5ff065628375 on 10.251.255.146:31010]
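An error like this typically means Drill hit a value it could not cast: here a BOOLEAN (BIT) being cast to INT, a conversion Drill does not implement, which often surfaces when a column's type changes partway through a file. One workaround sketch (the column and file names are hypothetical) replaces the unsupported CAST with CASE:

SELECT CASE WHEN is_active THEN 1 ELSE 0 END AS is_active_int  -- avoids CAST(is_active AS INT)
FROM dfs.drillworkshop.`flags.json`;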
Querying Drill
Function Return Type
BYTE_SUBSTR BINARY or VARCHAR
CHAR_LENGTH INTEGER
CONCAT VARCHAR
ILIKE BOOLEAN
INITCAP VARCHAR
LENGTH INTEGER
LOWER VARCHAR
LPAD VARCHAR
LTRIM VARCHAR
POSITION INTEGER
REGEXP_REPLACE VARCHAR
RPAD VARCHAR
RTRIM VARCHAR
STRPOS INTEGER
SUBSTR VARCHAR
TRIM VARCHAR
UPPER VARCHAR
In Class Exercise: Clean the Field
In this exercise you will use one of the string functions to remove the dollar sign from the 'AnnualPay' column.
Complete documentation can be found here:
https://drill.apache.org/docs/string-manipulation/
SELECT LTRIM( AnnualPay, '$' ) AS annualPay
FROM dfs.drillworkshop.`*.csvh`
Drill Data Types
Data Type       Description
BIGINT          8-byte signed integer
BINARY          Variable-length byte string
BOOLEAN         true/false
DATE            yyyy-mm-dd
DOUBLE / FLOAT  8- or 4-byte floating point number
INTEGER         4-byte signed integer
INTERVAL        A day-time or year-month interval
TIME            HH:mm:ss
TIMESTAMP       JDBC timestamp
VARCHAR         UTF-8 encoded variable-length string
In Class Exercise: Convert to a Number
In this exercise you will use the CAST() function to convert AnnualPay into a number.
Complete documentation can be found here:
https://drill.apache.org/docs/data-type-conversion/#cast
SELECT CAST( LTRIM( AnnualPay, '$' ) AS FLOAT ) AS annualPay
FROM dfs.drillworkshop.`*.csvh`
Putting it together: the average salary and head count for each job title:
SELECT JobTitle,
       AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary,
       COUNT( DISTINCT name ) AS number
FROM dfs.drillworkshop.`*.csvh`
GROUP BY JobTitle
ORDER BY avg_salary DESC
Problem: You have multiple log files which you would like to analyze.
• In the sample data files, there is a folder called 'logs', which contains log files organized into subdirectories by year.
Directory Functions
Function              Description
MAXDIR(), MINDIR()    Limit the query to the first or last directory
IMAXDIR(), IMINDIR()  Limit the query to the first or last directory, in case-insensitive order

WHERE dir<n> = MAXDIR('<plugin>.<workspace>', '<filename>')
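Applied to the workshop's logs folder, a sketch (assuming the year-named subdirectories described above) that restricts a query to the most recent year:

SELECT *
FROM dfs.drillworkshop.`logs`
WHERE dir0 = MAXDIR('dfs.drillworkshop', 'logs');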
In Class Exercise: Find the total number of items sold in each year and the total dollar sales in each year.
HINT: Don't forget to CAST() the fields to appropriate data types. (dir0 refers to the first level of subdirectories beneath the queried folder, in this case the year.)
SELECT dir0 AS data_year,
SUM( CAST( item_count AS INTEGER ) ) as total_items,
SUM( CAST( amount_spent AS FLOAT ) ) as total_sales
FROM dfs.drillworkshop.`logs/`
GROUP BY dir0
Querying nested JSON: FLATTEN( data ) turns each element of the JSON array data into its own row, which can then be indexed by position:
SELECT raw_data[8] AS name, raw_data[9] AS job_title
FROM
(
  SELECT FLATTEN( data ) AS raw_data
  FROM dfs.drillworkshop.`baltimore_salaries.json`
)
In Class Exercise: Using the JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
HINT: Don't forget to CAST() the columns.
HINT 2: GROUP BY does NOT support aliases.
Solution:
SELECT raw_data[9] AS job_title,
AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary,
COUNT( DISTINCT raw_data[8] ) AS person_count
FROM
(
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillworkshop.`baltimore_salaries.json`
)
GROUP BY raw_data[9]
ORDER BY avg_salary DESC
CREATE TABLE <table_name> AS <query>
CREATE TABLE dfs.drillworkshop.`salary_summary` AS
SELECT JobTitle,
       AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary,
       COUNT( DISTINCT name ) AS number
FROM dfs.drillworkshop.`*.csvh`
GROUP BY JobTitle
ORDER BY avg_salary DESC
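Because the drillworkshop workspace was created with "writable": true, the new table is written into that folder (as Parquet files by default) and can then be queried like any other source:

SELECT *
FROM dfs.drillworkshop.`salary_summary`
LIMIT 10;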
Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
Run natively in MySQL: 0.047 seconds

The same query run through Drill's mysql storage plugin: 0.366 seconds
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM mysql.stats.batting
INNER JOIN mysql.stats.teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY teams.name
ORDER BY hr_total DESC
Connecting to Drill
# A minimal sketch: the slide omits the connection setup, so this assumes the
# pydrill client talking to a Drill instance on the default web port.
from pydrill.client import PyDrill

drill = PyDrill(host='localhost', port=8047)

query_result = drill.query('''
SELECT JobTitle,
       AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary,
       COUNT( DISTINCT name ) AS number
FROM dfs.drillworkshop.`*.csvh`
GROUP BY JobTitle
ORDER BY avg_salary DESC
LIMIT 10
''')