Study after study shows that data scientists and analysts spend between 50% and 90% of their time preparing their data for analysis. Using Drill, you can dramatically reduce the time it takes to go from raw data to insight. This course will show you how.
The course material for this presentation is available at https://github.com/cgivre/data-exploration-with-apache-drill
Data Exploration with Apache Drill: Day 1 (Charles Givre)
Study after study shows that data scientists spend 50-90 percent of their time gathering and preparing data. In many large organizations this problem is exacerbated by data being stored on a variety of systems, with different structures and architectures. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in place using standard ANSI SQL, without having to define complex schemata or rebuild their entire data infrastructure. In this talk I will introduce the audience to Apache Drill, including some hands-on exercises, and present a case study of how Drill can be used to query a variety of data sources. The presentation will cover:
* How to explore and merge data sets in different formats
* Using Drill to interact with other platforms, such as Python
* Exploring data stored on different machines
Drilling Cyber Security Data With Apache Drill (Charles Givre)
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
• Distributed datasets loaded into named columns (similar to relational DBs or Python DataFrames).
• Can be constructed from existing RDDs or external data sources.
• Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
• APIs available in Python, Java, Scala and R.
• Bytecode generation and optimization using Catalyst Optimizer.
• Simpler DSL to perform complex and data heavy operations.
• Faster runtime performance than vanilla RDDs.
OrientDB vs Neo4j - Comparison of query/speed/functionality (Curtis Mosters)
This presentation gives an overview of OrientDB and Neo4j. It also compares some specific queries, their speed, and the overall functionality of both databases.
The queries may not be optimized in either case, but they produce the same results and are both written as queries. In Neo4j you would normally do this in Java code, but that is much harder to write, so this presentation is a direct comparison rather than an attempt to get the best possible results.
The comparison is done with real data, roughly 200 GB in total.
Working With a Real-World Dataset in Neo4j: Import and Modeling (Neo4j)
This webinar will cover how to work with a real-world dataset in Neo4j, with a focus on how to build a graph from an existing dataset (in this case a series of JSON files). We will explore how to performantly import the data into Neo4j, both for an initial import and for scaling writes in your graph application. We will demonstrate different approaches for data import (neo4j-import, LOAD CSV, and the official Neo4j drivers), and discuss when it makes sense to use each import technique. If you've ever asked these questions, then this webinar is for you!
- How do I design a property graph model for my domain?
- How do I use the official Neo4j drivers?
- How can I deal with concurrent writes to Neo4j?
- How can I import JSON into Neo4j?
This talk covers experiences from writing a foreign data wrapper (FDW) for Informix and discusses differences between PostgreSQL 9.1 and 9.2, the new writable API in the upcoming 9.3 release, data type mapping and conversion, optimizer support, and performance-related topics.
The talk tries to give attendees an overall idea of the techniques and pitfalls they may encounter when they want to write their own.
Mindmap: Oracle to Couchbase for developers (Keshav Murthy)
This deck provides a high-level comparison between Oracle and Couchbase: Architecture, database objects, types, data model, SQL & N1QL statements, indexing, optimizer, transactions, SDK and deployment options.
Importing Data into Neo4j quickly and easily - StackOverflow (Neo4j)
In this GraphConnect presentation Mark and Michael show several ways to import large amounts of highly connected data from different formats into Neo4j. Both Cypher's LOAD CSV and the bulk importer are demonstrated, along with many tips.
We use the well-known StackOverflow Q&A site data, which is interestingly very graphy.
This webinar will give an overview of CREATE STATISTICS in PostgreSQL. This command allows the database to collect multi-column statistics, helping the optimizer understand dependencies between columns, produce more accurate estimates, and build better query plans.
The following key topics will be covered during the webinar:
- Why CREATE STATISTICS may be needed at all
- How the command works
- Which cases CREATE STATISTICS already addresses
- What improvements are in the queue for future PostgreSQL versions (either already committed to PostgreSQL 13 or beyond)
These are the slides from my presentation on Running R in the Database using Oracle R Enterprise. The second half of the presentation is a live demo of using Oracle R Enterprise. Unfortunately, the demo is not included in these slides.
Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). You will also learn how Spark's concepts resemble those well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
A LISP hacker for several years, and a Java guy for some more, Chris turned to Clojure for production code in 2011. He has been Project Lead, Software Architect, and VP Tech in the meantime, and is interested in AI and data visualization.
Now, working on the heart of data driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API-wrapper ‘chrisbetz/sparkling’ to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
Lucene and Solr provide a number of options for query parsing, and these are valuable tools for creating powerful search applications. This presentation, given at the 2013 Lucene Revolution, reviews the role that advanced query parsing can play in building systems, including: relevancy customization; taking input from user-interface variables such as position on a website or geographical indicators; which sources are to be searched; and third-party data sources. Query parsing can also enhance data security. Best practices for building and maintaining complex query parsing rules are discussed and illustrated. Chief Architect Paul Nelson provides this compelling presentation.
Search Technologies provides relevancy tuning services for Solr. For further information, see http://www.searchtechnologies.com/solr-lucene-relevancy.html
http://www.searchtechnologies.com
Couchbase 5.5: N1QL and Indexing featuresKeshav Murthy
This deck contains a high-level overview of the N1QL and indexing features in Couchbase 5.5: ANSI joins, hash join, index partitioning, grouping, aggregation performance, auditing, query performance features, and infrastructure features.
In this introduction to Apache Hive the following topics are covered:
1. Hive Origin
2. Hive philosophy and architecture
3. Hive vs. RDBMS
4. HiveQL and Hive Shell
5. Managing tables
6. Data types and schemas
7. Querying data
8. HiveODBC
9. Resources
The Extract-Transform-Load (ETL) process is one of the most time-consuming processes facing anyone who wishes to analyze data. Imagine if you could quickly, easily, and scalably merge and query data without having to spend hours in data prep. Well, you don't have to imagine it. You can, with Apache Drill. In this hands-on, interactive presentation Mr. Givre will show you how to unleash the power of Apache Drill and explore your data without any kind of ETL process.
What Does Your Smart Car Know About You? Strata London 2016 (Charles Givre)
In the last few years, auto makers and technology companies have introduced a variety of devices to connect cars to the Internet and use this connectivity to gather data about the vehicles’ activity, but these connected cars gather a considerable amount of data about their owners’ activities beyond what one might expect. In aggregate and combined with other datasets, this data represents a significant degradation of personal privacy as well as a potential security risk. As auto insurers and local governments start to require this data collection, consumers should be aware of the security risks as well as the potential privacy invasions associated with this unique type of data collection.
In a follow-up to his 2015 session at Strata + Hadoop World NYC, Charles Givre examines data gathered from sensors in automobiles. Charles focuses on what kinds of data cars are gathering and asks critical questions about whether the benefits this data provides outweigh the risks and cost to personal privacy—the inevitable result of this data collection.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool that can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10k nodes.
Merlin: The Ultimate Data Science Environment (Charles Givre)
Merlin is a virtual computing environment developed by data scientists for data scientists. Merlin is free and open source, and contains a suite of all the best open source data science tools including data visualization tools, programming languages, big data tools, databases, notebooks, IDEs, and much more. The goal of Merlin is to allow data scientists to do data science work, not sysadmin.
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle new formats like JSON, Parquet, and ORC, as well as the usual CSV, TSV, XML, and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by handling schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
Strata NYC 2015: What does your smart device know about you? (Charles Givre)
Devices that make up the Internet of Things (IoT) collect a monumental amount of data about their owners. In most cases, the data they gather benefits the owner of the device and performs some useful purpose for them. However, when viewed in aggregate, the data gathered can reveal an enormous amount of information about the devices’ owner that can be very invasive if this information were to fall into the wrong hands.
Over the course of several months, Charles Givre did an experiment in which he collected data from several IoT devices including a Nest Thermostat, the Automatic Car dongle, the Wink hub, and a few others in order to determine what could be learned about the owner of the devices. Givre approached this experiment like a law enforcement or intelligence investigation, beginning with a bit of seed knowledge about the target, and built a profile about the target using the data that was available via these devices’ APIs and the data they transmit over the internet.
This presentation is not about how to bypass the devices’ security features, hack them, or how to mess with people by randomly turning off their A/C; but rather focuses on the consequences of IoT devices collecting and storing data.
Join our experts Neeraja Rentachintala, Sr. Director of Product Management and Aman Sinha, Lead Software Engineer and host Sameer Nori in a discussion about putting Apache Drill into production.
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C... (The Hive)
SQL is one of the most widely used languages to access, analyze, and manipulate structured data. As Hadoop gains traction within enterprise data architectures across industries, the need for SQL for both structured and loosely structured data on Hadoop is growing rapidly. Apache Drill started off with the audacious goal of delivering consistent, millisecond ANSI SQL query capability across a wide range of data formats. At a high level, this translates to two key requirements: schema flexibility and performance. This session will delve into the architectural details of delivering these two requirements and will share with the audience the nuances and pitfalls we ran into while developing Apache Drill.
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the Open Source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and able to process Petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
The analysis of unstructured content and the growing demand for indicators that quickly summarize events as they unfold are two major trends that involve not only technical knowledge but also business knowledge that can add value to the companies, institutions, and people who need it.
Apache Storm is one of the paradigms born with the real-time era in mind. We will describe a business case that presents the challenge of capturing information and delivering actionable knowledge as quickly as possible. We will touch on matters of business, technology, and philosophy in relation to information.
James Colby Maddox Business Intelligence and Computer Science Portfolio (colbydaman)
This portfolio covers the business intelligence course work I have completed at Set Focus, and some of the course work I have completed at Kennesaw State University.
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering, but it also describes the purpose of every DBMS (database management system). And it's not a coincidence that these are so similar.
This talk looks at the patterns that recur throughout data management, such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of relational algebra and DBMS architecture. We then apply these patterns to the modern data pipeline, ETL, and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Cassandra SF 2015 - Repeatable, Scalable, Reliable, Observable Cassandra (aaronmorton)
Slides from my talk at Cassandra Summit 2015
http://cassandrasummit-datastax.com/agenda/repeatable-scalable-reliable-observable-cassandra/
thelastpickle.com
The Last Pickle: Repeatable, Scalable, Reliable, Observable Cassandra (DataStax Academy)
Apache Cassandra makes it possible to write code on a laptop and deploy to multi-region clusters with a few configuration changes. But what does it take to create repeatable, scalable, reliable, and observable clusters?
In this talk Aaron Morton, co-founder at The Last Pickle and Apache Cassandra committer, will discuss the tools and techniques they use. From environment planning to implementation with tools such as Chef, Sensu, Graphite, Riemann, and LogStash, this will be a discussion of the full stack ecosystem for successful projects.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
1. Data Exploration with Apache Drill - Day 2
Charles S. Givre
@cgivre
thedataist.com
linkedin.com/in/cgivre
2. Homework
Using the Baltimore Salaries dataset, write queries that answer the following questions:
1. In 2016, calculate the average difference between GrossPay and AnnualSalary by Agency. HINT: Include WHERE NOT( GrossPay ='' ) in your query. For extra credit, calculate the number of people in each Agency, and the min/max for the salary delta as well.
2. Find the top 10 individuals whose salaries changed the most between 2015 and 2016, both gain and loss.
3. (Optional Extra Credit) Using the various string manipulation functions, split the name field into two columns for the last name and first name. HINT: Don’t overthink this, and review the slides about the columns array if you get stuck.
3. Homework
Using the Baltimore Salaries dataset, write queries that answer the following questions:
1. In 2016, calculate the average difference between GrossPay and AnnualSalary by Agency. HINT: Include WHERE NOT( GrossPay ='' ) in your query. For extra credit, calculate the number of people in each Agency, and the min/max for the salary delta as well.
SELECT Agency,
AVG( TO_NUMBER( `AnnualSalary`, '¤' ) - TO_NUMBER( `GrossPay`, '¤' )) AS avg_SalaryDelta,
COUNT( DISTINCT `EmpName` ) AS emp_count,
MIN( TO_NUMBER( `AnnualSalary`, '¤' ) - TO_NUMBER( `GrossPay`, '¤' ) ) AS min_salary_delta,
MAX( TO_NUMBER( `AnnualSalary`, '¤' ) - TO_NUMBER( `GrossPay`, '¤' ) ) AS max_salary_delta
FROM dfs.drillclass.`baltimore_salaries_2016.csvh`
WHERE NOT( GrossPay ='' )
GROUP BY Agency
ORDER BY avg_SalaryDelta DESC
4. Homework
Find the top 10 individuals whose salaries changed the most between 2015 and 2016, both gain and loss.
SELECT data2016.`EmpName`,
data2016.`JobTitle` AS JobTitle_2016,
data2015.`JobTitle` AS JobTitle_2015,
data2016.`AnnualSalary` AS salary_2016,
data2015.`AnnualSalary` AS salary_2015,
(TO_NUMBER( data2016.`AnnualSalary`, '¤' ) - TO_NUMBER( data2015.`AnnualSalary`, '¤' )) AS salary_delta
FROM dfs.drillclass.`baltimore_salaries_2016.csvh` AS data2016
INNER JOIN dfs.drillclass.`baltimore_salaries_2015.csvh` AS data2015
ON data2016.`EmpName` = data2015.`EmpName`
ORDER BY salary_delta DESC
LIMIT 10
5. Homework
(Optional Extra Credit) Using the various string manipulation functions, split the name field into two columns for the last name and first name. HINT: Don’t overthink this, and review the slides about the columns array if you get stuck.
SELECT `EmpName`,
SPLIT( `EmpName`, ',' )[0] AS last_name,
SPLIT( `EmpName`, ',' )[1] AS first_name
FROM dfs.drillclass.`baltimore_salaries_2016.csvh`
6. Today, we will cover:
• Dates and Times
• Nested Data (JSON)
• Other data types
• Other data sources
• Programmatically connecting to Drill
9. Working with Dates & Times
TO_DATE( <field>, ‘<format>’)
TO_TIMESTAMP( <field>, ‘<format>’)
10. Working with Dates & Times
Symbol Meaning Presentation Examples
G era text AD
C century of era (>=0) number 20
Y year of era (>=0) year 1996
x weekyear year 1996
w week of weekyear number 27
e day of week number 2
E day of week text Tuesday; Tue
y year year 1996
D day of year number 189
M month of year month July; Jul; 07
d day of month number 10
a halfday of day text PM
K hour of halfday (0~11) number 0
h clockhour of halfday (1~12) number 12
H hour of day (0~23) number 0
k clockhour of day (1~24) number 24
m minute of hour number 30
s second of minute number 55
S fraction of second number 978
z time zone text Pacific Standard Time; PST
Z time zone offset/id zone -0800; -08:00; America/Los_Angeles
' escape for text delimiter
'' single quote literal '
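To see these symbols in action before the exercises, here is a minimal sketch (the date literals are invented, and the query uses Drill's VALUES clause instead of a file):

SELECT TO_DATE( '19 Mar 2017', 'dd MMM yyyy' ) AS parsed_date,
       TO_TIMESTAMP( '2017-03-19 00:15:28', 'yyyy-MM-dd HH:mm:ss' ) AS parsed_timestamp
FROM (VALUES(1))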
12. TO_CHAR( <field>, <format> )
SELECT JobTitle,
TO_CHAR( AVG( TO_NUMBER( AnnualSalary, '¤' )), '¤#,###.00' ) AS avg_salary,
COUNT( DISTINCT name ) AS number
FROM dfs.drillclass.`baltimore_salaries_2016.csvh`
GROUP BY JobTitle
Order By avg_salary DESC
14. Intervals
• P (Period) marks the beginning of a period of time.
• Y follows a number of years.
• M follows a number of months.
• D follows a number of days.
• T separates the date portion from the time portion.
• H follows a number of hours (0-24).
• M follows a number of minutes.
• S follows a number of seconds and optional milliseconds.
P249D (a period of 249 days)
15. Intervals
SELECT date2,
date5,
(TO_DATE( date2, 'MM/dd/yyyy' ) - TO_DATE( date5, 'yyyy-MM-dd' )) AS date_diff,
EXTRACT( day FROM (TO_DATE( date2, 'MM/dd/yyyy' ) - TO_DATE( date5, 'yyyy-MM-dd' )))
FROM dfs.drillclass.`dates.csvh`
16. Other Date/Time Functions
• AGE( timestamp ): Returns the interval between the given timestamp and the current date
• EXTRACT( field FROM time_exp ): Extracts a part of a date, time, or interval
• CURRENT_DATE()/CURRENT_TIME()/NOW()
• DATE_ADD()/DATE_SUB(): Adds an interval or number of days to a date, or subtracts one from it
For complete documentation: http://drill.apache.org/docs/date-time-functions-and-arithmetic/
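A quick sketch of a few of these (the timestamp literal is invented, and AGE is used in its single-argument form, which measures from the current date):

SELECT NOW() AS right_now,
       EXTRACT( year FROM NOW() ) AS this_year,
       AGE( CAST( '2016-10-03' AS TIMESTAMP ) ) AS elapsed
FROM (VALUES(1))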
17. In Class Exercise: Parsing Dates and Times
In this exercise you will find a data file called dates.csvh which contains 5 columns of random dates in various formats:
• date1 is in ISO 8601 format
• date2 is MM/DD/YYYY, ie: 03/12/2016
• date3 is: Sep 19, 2016
• date4 is formatted: Sun, 19 Mar 2017 00:15:28 -0700
• date5 is formatted like database dates, YYYY-mm-dd: 2016-10-03
For this exercise, complete the following steps:
1. Using the various methods we have discussed (CAST(), TO_DATE()), convert each column into a date (or time) as appropriate.
2. Reformat date5 so that it is in the same format as date3.
3. Find all the rows where date3 occurs after date5.
4. Create a histogram table of date2 by weekday, ie: Sunday 5, Monday 4, etc.
5. Find all the entries in date5 that are more than 1 year old.
18. SELECT CAST( date1 AS DATE )
FROM dfs.drillclass.`dates.csvh`

SELECT date2, TO_DATE( date2, 'MM/dd/yyyy' )
FROM dfs.drillclass.`dates.csvh`

SELECT date3, TO_DATE( date3, 'MMM dd, yyyy' )
FROM dfs.drillclass.`dates.csvh`

SELECT date4, TO_TIMESTAMP( date4, 'EEE, dd MMM yyyy HH:mm:ss Z' )
FROM dfs.drillclass.`dates.csvh`

SELECT date5, TO_DATE( date5, 'yyyy-MM-dd' )
FROM dfs.drillclass.`dates.csvh`
21. Complex data types: A data type which holds more than one value
• Array: A complex data type indexed by number
• Map: A complex data type indexed by a key
23. Arrays in Drill
SELECT columns[0] AS first_name,
columns[1] AS last_name,
columns[2] AS birthday
FROM dfs.drillclass.`customer_data.csv`
25. Maps (Key/Value Pairs) in Drill
SELECT parse_user_agent( columns[0] ) AS ua
FROM dfs.drillclass.`user-agents.csv`
Documentation for this function is available at: https://github.com/cgivre/drill-useragent-function
27. Maps (Key/Value Pairs) in Drill
SELECT parse_user_agent( columns[0] ) AS ua
FROM dfs.drillclass.`user-agents.csv`
{
"DeviceClass":"Desktop",
"DeviceName":"Macintosh",
"DeviceBrand":"Apple",
"OperatingSystemClass":"Desktop",
"OperatingSystemName":"Mac OS X",
…
"AgentName":"Chrome",
"AgentVersion":"39.0.2171.99",
"AgentVersionMajor":"39",
"AgentNameVersion":"Chrome 39.0.2171.99",
"AgentNameVersionMajor":"Chrome 39",
"DeviceCpu":"Intel"
}
29. Maps (Key/Value Pairs) in Drill
SELECT uadata.ua.OperatingSystemName AS OS_Name
FROM (
SELECT parse_user_agent( columns[0] ) AS ua
FROM dfs.drillclass.`user-agents.csv`
) AS uadata
30. In Class Exercise:
The file user-agents.csv is a small sample of a list of user agents gathered from a server log during an attempted attack. Using this data, answer the following questions:
1. What was the most common OS?
2. What was the most common browser?
31. In Class Exercise:
The file user-agents.csv is a small sample of a list of user agents gathered from a server log during an attempted attack. Using this data, answer the following questions:
1. What was the most common OS?
2. What was the most common browser?
SELECT uadata.ua.AgentNameVersion AS Browser,
COUNT( * ) AS BrowserCount
FROM (
SELECT parse_user_agent( columns[0] ) AS ua
FROM dfs.drillclass.`user-agents.csv`
) AS uadata
GROUP BY uadata.ua.AgentNameVersion
ORDER BY BrowserCount DESC
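The deck shows the browser query; a sketch for the OS question follows the same pattern, assuming the OperatingSystemName key from the map output on slide 27:

SELECT uadata.ua.OperatingSystemName AS OS_Name,
       COUNT( * ) AS OSCount
FROM (
    SELECT parse_user_agent( columns[0] ) AS ua
    FROM dfs.drillclass.`user-agents.csv`
) AS uadata
GROUP BY uadata.ua.OperatingSystemName
ORDER BY OSCount DESC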
41. SELECT row_data[0] AS first_name,
row_data[1] AS last_name,
row_data[2] AS birthday
FROM
(
SELECT FLATTEN( data ) AS row_data
FROM dfs.drillclass.`split.json`
) AS split_data
48. SELECT FLATTEN( KVGEN( first_name ) )['value'] AS firstname
FROM dfs.drillclass.`columns.json`
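For intuition: if first_name in columns.json is a map such as {"0": "Bob", "1": "Alice"} (a hypothetical shape inferred from the query), then KVGEN(first_name) produces [{"key":"0","value":"Bob"}, {"key":"1","value":"Alice"}], FLATTEN unnests that array into one row per element, and ['value'] keeps just the names.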
49. SELECT first_name, last_name, birthday
FROM
(
SELECT row_number() OVER (ORDER BY '1') AS rownum,
FLATTEN( KVGEN( first_name ))['value'] AS first_name
FROM dfs.drillclass.`columns.json`
) AS tbl1
JOIN
(
SELECT row_number() OVER (ORDER BY '1') AS rownum,
FLATTEN( KVGEN( last_name ))['value'] AS last_name
FROM dfs.drillclass.`columns.json`
) AS tbl2
ON tbl1.rownum = tbl2.rownum
JOIN
(
SELECT row_number() OVER (ORDER BY '1') AS rownum,
FLATTEN( KVGEN( birthday ))['value'] AS birthday
FROM dfs.drillclass.`columns.json`
) AS tbl3 ON tbl1.rownum = tbl3.rownum
57. In Class Exercise
Using the Baltimore Salaries JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
HINT: Don’t forget to CAST() the columns…
HINT 2: GROUP BY does NOT support aliases.
58. In Class Exercise
Using the JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
SELECT raw_data[9] AS job_title,
AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary,
COUNT( DISTINCT raw_data[8] ) AS person_count
FROM
(
SELECT FLATTEN( data ) AS raw_data
FROM dfs.drillclass.`baltimore_salaries_2016.json`
) AS salaries
GROUP BY raw_data[9]
ORDER BY avg_salary DESC
70. Networking Functions
• inet_aton( <ip> ): Converts an IPv4 address to an integer
• inet_ntoa( <int> ): Converts an integer to an IPv4 address
• is_private( <ip> ): Returns true if the IP is private
• in_network( <ip>, <cidr> ): Returns true if the IP is in the CIDR block
• getAddressCount( <cidr> ): Returns the number of IPs in a CIDR block
• getBroadcastAddress( <cidr> ): Returns the broadcast address of a CIDR block
• getNetmask( <cidr> ): Returns the netmask of a CIDR block
• getLowAddress( <cidr> ): Returns the low IP of a CIDR block
• getHighAddress( <cidr> ): Returns the high IP of a CIDR block
• parse_user_agent( <ua_string> ): Returns a map of user agent information
• urlencode( <url> ): Returns a URL-encoded string
• urldecode( <url> ): Decodes a URL-encoded string
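A small sketch of a few of these functions on literal values (the address and CIDR block are invented):

SELECT inet_aton( '192.168.1.10' ) AS ip_as_int,
       is_private( '192.168.1.10' ) AS private_ip,
       in_network( '192.168.1.10', '192.168.1.0/24' ) AS in_block
FROM (VALUES(1))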
71. In Class Exercise
There is a file in the repo called ‘hackers-access.httpd’ which is an HTTPD server log. Write queries to determine:
1. What is the most common browser?
2. What is the most common operating system?
72. In Class Exercise
SELECT ua.uadata.OperatingSystemNameVersion AS operating_system,
COUNT( * ) AS os_count
FROM
(
SELECT parse_user_agent(`request_user-agent`) AS uadata
FROM dfs.drillclass.`hackers-access.httpd`
) AS ua
GROUP BY ua.uadata.OperatingSystemNameVersion
ORDER BY os_count DESC
74. What if you wanted all the unique IP addresses in your server log?
75. A Quick Demo…
SELECT DISTINCT connection_client_host
FROM dfs.drillclass.`hackers-access.httpd`
77. A Quick Demo…
SELECT DISTINCT connection_client_host
FROM dfs.drillclass.`hackers-access.httpd`
Now what if you wanted these IPs in order?
78. A Quick Demo…
SELECT DISTINCT connection_client_host
FROM dfs.drillclass.`hackers-access.httpd`
WHERE regexp_matches(`connection_client_host`, '(\d{1,3}\.){3}\d{1,3}')
80. What if we only wanted IPs within a certain range?
81. A Quick Demo…
SELECT DISTINCT connection_client_host
FROM dfs.drillclass.`hackers-access.httpd`
WHERE regexp_matches(`connection_client_host`, '(\d{1,3}\.){3}\d{1,3}')
AND inet_aton( `connection_client_host` ) >= inet_aton( '23.94.10.8' )
AND inet_aton( `connection_client_host` ) < inet_aton( '31.187.79.31' )
ORDER BY inet_aton( `connection_client_host` ) ASC
82. What if we wanted to know the locations of the IPs that requested certain pages?
83. A Quick Demo…
SELECT getCountryName( connection_client_host ) AS ip_country,
COUNT( DISTINCT connection_client_host ) AS unique_ips
FROM dfs.drillclass.`hackers-access.httpd`
WHERE regexp_matches(`connection_client_host`, '(\d{1,3}\.){3}\d{1,3}')
GROUP BY getCountryName( connection_client_host )
ORDER BY unique_ips DESC
85. Log Files
• Drill does not natively support reading log files… yet
• If you are NOT using Merlin, included in the GitHub repo are several .jar files. Please take a second and copy them to <drill directory>/jars/3rdparty
86. Log Files
070823 21:00:32 1 Connect root@localhost on test1
070823 21:00:48 1 Query show tables
070823 21:00:56 1 Query select * from category
070917 16:29:01 21 Query select * from location
070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
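Once the jars are copied in, the log reader is wired up by adding a format entry to the dfs storage plugin configuration. The sketch below shows what such an entry might look like for the MySQL log above; the key names and regex are assumptions modeled on the drill-logfile-plugin README, not verified syntax:

"formats": {
  "log": {
    "type": "log",
    "extensions": ["log"],
    "fieldNames": ["event_date", "event_time", "pid", "action", "query_text"],
    "pattern": "(\\d{6})\\s(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s(\\w+)\\s+(.+)"
  }
}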
90. In Class Exercise
There is a file in the repo called ‘firewall.log’ which contains entries in the following format:
Dec 12 03:36:23 sshd[41875]: Failed password for root from 222.189.239.10 port 1350 ssh2
Dec 12 03:36:22 sshd[41875]: Failed password for root from 222.189.239.10 port 1350 ssh2
Dec 12 03:36:22 sshlockout[15383]: Locking out 222.189.239.10 after 15 invalid attempts
Dec 12 03:36:22 sshd[41875]: Failed password for root from 222.189.239.10 port 1350 ssh2
Dec 12 03:36:22 sshlockout[15383]: Locking out 222.189.239.10 after 15 invalid attempts
Dec 12 03:36:22 sshd[42419]: Failed password for root from 222.189.239.10 port 2646 ssh2
In this exercise:
1. Write a regex to extract the date, process type, PID, from IP, and any other information you believe may be useful from this log.
2. Use that regex to configure Drill to query this data.
3. Find all the records where the IP is in the CIDR block: 61.160.251.128/28.
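One possible sketch of a solution (the regex, field names, and configuration are an illustration, not the deck's official answer): a pattern such as (\w{3}\s+\d+\s\d{2}:\d{2}:\d{2})\s(\w+)\[(\d+)\]:\s(.*?(\d{1,3}(?:\.\d{1,3}){3}).*) captures the date, process, PID, full message, and IP. With the log format configured so the IP lands in a field called, say, `ip`, part 3 reduces to the in_network() function from slide 70:

SELECT *
FROM dfs.drillclass.`firewall.log`
WHERE in_network( `ip`, '61.160.251.128/28' )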
101. Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
102. Connecting other Data Sources
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM batting
INNER JOIN teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY batting.teamID
ORDER BY hr_total DESC
MySQL: 0.047 seconds
103. Connecting other Data Sources
MySQL: 0.047 seconds
Drill: 0.366 seconds
SELECT teams.name, SUM( batting.HR ) as hr_total
FROM mysql.stats.batting
INNER JOIN mysql.stats.teams ON batting.teamID=teams.teamID
WHERE batting.yearID = 1988 AND teams.yearID = 1988
GROUP BY teams.name
ORDER BY hr_total DESC
105. {
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:54310",
  "config": null,
  "workspaces": {
    "demodata": {
      "location": "/user/merlinuser/demo",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}
Just like DFS, except you specify a link to the Hadoop namenode.
106. SELECT name, SUM( CAST( HR AS INT)) AS HR_Total
FROM hdfs.demodata.`Teams.csvh`
WHERE yearID=1988
GROUP BY name
ORDER BY HR_Total DESC
112. Connecting to Drill
from pydrill.client import PyDrill

drill = PyDrill(host='localhost', port=8047)
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')
113. Connecting to Drill
query_result = drill.query('''
SELECT JobTitle,
AVG( CAST( LTRIM( AnnualSalary, '$' ) AS FLOAT) ) AS avg_salary,
COUNT( DISTINCT name ) AS number
FROM dfs.drillclass.`*.csvh`
GROUP BY JobTitle
ORDER BY avg_salary DESC
LIMIT 10
''')