This document summarizes key points from chapters 11 and 15 of Programming Hive. It discusses choosing compression codecs for intermediate and final outputs in Hive, how compression schemes such as LZO, Snappy, and the Burrows–Wheeler transform (BWT) work, and how to enable compression in Hive. It also covers Hive file formats such as SequenceFile, RCFile, and ORCFile. RCFile stores data column-wise and compresses its metadata headers with RLE; ORCFile provides faster reads than RCFile. The document recommends LZO and Snappy as fast compression codecs that still achieve good compression ratios.
5. #11 Choosing a Compression Codec
•Advantage: less network I/O, less disk space.
•Disadvantage: CPU overhead.
•In short: a trade-off.
Programming Hive Reading #4 5
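In Hive this trade-off is toggled per session. A minimal sketch of enabling compression for intermediate and final output (the standard Hadoop properties and codec class names; the specific codec choices here are only illustrative):

```sql
-- Compress intermediate data passed between MapReduce stages
-- (a cheap codec like Snappy usually pays off here)
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job/table output
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```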
6. #11 Choosing a Compression Codec
•“why do we need different compression schemes?”
•speed
•minimizing size
•‘splittable’ or not
7. #11 Choosing a Compression Codec
•“why do we need different compression schemes?”
http://comphadoop.weebly.com/
8. take a break : algorithm
•lossless compression
•LZ77 (LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with static Huffman coding)
•BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)
•lossy compression
•for JPEG, MPEG, etc... (snip.)
9. take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
10. take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
11. take a break : algorithm
•Burrows–Wheeler Transform (BWT)
•block sorting
•bwt(“abracadabra$”) = “ard$rcaaaabb”
Rotations      Sorted         Last
abracadabra$   $abracadabra   a
bracadabra$a   a$abracadabr   r
racadabra$ab   abra$abracad   d
acadabra$abr   abracadabra$   $
cadabra$abra   acadabra$abr   r
adabra$abrac   adabra$abrac   c
dabra$abraca   bra$abracada   a
abra$abracad   bracadabra$a   a
bra$abracada   cadabra$abra   a
ra$abracadab   dabra$abraca   a
a$abracadabr   ra$abracadab   b
$abracadabra   racadabra$ab   b
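The block-sorting transform above can be sketched naively in a few lines (class and method names are ours; real implementations such as bzip2 use suffix arrays instead of materializing every rotation):

```java
import java.util.Arrays;

// Naive Burrows–Wheeler transform: build every rotation of the input,
// sort them, and read off the last column of the sorted matrix.
public class Bwt {
    static String bwt(String s) {
        int n = s.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            // rotation starting at position i
            rotations[i] = s.substring(i) + s.substring(0, i);
        }
        Arrays.sort(rotations); // '$' (ASCII 36) sorts before the letters
        StringBuilder last = new StringBuilder(n);
        for (String r : rotations) {
            last.append(r.charAt(n - 1)); // last column
        }
        return last.toString();
    }

    public static void main(String[] args) {
        System.out.println(bwt("abracadabra$")); // ard$rcaaaabb
    }
}
```

The output groups identical characters together (runs of a's and b's), which is what makes the subsequent Move-to-Front and Huffman stages of BZIP2 effective.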
12. take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
13. take a break : algorithm
•LZO
•“Compression is comparable in speed to DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
14. take a break : algorithm
•Google Snappy
•“very high speeds and reasonable compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
15. take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
16. take a break : algorithm
•“Add support for LZ4 compression”
•fixed in versions 0.23.1, 0.24.0 (CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
17. take a break : Implementation Codec
public class HogeCodec implements CompressionCodec {
  @Override
  public CompressionOutputStream createOutputStream(OutputStream out,
      Compressor compressor) throws IOException {
    // bufferSize and compressionOverhead are fields of the codec (not shown)
    return new BlockCompressorStream(out, compressor, bufferSize,
        compressionOverhead);
  }
  @Override
  public Class<? extends Compressor> getCompressorType() {
    return HogeCompressor.class;
  }
  @Override
  public CompressionOutputStream createOutputStream(OutputStream out)
      throws IOException {
    return createOutputStream(out, createCompressor());
  }
  @Override
  public Compressor createCompressor() {
    return new HogeCompressor();
  }
  @Override
  public CompressionInputStream createInputStream(InputStream in)
      throws IOException {
    return createInputStream(in, createDecompressor());
  }
  ............
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
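For Hadoop/Hive to pick up a custom codec like the one above, it has to be registered in the configuration; a sketch (the `com.example.HogeCodec` class name is hypothetical, matching the snippet):

```xml
<!-- core-site.xml: append the custom codec to the codec list -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.example.HogeCodec</value>
</property>
```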
23. #11 Sequence File
•Sequence File Format
• Header
• Record
• Record length
• Key length
• Key
• Value
• A sync marker every few hundred bytes or so.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
24. #11 Sequence File
•Compression Type
•NONE : no compression
•RECORD : compresses each record individually
•BLOCK : compresses multiple records per block
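When Hive writes SequenceFile output, the compression type above is selected per session; a minimal sketch (BLOCK generally compresses best because it compresses many records together):

```sql
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE | RECORD | BLOCK
```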
28. #15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (
........
)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
29. #15 Record Format
•RCFile (Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•strong adaptivity to dynamic data access patterns
•ref. “A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems” (ICDE ’11)
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
30. #15 Record Format
•RCFile Format
•1 record corresponds to a row group
•1 HDFS block holds one or more row groups
•Row Group
•a sync marker
•metadata header
•table data
•uses the RLE algorithm to compress the ‘metadata header’ section
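A toy illustration of run-length encoding, the idea RCFile applies to the highly repetitive metadata header (this string version is only a sketch of the technique, not the RCFile implementation):

```java
// Toy run-length encoder: each run of a repeated character becomes
// <count><char>, e.g. "aaaa" -> "4a". Length fields in an RCFile
// metadata header repeat in the same way, which is why RLE shrinks
// that section well.
public class Rle {
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) {
                j++; // extend the current run
            }
            out.append(j - i).append(s.charAt(i));
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaabbbcc")); // 4a3b2c
    }
}
```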
31. #15 Record Format
•Implementation of RCFile
•Input Format
•o.a.h.h.ql.io.RCFileInputFormat
•Output Format
•o.a.h.h.ql.io.RCFileOutputFormat
•SerDe
•o.a.h.h.serde2.columnar.ColumnarSerDe
32. #15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the row-group size (default: 4 MB)
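Larger row groups can improve compression at the cost of write-side memory; a sketch of raising the default before loading into an RCFile table (the 8 MB value is only an example):

```sql
SET hive.io.rcfile.record.buffer.size=8388608; -- 8 MB row groups (default 4 MB)
```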
33. #15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•“In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results.”
•“In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate.”