AWS Big Data Demystified #2
Athena, Spectrum, EMR, Hive
Omid Vahdaty, Big Data Ninja
TODAY’S BIG DATA
APPLICATION STACK
PaaS and DC...
Big Data Generic Architecture | Summary
Data Collection
S3
Data Transformation
Data Modeling
Data Visualization
Agenda for Today...
● In order of complexity:
○ Athena
○ Redshift Spectrum
○ EMR
○ Hive
○ Performance TIPS
Big Data Jargon
● SQL | Schema | Database | Table | DDL
● Ad Hoc query / power Query / Concurrency
● PaaS
● External Table
● Metastore (GLUE)
● Compression
● Partition + bucketing
● SerDe (JSON, regex, Parquet, etc.)
● Data formats: Parquet, ORC, Avro
● Hadoop file system [HDFS]
● Yarn
● Tez engine
● S3
Introduction AWS Athena SQL
Athena Demo
● Features
● Console: quick Demo - convert to columnar
● Bugs (annoying compiler errors)
Athena & Hive Convert to Columnar example
https://amazon-aws-big-data-demystified.ninja/2018/05/15/converting-tpch-data-from-row-based-to-columnar-via-hive-or-sparksql-and-run-ad-hoc-queries-via-athena-on-columnar-data/
Behind the scenes
● Uses Presto for queries
○ Runs in memory
○ (see the Presto documentation)
● Uses Hive for
○ DDL functions
○ Complex data types
○ Saving temporary results to disk
● Relies heavily on Parquet
○ Compression
○ Metadata for aggregations
Concurrency
● 5 concurrent queries per AWS account
● Can be increased by support ticket.
Billing
● Failed queries are not billed. Canceled queries are billed for the data
they scanned before being canceled.
● Billing is based on the compressed size of the data scanned, not the
uncompressed size, which is good for the end user.
● $5 per TB scanned.
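A back-of-the-envelope sketch of the per-TB billing rule, in Python. It assumes the $5 per TB price quoted on this slide and treats 1 TB as 1024^4 bytes; the function name and the sample sizes are made up for illustration, and real bills may include minimums or rounding that are not modeled here.

```python
PRICE_PER_TB_USD = 5.0      # price quoted on the slide; check current pricing
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the estimated cost in USD for scanning `bytes_scanned` bytes."""
    return PRICE_PER_TB_USD * bytes_scanned / BYTES_PER_TB


# Hypothetical data set: 2 TB of raw text that compresses to 0.5 TB.
raw_cost = estimate_query_cost(2 * BYTES_PER_TB)
compressed_cost = estimate_query_cost(BYTES_PER_TB // 2)
print(f"raw: ${raw_cost:.2f}, compressed: ${compressed_cost:.2f}")
```

Because billing follows the bytes actually scanned, compressing the same data cuts the cost of a full scan in proportion to the compression ratio.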
Connection
● Web GUI
● JDBC, with wrappers available for other languages.
● http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
● Quicksight, SQL workbench etc.
Serde
● SerDes are preinstalled
● Compression is supported for all formats
○ Parquet - Snappy is the default, and you can change it (decompression is fast)
○ ORC - zlib compression
○ Apache web server logs - RegexSerDe
Parquet vs Text
● Parquet
○ Columnar
○ Schema segregated into the footer
● Text gzip = not columnar, but compressed.
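To make the row-based vs. columnar distinction concrete, here is a small Python sketch. It is only a conceptual model (plain lists and dicts, not the Parquet file format), and the sample records are invented.

```python
rows = [
    {"id": 1, "country": "IL", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "IL", "amount": 5.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every row object is visited to sum a single field.
total_from_rows = sum(r["amount"] for r in rows)
# Columnar layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
print(total_from_rows, total_from_columns)
```

An aggregation over one column only needs that column's values, which is why columnar formats tend to scan far less data for analytic queries.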
Converting to columnar
● Hive
● spark/ SparkSQL
Partition
● Why?
○ Reduces costs
○ Increase performance
● How?
○ Format: dt=2018-07-23
○ Other layouts are available:
■ /2018/07/23
■ And more
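The two partition layouts mentioned here can be sketched in Python as follows. The bucket name and prefix are hypothetical, and the helper names are invented for this example.

```python
from datetime import date


def hive_style_prefix(base: str, day: date) -> str:
    """Key=value layout, e.g. .../dt=2018-07-23/"""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def path_style_prefix(base: str, day: date) -> str:
    """Plain directory layout, e.g. .../2018/07/23/"""
    return f"{base.rstrip('/')}/{day:%Y/%m/%d}/"


day = date(2018, 7, 23)
base = "s3://my-bucket/events"  # hypothetical bucket and prefix
print(hive_style_prefix(base, day))
print(path_style_prefix(base, day))
```

The key=value layout can be picked up by MSCK REPAIR TABLE, while the plain directory layout needs each partition to be added explicitly with ALTER TABLE ... ADD PARTITION.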
TIP: Ignore quotes in csv
CREATE EXTERNAL TABLE IF NOT EXISTS walla_mail_mime_inventory (
  bucket string, Key string, VersionId string, IsLatest string, IsDeleteMarker string, Size bigint,
  LastModifiedDate string, ETag string, StorageClass string, IsMultipartUploaded string, ReplicationStatus string,
  EncryptionStatus string
)
-- OpenCSVSerde strips the quote character that surrounds each field
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
LOCATION 's3://bucket/';
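Why the quote character matters can be seen with Python's standard csv module, used here only as a stand-in for what a quote-aware CSV SerDe does. The sample line is invented; it mimics an inventory row whose key contains a comma.

```python
import csv
import io

# Hypothetical inventory line: the key contains a comma, so it is quoted.
raw = '"my-bucket","reports/2018,07/file.csv","STANDARD"\n'

# Naive split on the separator breaks the quoted key into two fields.
naive = raw.strip().split(",")

# A quote-aware parser keeps the key intact and strips the quotes.
parsed = next(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(len(naive), naive)
print(len(parsed), parsed)
```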
Athena - Summary
● Use Athena when you are getting started
○ Ad hoc queries
○ Cheap
○ Forces you to work with external tables all the way.
● If you fully understand how to work with Athena → you understand big data @ AWS
○ It will be very much the same in Hive
○ It will be very much the same in SparkSQL
Introduction to AWS
Redshift Spectrum
Spectrum Demo
Console - Redshift Features
Benchmark overview
Console: Demo on large data set
● Use the table created on athena
● Run a query
Working Spectrum example
https://amazon-aws-big-data-demystified.ninja/2018/05/15/want-to-select-data-on-redshift-spectrum-which-was-created-at-athena/
External Table/Schema
● External schema:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
● External table:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
● Getting started:
○ http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html
○ http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html
Getting started
● Create the external schema. Make sure Spectrum is available in your region.
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::506754145427:role/mySpectrumRole'
create external database if not exists;
● Supported data types:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
● Bucket and cluster must be in same region
Partitions
● Add each partition manually?
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
● Tip: run MSCK REPAIR TABLE from Hive or Athena instead.
AWS Redshift Spectrum Tips
● more cluster nodes = more spectrum slices = more performance
● smaller nodes = more concurrency
● Be Sure to understand local vs External
● Make sure you understand the data use case
● Redshift Spectrum doesn't support nested data types such as STRUCT, ARRAY, and MAP → you need a custom solution for this via Hive.
● Data type conversions: string → varchar, double → double precision, and you are done.
● DO
○ GROUP BY clauses
○ Comparison conditions and pattern-matching conditions, such as LIKE.
○ Aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX.
○ String functions.
○ to bring back a huge amount of data from S3 into Amazon Redshift to transform and process.
● DO NOT: DISTINCT and ORDER BY.
○ Spectrum doesn't support DATE as a regular data type or the DATE transform function.
○ Be careful about putting large tables that are frequently joined together in Amazon S3.
○ CTAS is not supported from external table to external table, i.e. you cannot write to an external table - only
Redshift Spectrum Summary
● Spectrum →
○ requires redshift cluster
○ External Table READ ONLY! (no write)
● Work with spectrum →
○ If you have a huge ad hoc query (aggregations)
○ If you want to move some data from Redshift to S3 and analyze it later.
Introduction to EMR
EMR recap
● Hadoop Architecture
○ Master
○ Core
○ Task
○ HDFS
○ Yarn (container)
○ Engine: MR, TEZ
● Scale out/scale up
● Hadoop anti pattern - Join.
● AWS GLUE -
○ shared meta store…
○ And more, but not the topic for today.
EMR DEMO
● Console - how to create custom cluster
○ Show all tech options
○ GLUE (Presto, Spark, Hive)
○ Config
■ Maximize resource allocation
■ Dynamic resource allocation
■ Config Path
○ Bootstrap / step
○ Uniform instance/ fleet instances
○ Custom AMI
○ Roles
○ Show security
○ Cli to create cluster
EMR
● Tips on creating cheap cluster + performances
○ Auto scaling - based on
■ yarn available memory
■ Container Pending Ratio
○ Spots - bidding strategy
○ New instance group
○ Tasks instance with auto scaling!
○ Same size task as data node
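A rough Python sketch of a scale-out decision based on the Container Pending Ratio. It assumes the CloudWatch definition of that metric (pending containers divided by allocated containers, falling back to the pending count when nothing is allocated); the 0.75 threshold and the function names are hypothetical and would need tuning per workload.

```python
def container_pending_ratio(pending: int, allocated: int) -> float:
    """Pending containers relative to allocated containers."""
    if allocated == 0:
        # Assumed convention: with nothing allocated, report the pending count.
        return float(pending)
    return pending / allocated


def should_scale_out(pending: int, allocated: int, threshold: float = 0.75) -> bool:
    """Hypothetical rule: add task nodes when the ratio exceeds the threshold."""
    return container_pending_ratio(pending, allocated) > threshold


print(should_scale_out(pending=30, allocated=40))  # ratio 0.75, not above threshold
print(should_scale_out(pending=50, allocated=40))  # ratio 1.25, above threshold
```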
EMR summary
● Use custom cluster
○ Get to know: maximize resource allocation
○ Experiment with all the open source options (Hue, Zeppelin, Oozie, Ganglia)
● Use Glue to share the metastore
● Use task nodes (even without autoscaling, you can kill them with no impact)
● When you are ready
○ Spot instances
○ Auto scaling
Introduction to Hive
Hive Agenda
● Console
○ Hive over Hue
○ Hive over CLI
○ Hive over JDBC
● Create external table location S3 text
● Data types
● SerDe
● Create external table location S3 Parquet
● JSON
● External table
● Convert to columnar with partitions - AWS example
● Insert overwrite + dynamic partition
Hive is not...
● Not designed for online transaction processing (OLTP)
● Not a language for real-time queries and row-level updates
Hive is...
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides a SQL-like query language called HiveQL (HQL).
● Configuring the metastore means telling Hive where that database is stored.
Hive Architecture
[Architecture diagram; labels: JDBC, TEZ/Spark/..., AWS GLUE, HDFS OR S3]
Data Types
● Column Types
a. int / bigint
b. Strings: char/varchar
c. Timestamps, dates
d. Decimals
e. Union: a set of several data types
● Literals
a. Floating point, decimal point, null
● Complex Types
a. Arrays, struct, maps!
Supported file formats
● TEXTFILE (CSV, JSON)
● SEQUENCEFILE (sequence files act as a container for storing small files)
○ Uncompressed key/value records.
○ Record-compressed key/value records - only the values are compressed.
○ Block-compressed key/value records - keys and values are collected in blocks separately and compressed. The block size is configurable.
● ORC (recommended for Hive, local tables, and ACID transactions such as delete/update)
● Parquet (recommended for Spark and external tables)
SerDe
● SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO.
The interface handles both serialization and deserialization and also interpreting
the results of serialization as individual fields for processing.
● A SerDe allows Hive to read in data from a table, and write it back out to HDFS in
any custom format. Anyone can write their own SerDe for their own data formats.
● Supported:
● Avro (Hive 0.9.1 and later), http://avro4s-ui.landoop.com
● ORC (Hive 0.11 and later)
● RegEx
● Thrift
● Parquet (Hive 0.13 and later)
● CSV (Hive 0.14 and later)
● JsonSerDe (Hive 0.12 and later in hcatalog-core)
● For Hive releases prior to 0.12, Amazon provides a JSON SerDe available at
s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
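The idea behind a SerDe can be illustrated with a toy Python class. This is not Hive's Java SerDe interface, only a minimal sketch of the two directions (raw bytes on storage to typed fields, and back); the class name, the pipe delimiter, and the column list are all invented for the example.

```python
class DelimitedSerDe:
    """Toy serializer/deserializer for delimiter-separated text rows."""

    def __init__(self, columns, delimiter="|"):
        self.columns = columns          # list of (name, python_type) pairs
        self.delimiter = delimiter

    def deserialize(self, line):
        """Turn one raw text line into a dict of typed fields."""
        parts = line.rstrip("\n").split(self.delimiter)
        return {name: cast(value)
                for (name, cast), value in zip(self.columns, parts)}

    def serialize(self, row):
        """Turn a dict of fields back into one raw text line."""
        return self.delimiter.join(str(row[name]) for name, _ in self.columns)


serde = DelimitedSerDe([("source_ip", str), ("hits", int)])
row = serde.deserialize("10.0.0.1|42\n")
line = serde.serialize(row)
print(row, line)
```

The table definition supplies the column names and types, while the SerDe decides how a stored record maps onto them, which is why the same table DDL style works across CSV, JSON, regex, and columnar formats.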
File format summary
● CSV/ json ---> text
● Smaller than block size → sequence file
● Analytics: ORC/Parquet columnar based
● Avro → Row based. Used for intensive write use cases
Create table as parquet - local table
CREATE TABLE parquet_test (
  id int,
  str string,
  mp MAP<STRING,STRING>,
  lst ARRAY<STRING>,
  -- "struct" is a type keyword, so the column is named strct
  strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;
External Table
Hive tables can be created as EXTERNAL or
INTERNAL. This is a choice that affects how data is
loaded, controlled, and managed.
Use EXTERNAL tables when:
1. The data is also used outside of Hive. For
example, the data files are read and processed
by an existing program that doesn't lock the files.
2. Data needs to remain in the underlying location
even after a DROP TABLE.
Convert to Columnar
http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
● Write to the S3 bucket. If you make a typo in the bucket name, the data will still be written, just not to S3… :)
● Create an external table on the source data in the S3 bucket.
● Manage the partitions via MSCK, which identifies partitions that were manually added to the distributed file system.
● Create the target table on S3 as Parquet.
● Insert the data from source to destination.
● Query the data on S3, as Parquet, directly from Hive.
● Think in sets of files, not one file at a time.
Json SerDe Example
https://github.com/rcongiu/Hive-JSON-Serde
CREATE TABLE json_test1 (
one boolean,
three array<string>,
two double,
four string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
How to add a SerDe to Hive on EMR:
ADD JAR /home/hadoop/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
http://stackoverflow.com/questions/26644351/cannot-validate-serde-org-openx-data-jsonserde-jsonserde
Hive Schema From Json
https://github.com/quux00/hive-json-schema
How to get a schema from a JSON file:
java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar file.json my_table_name
Example (schema from json)
https://amazon-aws-big-data-demystified.ninja/2018/05/17/getting-a-sql-schema-from-json/
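The general idea of deriving a Hive schema from a sample JSON document can be sketched in a few lines of Python. This is a simplified illustration, not the logic of the hive-json-schema tool: the type mapping, the fallback to string, and the assumption that arrays are homogeneous are all choices made for this example.

```python
import json


def hive_type(value):
    """Map a decoded JSON value to a Hive type string (simplified)."""
    if isinstance(value, bool):      # checked before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous; empty arrays default to string.
        return f"array<{hive_type(value[0])}>" if value else "array<string>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"                  # null or unknown: fall back to string


def create_table_ddl(table, sample_json):
    """Build a CREATE TABLE statement from one sample JSON record."""
    record = json.loads(sample_json)
    cols = ",\n".join(f"  {name} {hive_type(v)}" for name, v in record.items())
    return (f"CREATE TABLE {table} (\n{cols}\n)\n"
            "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';")


sample = '{"one": true, "two": 19.5, "three": ["red", "blue"], "four": {"a": 1}}'
ddl = create_table_ddl("my_table_name", sample)
print(ddl)
```

A single sample record can mislead (missing keys, nulls, mixed types), so the generated DDL is best treated as a starting point to review by hand.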
Nested jsons
https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/
● Consider Avro (works better on nested columns than Parquet)
Lateral View (explode) && inline
● https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
● http://stackoverflow.com/questions/25088488/use-inlinearraystruct-struct-in-hive
● http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive
● Lateral view of multiple arrays:
http://stackoverflow.com/questions/20667473/hive-explode-lateral-view-multiple-arrays
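What a lateral view with explode does to a table can be modeled in Python: each input row is repeated once per element of its array column. The function and the sample data are invented for illustration, and the sketch mirrors the non-OUTER behavior, where rows with an empty array produce no output.

```python
def explode(rows, array_column, alias):
    """Yield one output row per element of each row's array column."""
    for row in rows:
        for element in row[array_column]:
            out = {k: v for k, v in row.items() if k != array_column}
            out[alias] = element
            yield out


pages = [
    {"page": "home", "ad_ids": [1, 2, 3]},
    {"page": "contact", "ad_ids": [3]},
    {"page": "empty", "ad_ids": []},   # dropped, like a non-OUTER lateral view
]
exploded = list(explode(pages, "ad_ids", "ad_id"))
for row in exploded:
    print(row)
```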
Dynamic partition
https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
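-- allow all partition columns to be resolved dynamically
set hive.exec.dynamic.partition.mode=nonstrict;
-- the partition column (dt) must come last in the SELECT list
INSERT OVERWRITE TABLE t PARTITION (dt)
SELECT
  source_ip,
  dns,
  dt
FROM bbl_dns
WHERE dt = current_date
ORDER BY source_ip;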
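Conceptually, a dynamic-partition insert routes each row to a directory named after the value of its partition column. The Python sketch below models that routing in memory; the function name and sample rows are hypothetical, and nothing here talks to Hive or S3.

```python
from collections import defaultdict


def write_dynamic_partitions(rows, partition_column):
    """Group rows by partition value, one 'directory' per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        key = f"{partition_column}={row[partition_column]}"
        # The partition value lives in the directory name, not in the data file.
        data = {k: v for k, v in row.items() if k != partition_column}
        partitions[key].append(data)
    return dict(partitions)


rows = [
    {"source_ip": "10.0.0.1", "dns": "a.example.com", "dt": "2018-07-22"},
    {"source_ip": "10.0.0.2", "dns": "b.example.com", "dt": "2018-07-23"},
    {"source_ip": "10.0.0.3", "dns": "c.example.com", "dt": "2018-07-23"},
]
result = write_dynamic_partitions(rows, "dt")
for directory, contents in sorted(result.items()):
    print(directory, len(contents))
```

Since the number of output directories depends on the data, a column with many distinct values can create a very large number of small partitions, which is one reason to use dynamic partitions carefully.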
Why Use ORC ?
1. ORC has performance optimizations.
2. ORC supports transactions: delete/update.
3. ORC has bucketing (index...).
4. ORC is supposed to be faster than Parquet.
5. Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does.
6. Apache ORC might be better if your file structure is flattened.
7. When Hive queries ORC tables, GC is called about 10 times less frequently. This might be nothing for many projects, but might be crucial for others.
Why Use Parquet
WHY NOT ORC: a couple of considerations for Parquet over ORC in Spark:
1. Easy creation of DataFrames in Spark. No need to specify schemas.
2. Works on highly nested data.
3. Works well with Spark. Spark and Parquet are a good combination.
4. ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like an ORC table with many numeric columns doesn't compress as well. This affects both zlib and Snappy compression.
Why Use ORC/Parquet
Confusing parts:
https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
● Hive has a vectorized ORC reader but no vectorized parquet reader.
● Spark has a vectorized parquet reader and no vectorized ORC reader.
● Spark performs best with parquet, hive performs best with ORC.
Hive Summary
● Understand the following concepts
○ Columnar
○ External table
○ Json parsing
○ ACID tables
○ ORC/Parquet
○ Lateral View + explode
● Bottom line demystified
○ Work with external tables all the way!
○ Use parquet (for future work with spark… )
○ ACID → use ORC + Hive
○ Use Serde to parse raw data
○ Use dynamic partitions when possible (carefully)
○ Use Hive to convert data to what you need - INSERT OVERWRITE
Big Data SQL performance Tips
@ AWS | Lesson Learned
Generally speaking, be familiar with:
1. When to normalize/denormalize
2. Columnar VS Row based
3. Storage types: AVRO/Parquet/ORC
4. Compression
5. Complex data types
6. When to partition? How much?
7. Processing ints is faster than processing strings, by roughly a factor of 10
8. What is the fastest DB in the world? [Hint: what is your use case?]
9. Network latency from Client to Server?
10. Encryption at rest / in motion = performance impact?
Tips for external / local tables: when to use which
Use a local table when
1. You use only an analytics DB such as Redshift, and you don't need a data lake
2. The data is small and temporary (inserts take time)
3. Performance is everything (be prepared to pay 5 to 10 times more on external)
4. You need to insert temporary results of your query and there is no option to write to an external table (Hive supports writing to an external table, but Athena doesn't)
Use an external table when
1. Cost is an issue - use transient clusters
2. Your data is already on S3
3. You need several DBs for several use cases, and you want to avoid inserts/ETLs
4. You want to decouple compute power from storage, e.g. Athena & Spectrum: infinite compute, infinite storage, pay only for what you use
Criterion | Redshift | Redshift Spectrum | Hive | Athena
Cost | High | Low | Medium | Low
Performance | Top 10 in the world | Fast enough... | Slow... | Fast enough...
Syntax | Postgres | Postgres | Hive | Presto
Data type advantages | No arrays | No arrays | Complex data types | Complex data types
Storage type | Columnar | Columnar | Columnar and row | Columnar and row
Use case | Joins, traditional DBMS, analytics: joins, AGG, order by | Aggregations ONLY | Transformation, advanced parsing, transient clusters | Ad hoc querying, not for big data
Anti-pattern | Temporary cluster | Joins | Joins, quick and dirty, simplicity | Joins / big data / inserts
Performance Tips for modeling
1. Choose the correct partition. [dt? win time?]
2. Big data anti-pattern: joins. Use a flat table whenever possible; it is easier to calculate.
3. Static raw data (one-time job as data enters) = precalculate what you need on a daily basis = storage is cheap...
a. Lookup tables - convert to int when possible, or even boolean if one exists. Don't use "like".
b. datepart of wintime = can you precalculate it into the fact table?
c. Minimize LIKEs.
d. String to int/boolean/bit when possible.
e. case = can you precalculate it into the fact table?
f. coalesce = can you precalculate it into the fact table?
g. Calculate the group by of index values (bluekai/gaid) in a separate job before the join job → reduces the running time of the join.
4. Dynamic data (recurring daily job) = compute is expensive...
a. Filter data by the same time interval across all fact tables.
b. Filter out rows that are not needed across all fact tables.
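The "precalculate as data enters" tips (strings to ints, date parts into the fact table) can be sketched in Python. The event records, the column names, and the helper are made up for this example; in practice the same transformation would be a one-time Hive or Spark job.

```python
def build_lookup(values):
    """Assign a small integer id to each distinct string, in first-seen order."""
    lookup = {}
    for value in values:
        lookup.setdefault(value, len(lookup))
    return lookup


raw_events = [
    {"browser": "chrome", "wintime": "2018-07-23 10:15:00"},
    {"browser": "firefox", "wintime": "2018-07-23 11:40:00"},
    {"browser": "chrome", "wintime": "2018-07-24 09:05:00"},
]
browser_ids = build_lookup(e["browser"] for e in raw_events)

# One-time job as data enters: store the int id and the precomputed date part,
# so later queries compare ints and never re-parse the timestamp string.
fact_rows = [
    {"browser_id": browser_ids[e["browser"]], "dt": e["wintime"][:10]}
    for e in raw_events
]
print(browser_ids)
print(fact_rows)
```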
If you must join...
1. Notice the order of the tables in the join: from small to big.
2. Filter as much as possible.
3. Use only the columns you must.
4. Use EXPLAIN to understand the query you are writing.
5. Use EXPLAIN to minimize rows (small table X small table may equal a big table).
6. Copy small tables to all data nodes (Redshift/Hive).
7. Use hints if possible.
8. Divide the job into smaller atomic steps.
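Several of these tips (filter early, keep the small table in memory on every node, probe with the big one) amount to a broadcast-style hash join. The Python sketch below shows the idea on in-memory lists; it is a conceptual model with invented table contents, not how Hive or Redshift implement their joins.

```python
def broadcast_hash_join(small, big, key):
    """Inner-join two lists of dicts on `key`, hashing the smaller side."""
    lookup = {}
    for row in small:                       # build phase: small table only
        lookup.setdefault(row[key], []).append(row)
    for row in big:                         # probe phase: one pass over the big table
        for match in lookup.get(row[key], []):
            yield {**match, **row}


countries = [{"country_id": 1, "name": "IL"}, {"country_id": 2, "name": "US"}]
events = [
    {"event": "click", "country_id": 1},
    {"event": "view", "country_id": 2},
    {"event": "click", "country_id": 9},    # no match: dropped by the inner join
]

# Filter as much as possible before joining.
clicks = [e for e in events if e["event"] == "click"]
joined = list(broadcast_hash_join(countries, clicks, "country_id"))
print(joined)
```

Only the small table is held in a hash map, so memory use is bounded by the small side while the big side is streamed once.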
Tips to avoid join
1. use flat tables with many columns - storage is cheap
2. use complex data types such as arrays, and nested arrays.
Hive tuning tips
1. Avoid ORDER BY if possible…
2. Minimize reduce steps.
3. Suggested config:
a. set hive.exec.parallel=true;
b. set hive.exec.parallel.thread.number=24;
c. set hive.tez.container.size=4092; (check this one carefully)
d. set hive.support.concurrency=true;
e. set hive.exec.compress.output=true;
f. set hive.exec.compress.intermediate=true;
g. set mapred.compress.map.output=true;
h. set hive.execution.engine=mr;
Redshift Tips
● https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
● Distribution style
● Sort key column identified, which acts like an index in other databases
● COMPOUND sort keys
● Caching - disk VS ram
Performance Summary
● Partitions…
● External table VS Local table
● Flat tables + complex data types VS Join
● Compression
● Columnar → Parquet
Lecture summary - starting with big data?
● Start with athena
● Already have Redshift? Consider Spectrum
● Use EMR Hive to transform any structured or semi-structured data to Parquet
● Fully nested? Consider Avro
Complex Q&A from the audience - post lecture notes
● When to use redshift? And when to use EMR (spark SQL, hive, presto)
○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/when-should-we-emr-and-when-to-use-redshift/
● Cost reduction on Athena:
○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/cost-reduction-on-athena/
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
● https://amazon-aws-big-data-demystified.ninja/
● Join our meetup, FB group and youtube channel
○ https://www.meetup.com/AWS-Big-Data-Demystified/
○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/
○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber

More Related Content

What's hot

Cassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analyticsCassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analytics
Anirvan Chakraborty
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
Omid Vahdaty
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
Martin Zapletal
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
HBaseCon
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose them
Datio Big Data
 
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoPostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
Uri Savelchev
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
ScyllaDB
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
DataStax Academy
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
ScyllaDB
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
Data Con LA
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
Datio Big Data
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
Keeyong Han
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
Vasil Remeniuk
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Santosh Sahoo
 
Dynamo db pros and cons
Dynamo db  pros and consDynamo db  pros and cons
Dynamo db pros and consSaniya Khalsa
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
DataWorks Summit/Hadoop Summit
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
EDB
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
ScyllaDB
 
Demystifying the Distributed Database Landscape
Demystifying the Distributed Database LandscapeDemystifying the Distributed Database Landscape
Demystifying the Distributed Database Landscape
ScyllaDB
 

What's hot (20)

Cassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analyticsCassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analytics
 
Kafka website activity architecture
Kafka website activity architectureKafka website activity architecture
Kafka website activity architecture
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose them
 
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoPostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
 
Powering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraphPowering a Graph Data System with Scylla + JanusGraph
Powering a Graph Data System with Scylla + JanusGraph
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Dynamo db pros and cons
Dynamo db  pros and consDynamo db  pros and cons
Dynamo db pros and cons
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
 
Demystifying the Distributed Database Landscape
Demystifying the Distributed Database LandscapeDemystifying the Distributed Database Landscape
Demystifying the Distributed Database Landscape
 

Similar to AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive

Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
Luis Marques
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Amazon Web Services
 
MariaDB Paris Workshop 2023 - Performance Optimization
MariaDB Paris Workshop 2023 - Performance OptimizationMariaDB Paris Workshop 2023 - Performance Optimization
MariaDB Paris Workshop 2023 - Performance Optimization
MariaDB plc
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
Chicago Hadoop Users Group
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
javier ramirez
 
Spark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest CórdobaSpark & Cassandra - DevFest Córdoba
Spark & Cassandra - DevFest Córdoba
Jose Mº Muñoz
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
Accumulo Summit
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
Jean-Baptiste Poullet
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
Demi Ben-Ari
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
Mars Lan
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
XinliShang1
 

AWS Big Data Demystified #2 | Athena, Spectrum, EMR, Hive

  • 1. AWS Big Data Demystified #2 Athena, Spectrum, EMR, Hive Omid Vahdaty, Big Data Ninja
  • 2. TODAY’S BIG DATA APPLICATION STACK PaaS and DC...
  • 3. Big Data Generic Architecture | Summary Data Collection S3 Data Transformation Data Modeling Data Visualization
  • 4. Agenda for Today... ● In order of complexity: ○ Athena ○ Redshift Spectrum ○ EMR ○ Hive ○ Performance tips
  • 5. Big Data Jargon ● SQL | Schema | Database | Table | DDL ● Ad hoc query / power query / concurrency ● PaaS ● External table ● Metastore (Glue) ● Compression ● Partitioning + bucketing ● SerDe (JSON, regex, Parquet, etc.) ● Data formats: Parquet, ORC, Avro ● Hadoop file system (HDFS) ● YARN ● Tez engine ● S3
  • 6. Introduction AWS Athena SQL AWS Big Data Demystified Omid Vahdaty, Big Data Ninja
  • 7. Athena Demo ● Features ● Console: quick Demo - convert to columnar ● Bugs (annoying compiler errors)
  • 8. Athena & Hive: Convert to Columnar example https://amazon-aws-big-data-demystified.ninja/2018/05/15/converting-tpch-data-from-row-based-to-columnar-via-hive-or-sparksql-and-run-ad-hoc-queries-via-athena-on-columnar-data/
  • 9. Behind the scenes ● Uses Presto for queries ○ Runs in memory ○ (search for the Presto documentation) ● Uses Hive for ○ DDL functions ○ Complex data types ○ Saving temporary results to disk ● Relies heavily on Parquet ○ Compression ○ Metadata for aggregations
  • 10. Concurrency ● 5 concurrent queries per AWS account ● Can be increased by support ticket.
  • 11. Billing ● Failed queries are not billed. Canceled queries, per the Athena pricing page, are billed for the data scanned before cancellation. ● Billing is based on compressed data size, not uncompressed, which is good for the end user. ● $5 per TB scanned.
  • 12. Connection ● Web GUI ● JDBC, with wrappers for other languages ● http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html ● QuickSight, SQL Workbench, etc.
  • 13. SerDe ● SerDes are preinstalled ● All formats support compression ○ Parquet - Snappy is the default; you can change it (decompression is fast) ○ ORC - zlib compression ○ Apache web server logs - RegexSerDe
  • 14. Parquet vs Text ● Parquet ○ Columnar ○ Schema kept separately, in the file footer ● Gzipped text = not columnar, but compressed.
  • 15. Converting to columnar ● Hive ● Spark / SparkSQL
  • 16. Partition ● Why? ○ Reduces costs ○ Increases performance ● How? ○ Format: dt=2018-07-23 ○ More formats available: ■ /2018/07/23 ■ And more
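To make the `dt=2018-07-23` layout concrete, here is a minimal HiveQL/Athena sketch. The table name, columns, and bucket are hypothetical; only the partition key format comes from the slide. It assumes the files already sit under `dt=...` prefixes in S3.

```sql
-- Hypothetical table, columns, and bucket (not from the slides).
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
  source_ip      string,
  url            string,
  response_bytes bigint
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/access_logs/';

-- Files are expected under prefixes such as
--   s3://my-example-bucket/access_logs/dt=2018-07-23/
-- Register every dt=... prefix found under LOCATION as a partition.
MSCK REPAIR TABLE access_logs;

-- Filtering on the partition column lets the engine skip all other
-- dt=... prefixes, which cuts both run time and the bytes scanned.
SELECT url, count(*) AS hits
FROM access_logs
WHERE dt = '2018-07-23'
GROUP BY url;
```

For a layout without key names, such as `/2018/07/23`, `MSCK REPAIR TABLE` cannot infer the partition value, so each partition would need an explicit `ALTER TABLE ... ADD PARTITION (dt='2018-07-23') LOCATION '...'` instead.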
  • 17. TIP: Ignore quotes in CSV CREATE EXTERNAL TABLE IF NOT EXISTS walla_mail_mime_inventory ( bucket string, Key string, VersionId string, IsLatest string, IsDeleteMarker string, Size bigint, LastModifiedDate string, ETag string, StorageClass string, IsMultipartUploaded string, ReplicationStatus string, EncryptionStatus string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"" ) LOCATION 's3://bucket/';
  • 18. Athena - Summary ● Use Athena when you are getting started ○ Ad hoc queries ○ Cheap ○ Forces you to work with external tables all the way ● If you fully understand how to work with Athena → you understand big data on AWS ○ It will be very much the same in Hive ○ It will be very much the same in SparkSQL
  • 19. Introduction to AWS Redshift Spectrum AWS Big Data Demystified Omid Vahdaty, Big Data Ninja
  • 20. Spectrum Demo Console - Redshift Features Benchmark overview Console: Demo on large data set ● Use the table created on Athena ● Run a query
  • 22. External Table/Schema ● External schema: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html ● External table: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html ● Getting started: ● http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html ● http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-external-table.html
  • 23. Getting started ● Create schema; make sure Spectrum is available in your region ○ create external schema spectrum ○ from data catalog ○ database 'spectrumdb' ○ iam_role 'arn:aws:iam::506754145427:role/mySpectrumRole' ○ create external database if not exists; ● Supported data types: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html ● Bucket and cluster must be in the same region
  • 24. Partitions? ● Manually add each partition? http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html ● Tip - use the MSCK REPAIR command in Hive or Athena.
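The two options on slide 24 can be sketched as follows. This is an illustration only: the `spectrum` schema name follows slide 23, while the table name, bucket, and partition value are hypothetical. The second option assumes the external schema is backed by the same data catalog that Athena or Hive uses.

```sql
-- Option 1 (run in Redshift): add one partition at a time.
ALTER TABLE spectrum.access_logs
ADD IF NOT EXISTS PARTITION (dt = '2018-07-23')
LOCATION 's3://my-example-bucket/access_logs/dt=2018-07-23/';

-- Option 2 (run in Athena or Hive, not in Redshift): discover every
-- dt=... prefix under the table location in a single command. Spectrum
-- then sees the partitions through the shared catalog.
MSCK REPAIR TABLE access_logs;
```

Option 1 is explicit but has to be repeated for each new day; option 2 trades that for a catalog scan over the whole table location.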
  • 25. AWS Redshift Spectrum Tips ● More cluster nodes = more Spectrum slices = more performance ● Smaller nodes = more concurrency ● Be sure to understand local vs external ● Make sure you understand the data use case ● Redshift Spectrum doesn't support nested data types, such as STRUCT, ARRAY, and MAP → need to customize a solution for this via Hive. ● Data type conversions: string → varchar, double → double precision → done. ● DO ○ GROUP BY clauses ○ Comparison conditions and pattern-matching conditions, such as LIKE. ○ Aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX. ○ String functions. ○ to bring back a huge amount of data from S3 into Amazon Redshift to transform and process. ● DO NOT: DISTINCT and ORDER BY. ○ Doesn't support DATE as a regular data type or the DATE transform function. ○ Be careful about putting large tables that frequently join together in Amazon S3 ○ CTAS is not supported from external table to external table, i.e. you cannot write to an external table - only
  • 26. Redshift Spectrum Summary ● Spectrum → ○ Requires a Redshift cluster ○ External tables are READ ONLY! (no write) ● Work with Spectrum → ○ If you have a huge ad hoc query (aggregations) ○ If you want to move some data out of Redshift to S3 and analyze it later.
  • 27. Introduction to EMR AWS Big Data Demystified Omid Vahdaty, Big Data ninja
  • 28. EMR recap ● Hadoop Architecture ○ Master ○ Core ○ Task ○ HDFS ○ Yarn (container) ○ Engine: MR, TEZ ● Scale out/scale up ● Hadoop anti pattern - Join. ● AWS GLUE - ○ shared meta store… ○ And more, but not the topic for today.
  • 29. EMR DEMO ● Console - how to create a custom cluster ○ Show all tech options ○ Glue (Presto, Spark, Hive) ○ Config ■ Maximize resource allocation ■ Dynamic resource allocation ■ Config path ○ Bootstrap / step ○ Uniform instance groups / instance fleets ○ Custom AMI ○ Roles ○ Show security ○ CLI to create cluster
  • 30. EMR ● Tips on creating a cheap cluster + performance ○ Auto scaling - based on ■ YARN available memory ■ Container pending ratio ○ Spots - bidding strategy ○ New instance group ○ Task instances with auto scaling! ○ Task nodes the same size as data nodes
  • 31. EMR summary ● Use a custom cluster ○ Get to know: maximize resource allocation ○ Experiment with all the open source options (Hue, Zeppelin, Oozie, Ganglia) ● Use Glue to share the metastore ● Use task nodes (even without auto scaling, you can kill them with no impact) ● When you are ready ○ Spot instances ○ Auto scaling
  • 32. Introduction to Hive AWS Big Data Demystified Omid Vahdaty, Big data ninja
  • 33. Hive Agenda ● Console ○ Hive over Hue ○ Hive over CLI ○ Hive over JDBC ● Create external table location S3 text ● Data types ● SerDe ● Create external table location S3 Parquet ● JSON ● External table ● Convert to columnar with partitions - AWS example ● Insert overwrite + dynamic partition
  • 34. Hive is not... ● Not designed for OnLine Transaction Processing (OLTP) ● Not a language for real-time queries and row-level updates
  • 35. Hive is... ● It stores the schema in a database and the processed data in HDFS. ● It is designed for OLAP. ● It provides a SQL-like language for querying, called HiveQL or HQL. ● Configuring the metastore means telling Hive where that database is stored
  • 36. Hive Architecture JDBC HDFS OR S3 AWS GLUE TEZ/Spark/
  • 37. Data Types ● Column Types a. int/bigint b. Strings: char/varchar c. Timestamp, dates d. Decimals e. Union: a set of several data types ● Literals a. Floating point, decimal point, null ● Complex Types a. Arrays, structs, maps!
  • 38. Supported file formats ● TEXTFILE (CSV, JSON) ● SEQUENCEFILE (Sequence files act as a container to store the small files.) ○ Uncompressed key/value records. ○ Record compressed key/value records – only ‘values’ are compressed here ○ Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable. ● ORC (recommended for Hive, local tables, ACID transactions such as delete/update) ● Parquet (recommended for Spark and external tables)
  • 39. SerDe ● SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing. ● A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats. ● Supported: ● Avro (Hive 0.9.1 and later), http://avro4s-ui.landoop.com ● ORC (Hive 0.11 and later) ● RegEx ● Thrift ● Parquet (Hive 0.13 and later) ● CSV (Hive 0.14 and later) ● JsonSerDe (Hive 0.12 and later in hcatalog-core) ● For Hive releases prior to 0.12, Amazon provides a JSON SerDe available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
  • 40. File format summary ● CSV/JSON → text ● Smaller than block size → sequence file ● Analytics: ORC/Parquet, column based ● Avro → row based; used for write-intensive use cases
  • 41. Create table as Parquet - local table CREATE TABLE parquet_test ( id int, str string, mp MAP<STRING,STRING>, lst ARRAY<STRING>, strct STRUCT<A:STRING,B:STRING>) PARTITIONED BY (part string) STORED AS PARQUET;
  • 42. External Table Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed. Use EXTERNAL tables when: 1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files. 2. Data needs to remain in the underlying location even after a DROP TABLE.
  • 43. Convert to Columnar http://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html ● Write to an S3 bucket - if you make a typo in the bucket name, the data will still be written. Just not in S3… :) ● Create an external table on the source data in the S3 bucket ● You need to manage the partitions via MSCK, which identifies partitions that were manually added to the distributed file system ● Create the target table on S3 as Parquet ● Insert data from source to destination ● Query the data on S3, as Parquet, directly from Hive ● Think files ==> not one file at a time.
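The steps on slide 43 can be sketched end to end in HiveQL. This is a minimal illustration, not the AWS example itself: the table names, columns, and bucket are hypothetical, and it assumes comma-delimited text files already partitioned under `dt=...` prefixes.

```sql
-- 1. Source: external table over the existing CSV text files.
CREATE EXTERNAL TABLE IF NOT EXISTS events_csv (
  event_id string,
  user_id  string,
  amount   double
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/events_csv/';

-- 2. Register the partitions that already exist under that prefix.
MSCK REPAIR TABLE events_csv;

-- 3. Target: external table stored as Parquet on S3.
CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (
  event_id string,
  user_id  string,
  amount   double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/events_parquet/';

-- 4. Copy source to target; with dynamic partitioning the partition
--    column must come last in the SELECT list.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_parquet PARTITION (dt)
SELECT event_id, user_id, amount, dt
FROM events_csv;

-- 5. Query the Parquet copy directly.
SELECT dt, sum(amount) AS total_amount
FROM events_parquet
GROUP BY dt;
```

Because both tables are external, dropping either one removes only the metadata and leaves the files in S3.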
  • 44. Json SerDe Example https://github.com/rcongiu/Hive-JSON-Serde CREATE TABLE json_test1 ( one boolean, three array<string>, two double, four string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE; https://github.com/quux00/hive-json-schema How to add a serde to EMR hive? ADD JAR /home/hadoop/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar http://stackoverflow.com/questions/26644351/cannot-validate-serde-org-openx-data-jsonserde-jsonserde
  • 45. Hive Schema From JSON https://github.com/quux00/hive-json-schema How to add a SerDe to EMR Hive? ADD JAR /home/hadoop/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar How to get a schema from a JSON file: java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar file.json my_table_name http://stackoverflow.com/questions/26644351/cannot-validate-serde-org-openx-data-jsonserde-jsonserde
  • 46. Example (schema from JSON) https://amazon-aws-big-data-demystified.ninja/2018/05/17/getting-a-sql-schema-from-json/
  • 48. Lateral View (explode) && inline
    ● https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
    ● http://stackoverflow.com/questions/25088488/use-inlinearraystruct-struct-in-hive
    ● http://stackoverflow.com/questions/11373543/explode-the-array-of-struct-in-hive
    ● Lateral view of multiple arrays: http://stackoverflow.com/questions/20667473/hive-explode-lateral-view-multiple-arrays
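As a rough illustration of what LATERAL VIEW with explode does to a table, here is a small Python sketch. It only mimics the row semantics (one output row per array element, and no output row for an empty array, as with a non-OUTER lateral view); the function name and the sample table are hypothetical.

```python
def lateral_view_explode(rows, array_col, alias):
    """Yield one output row per element of each row's array column.

    The array column itself is dropped from the output and replaced by
    `alias`, which holds a single element. Rows with an empty array
    produce no output.
    """
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_col}
        for item in row[array_col]:
            yield {**base, alias: item}


# Hypothetical table: one row per page, with an array of ad ids.
page_ads = [
    {"page": "front_page", "ads": [1, 2, 3]},
    {"page": "contact_page", "ads": [3, 4]},
    {"page": "empty_page", "ads": []},
]
exploded = list(lateral_view_explode(page_ads, "ads", "ad"))
for row in exploded:
    print(row)
```

Once the array is flattened like this, ordinary GROUP BY style aggregation over the single-valued column becomes possible.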
  • 49. Dynamic partition
    https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions
    set hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE t PARTITION (dt)
    SELECT source_ip, dns, dt
    FROM bbl_dns
    WHERE dt=current_date
    ORDER BY source_ip;
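What a dynamic partition insert does with the rows can be sketched in Python: the partition value is taken from each row instead of being given in the statement, and rows land under a `column=value` directory. This is an illustrative model only; the function name, the bucket path, and the sample rows are assumptions.

```python
from collections import defaultdict


def dynamic_partition_paths(rows, partition_col, table_location):
    """Group rows by the value of the partition column.

    Returns a mapping from partition directory (Hive-style `col=value`)
    to the rows stored there. The partition column is removed from the
    stored rows because its value is encoded in the directory name.
    """
    partitions = defaultdict(list)
    for row in rows:
        value = row[partition_col]
        data = {k: v for k, v in row.items() if k != partition_col}
        partitions[f"{table_location}/{partition_col}={value}"].append(data)
    return dict(partitions)


# Hypothetical rows shaped like the SELECT source_ip, dns, dt example.
rows = [
    {"source_ip": "10.0.0.1", "dns": "a.example", "dt": "2018-05-15"},
    {"source_ip": "10.0.0.2", "dns": "b.example", "dt": "2018-05-16"},
    {"source_ip": "10.0.0.3", "dns": "c.example", "dt": "2018-05-15"},
]
layout = dynamic_partition_paths(rows, "dt", "s3://my-bucket/t")
for path, part_rows in layout.items():
    print(path, len(part_rows))
```

The number of distinct values decides how many directories are created, which is why dynamic partitions should be used carefully on high-cardinality columns.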
  • 50. Why Use ORC?
    1. ORC has performance optimizations.
    2. ORC supports transactions: delete/update.
    3. ORC has bucketing (index...).
    4. ORC is supposed to be faster than Parquet.
    5. Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does.
    6. Apache ORC might be better if your file structure is flattened.
    7. When Hive queries ORC tables, GC is called about 10 times less frequently. This might be nothing for many projects, but might be crucial for others.
  • 51. Why Use Parquet
    WHY NOT ORC: a couple of considerations for Parquet over ORC in Spark:
    1. Easy creation of DataFrames in Spark. No need to specify schemas.
    2. Works on highly nested data.
    3. Works well with Spark. Spark and Parquet are a good combination.
    4. ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many number columns, it doesn't compress as well. This affects both zlib and snappy compression.
  • 52. Why Use ORC/Parquet
    Confusing parts (according to the linked discussion): https://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
    ● Hive has a vectorized ORC reader but no vectorized Parquet reader.
    ● Spark has a vectorized Parquet reader and no vectorized ORC reader.
    ● Spark performs best with Parquet, Hive performs best with ORC.
  • 53. Hive Summary
    ● Understand the following concepts
      ○ Columnar
      ○ External table
      ○ Json parsing
      ○ ACID tables
      ○ ORC/Parquet
      ○ Lateral View + explode
    ● Bottom line demystified
      ○ Work with external tables all the way!
      ○ Use Parquet (for future work with Spark…)
      ○ ACID → use ORC + Hive
      ○ Use a SerDe to parse raw data
      ○ Use dynamic partitions when possible (carefully)
      ○ Use Hive to convert data to what you need: INSERT OVERWRITE
  • 54. Big Data SQL Performance Tips @ AWS | Lessons Learned
    AWS Big Data Demystified
    Omid Vahdaty, Big Data Ninja
  • 55. Generally speaking, be familiar with:
    1. When to normalize/denormalize
    2. Columnar vs. row based
    3. Storage types: AVRO/Parquet/ORC
    4. Compression
    5. Complex data types
    6. When to partition? How much?
    7. Processing int is about 10x faster than processing string
    8. What is the fastest DB in the world? [Hint: what is your use case?]
    9. Network latency from client to server?
    10. Encryption at rest / in motion = performance impact?
  • 56. Tips for external / local tables: when to use which
    Use a local table when:
    1. You use only an analytics DB such as Redshift and don't need a data lake.
    2. The data is small and temporary; insert takes time.
    3. Performance is everything (be prepared to pay 5 to 10 times more on external).
    4. You need to insert temporary results of your query and there is no option to write to an external table (Hive supports writing to an external table, but Athena doesn't).
    Use an external table when:
    1. Cost is an issue: use transient clusters.
    2. Your data is already on S3.
    3. You need several DBs for several use cases and want to avoid inserts/ETLs.
    4. You want to decouple compute power from storage power, e.g. Athena and Spectrum: infinite compute, infinite storage, and you pay only for what you use.
  • 57. Comparison: Redshift | Redshift Spectrum | Hive | Athena
    ● Cost: high | low | medium | low
    ● Performance: top 10 in the world | fast enough... | slow... | fast enough...
    ● Syntax: Postgres | Postgres | Hive | Presto
    ● Data types advantages: no arrays | no arrays | complex data types | complex data types
    ● Storage type: columnar | columnar | columnar and row | columnar and row
    ● Use case: joins, traditional DBMS, analytics (joins, AGG, order by) | aggregations ONLY | transformation, advanced parsing, transient clusters | ad hoc querying, not for big data
    ● Anti pattern: temporary cluster | joins | joins, quick and dirty, simplicity | joins / big data / inserts
  • 58. Performance Tips for modeling
    1. Choose the correct partition. [dt? win time?]
    2. Big data anti pattern: usage of join... Use a flat table whenever possible. Easier to calculate.
    3. Static raw data (one-time job as data enters) = precalculate what you need on a daily basis = storage is cheap...
      a. Lookup tables: convert to int when possible, or even boolean if it exists. Don't use "like".
      b. datepart of wintime = can you precalculate it into the fact table?
      c. Minimize likes.
      d. String to int/boolean/bit when possible.
      e. case = can you precalculate it into the fact table?
      f. coalesce = can you precalculate it into the fact table?
      g. Calculate the group by of index (bluekai/gaid) values before the join, in a separate job --> reduces the running time of the join.
    4. Dynamic data (recurring daily job): compute is expensive...
      a. Filter data by the same time interval across all fact tables.
      b. Filter out rows that are not needed across all fact tables.
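The tip about calculating the group by before the join can be illustrated with a small Python sketch. The data and names are hypothetical; the point is only that aggregating per key first shrinks the input of the join while giving the same result.

```python
from collections import Counter

# Hypothetical fact rows: (gaid, amount). The lookup maps gaid -> segment.
events = [
    ("gaid-1", 10),
    ("gaid-1", 5),
    ("gaid-2", 7),
    ("gaid-1", 1),
    ("gaid-2", 3),
]
segments = {"gaid-1": "sports", "gaid-2": "news"}

# Join first: every raw event goes through the join, then we aggregate.
joined_first = Counter()
for gaid, amount in events:
    joined_first[segments[gaid]] += amount

# Aggregate first (separate job): collapse events per key ...
per_key = Counter()
for gaid, amount in events:
    per_key[gaid] += amount
# ... then join the much smaller intermediate result.
aggregated_first = Counter()
for gaid, total in per_key.items():
    aggregated_first[segments[gaid]] += total

rows_into_join_before = len(events)
rows_into_join_after = len(per_key)
print(rows_into_join_before, rows_into_join_after)
```

This works because the aggregate here is a sum, which can be computed in stages; aggregates such as distinct counts need more care.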
  • 59. If you must join...
    1. Notice the order of the tables in the join: from small to big.
    2. Filter as much as possible.
    3. Use only the columns you must.
    4. Use explain to understand the query you are writing.
    5. Use explain to minimize rows (small table X small table = maybe equals big table).
    6. Copy small tables to all data nodes (Redshift/Hive).
    7. Use hints if possible.
    8. Divide the job into smaller atomic steps.
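Several of these tips (small table first, filter early, keep only needed columns, copy the small table to every node) come together in a broadcast-style hash join. The following Python sketch shows the idea on in-memory lists; it is not how Hive or Redshift implement it, and all names and data are hypothetical.

```python
def broadcast_hash_join(small, big, key, big_filter, big_columns):
    """Join a small table to a big one using an in-memory lookup.

    The small table is loaded into a dict once (the part that would be
    copied to every node). The big table is streamed: rows are filtered
    first, and only the requested columns are kept.
    """
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    for row in big:
        if not big_filter(row):
            continue  # filter as early as possible
        for match in lookup.get(row[key], []):
            projected = {c: row[c] for c in big_columns}
            yield {**match, **projected}


# Hypothetical tables.
countries = [
    {"country_id": 1, "name": "IL"},
    {"country_id": 2, "name": "US"},
]
clicks = [
    {"country_id": 1, "dt": "2018-05-15", "url": "/a", "agent": "x"},
    {"country_id": 2, "dt": "2018-05-14", "url": "/b", "agent": "y"},
    {"country_id": 2, "dt": "2018-05-15", "url": "/c", "agent": "z"},
    {"country_id": 9, "dt": "2018-05-15", "url": "/d", "agent": "w"},
]
result = list(
    broadcast_hash_join(
        countries,
        clicks,
        key="country_id",
        big_filter=lambda r: r["dt"] == "2018-05-15",
        big_columns=["url"],
    )
)
for row in result:
    print(row)
```

The big table is read exactly once and never held in memory, which is why the smaller side should be the one that gets copied.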
  • 60. Tips to avoid join
    1. Use flat tables with many columns: storage is cheap.
    2. Use complex data types such as arrays and nested arrays.
  • 61. Hive tuning tips
    1. Avoid ORDER BY if possible…
    2. Minimize reduces.
    3. Suggested config:
      a. set hive.exec.parallel=true;
      b. set hive.exec.parallel.thread.number=24;
      c. set hive.tez.container.size=4092; (check this one carefully)
      d. set hive.support.concurrency=true;
      e. set hive.exec.compress.output=true;
      f. set hive.exec.compress.intermediate=true;
      g. set mapred.compress.map.output=true;
      h. set hive.execution.engine=mr;
  • 62. Redshift Tips
    ● https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
    ● Distribution style
    ● Sort key: identify the sort key column, which acts like an index in other databases
    ● COMPOUND sort keys
    ● Caching: disk vs. RAM
  • 63. Performance Summary
    ● Partitions…
    ● External table vs. local table
    ● Flat tables + complex data types vs. join
    ● Compression
    ● Columnar → Parquet
  • 64. Lecture summary: starting with big data?
    ● Start with Athena.
    ● Already have Redshift? Consider Spectrum.
    ● Use EMR Hive to transform any structured or semi-structured data to Parquet.
    ● Fully nested? Consider AVRO.
  • 65. Complex Q&A from the audience: post-lecture notes
    ● When to use Redshift? And when to use EMR (Spark SQL, Hive, Presto)?
      ○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/when-should-we-emr-and-when-to-use-redshift/
    ● Cost reduction on Athena:
      ○ https://amazon-aws-big-data-demystified.ninja/2018/06/03/cost-reduction-on-athena/
  • 66. Stay in touch...
    ● Omid Vahdaty
    ● +972-54-2384178
    ● https://amazon-aws-big-data-demystified.ninja/
    ● Join our meetup, FB group and YouTube channel
      ○ https://www.meetup.com/AWS-Big-Data-Demystified/
      ○ https://www.facebook.com/groups/amazon.aws.big.data.demystified/
      ○ https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber