Globalcode	– Open4education
Big Data – SQL In The Big Data Era
Rafael Aguiar
Data Science Engineer @InLocoMedia
Agenda
Context
Definition of Big Data
A map of the ecosystem
Apache Hive
Apache Hue
Where to start
Mobile ad network based on high-precision (1-3 m) location
Terabytes of compressed data per month
How can we understand visit patterns?
How can we recommend better ads?
Big Data
“Datasets whose size is beyond the ability of typical database
software tools to capture, store, manage, and analyze.”
McKinsey (2011)
Ecosystem
Apache Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
It provides:
Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
A mechanism to impose structure on a variety of data formats.
Query execution via Apache Tez, Apache Spark, or MapReduce.
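The "mechanism to impose structure" point is often described as schema-on-read: the files stay as raw text, and column names and types are applied only when the data is read. The sketch below illustrates that idea in plain Python. It is not how Hive is implemented; the `SCHEMA` list, the `read_with_schema` helper, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical schema: (column name, converter applied at read time).
SCHEMA = [
    ("name", str),
    ("age", int),
    ("likes_beer", lambda value: value.strip().lower() == "true"),
]


def read_with_schema(raw_text, schema):
    """Yield one dict per CSV line, converting each field with the schema."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {
            name: convert(value)
            for (name, convert), value in zip(schema, fields)
        }


# The raw data carries no type information; the schema supplies it on read.
RAW = "ana,31,true\nbruno,27,false\n"
rows = list(read_with_schema(RAW, SCHEMA))
```

Changing the schema changes how the same bytes are interpreted, without rewriting the stored file.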
Apache Hive
When to use Hive?
You already know SQL and want to start processing large datasets without a steep learning curve.
You need to run a job quickly and do not have time to write clean, optimized code.
Apache Hive
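-- Table over comma-separated text files.
-- Note: OpenCSVSerde exposes every column as STRING when the data is read,
-- so the non-string types declared below may not be honored as written.
CREATE TABLE tdc_participants
(
  name       STRING,
  age        INT,
  skills     ARRAY<STRING>,
  likes_beer BOOLEAN,
  home_town  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "'",
  -- Backslash is the SerDe's default escape character.
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

-- Participants per home town who list "big-data" as a skill and like beer.
SELECT home_town, count(*)
FROM tdc_participants
WHERE array_contains(skills, "big-data")
  AND likes_beer = TRUE
GROUP BY home_town;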
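To make the query's semantics concrete, here is a minimal Python sketch of the same filter and group-by over in-memory rows. The sample participants and the `count_by_home_town` helper are hypothetical. This only illustrates what the SELECT computes, not how Hive executes it.

```python
from collections import Counter

# Hypothetical sample rows mirroring the tdc_participants columns.
participants = [
    {"name": "Ana", "age": 31, "skills": ["big-data", "sql"], "likes_beer": True, "home_town": "Recife"},
    {"name": "Bruno", "age": 27, "skills": ["big-data"], "likes_beer": True, "home_town": "Recife"},
    {"name": "Carla", "age": 35, "skills": ["big-data", "java"], "likes_beer": True, "home_town": "Olinda"},
    {"name": "Davi", "age": 22, "skills": ["java"], "likes_beer": True, "home_town": "Olinda"},
    {"name": "Eva", "age": 40, "skills": ["big-data"], "likes_beer": False, "home_town": "Recife"},
]


def count_by_home_town(rows, skill="big-data"):
    """Count rows per home_town where skills contains `skill` and likes_beer is true."""
    return Counter(
        row["home_town"]
        for row in rows
        if skill in row["skills"] and row["likes_beer"]  # the WHERE clause
    )


counts = count_by_home_town(participants)
```

With this sample data, Davi is dropped by the skill filter and Eva by the beer filter, so only three rows reach the count.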
Apache Hive
-- Register the Esri spatial UDFs (the Esri Hive jars must already be available to the session).
CREATE TEMPORARY FUNCTION st_linestring AS "com.esri.hadoop.hive.ST_LineString";
CREATE TEMPORARY FUNCTION st_setsrid AS "com.esri.hadoop.hive.ST_SetSRID";
CREATE TEMPORARY FUNCTION st_geodesiclengthwgs84 AS "com.esri.hadoop.hive.ST_GeodesicLengthWGS84";

CREATE TABLE location (id STRING, lat DOUBLE, lng DOUBLE, epoch BIGINT) {...};

-- Query parameters, substituted textually into the query below.
-- MAX_DISTANCE is compared against a geodesic length in meters; 4326 is the WGS84 spatial reference.
SET hivevar:PLACE_OF_INTEREST = named_struct("lat", 1.0, "lng", 1.0);
SET hivevar:MAX_DISTANCE = 10;
SET hivevar:SPATIAL_REF_ID = 4326;

-- Count distinct ids seen within MAX_DISTANCE of the place of interest.
SELECT count(DISTINCT id)
FROM location
WHERE location.lat IS NOT NULL
  AND location.lng IS NOT NULL
  AND st_geodesiclengthwgs84(
        st_setsrid(
          st_linestring(
            ${hivevar:PLACE_OF_INTEREST}.lng,
            ${hivevar:PLACE_OF_INTEREST}.lat,
            location.lng,
            location.lat),
          ${hivevar:SPATIAL_REF_ID})) < ${hivevar:MAX_DISTANCE};
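The distance filter in this query can be sanity-checked outside Hive. The sketch below counts distinct ids within a radius using the haversine formula. Haversine is a spherical approximation swapped in for illustration, whereas the Esri function measures along the WGS84 spheroid, so results can differ slightly. The sample rows, the `haversine_m` and `count_ids_near` helpers, and the radius constant are assumptions.

```python
import math

# Mean Earth radius in meters (spherical approximation, not the WGS84 spheroid).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_ids_near(rows, place, max_distance_m):
    """Count distinct ids with non-null coordinates closer than max_distance_m."""
    ids = {
        row["id"]
        for row in rows
        if row["lat"] is not None
        and row["lng"] is not None
        and haversine_m(place["lat"], place["lng"], row["lat"], row["lng"]) < max_distance_m
    }
    return len(ids)


PLACE_OF_INTEREST = {"lat": 1.0, "lng": 1.0}

# Hypothetical sample rows mirroring the location table.
locations = [
    {"id": "a", "lat": 1.0, "lng": 1.0, "epoch": 0},        # at the place itself
    {"id": "a", "lat": 1.00001, "lng": 1.0, "epoch": 60},   # about 1.1 m away, same id
    {"id": "b", "lat": 1.00005, "lng": 1.0, "epoch": 0},    # about 5.6 m away
    {"id": "c", "lat": 1.001, "lng": 1.0, "epoch": 0},      # about 111 m away
    {"id": "d", "lat": None, "lng": 1.0, "epoch": 0},       # missing coordinate
]

nearby = count_ids_near(locations, PLACE_OF_INTEREST, 10)
```

As in the query, rows with a missing coordinate are skipped, and an id that appears several times inside the radius is counted once.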
Apache Hue
http://demo.gethue.com/
Where to start
https://hive.apache.org/
http://gethue.com/
Programming Hive, by Edward Capriolo
https://github.com/Prokopp/the-free-hive-book
Rafael Aguiar
rafael@inlocomedia.com
@rafadaguiar
#TDCHive
Thank you!

TDC2016SP - Big Data Track