SQL User Group
Presentation Transcript

    • “Big Data” for the SQL Professional
      Stefan Bauer
    • stef-bauer.com/2012/12/10/you-need-a-zetta-what
    • A little about me…
      – Data Warehouse Administrator
      – Author
      – Architect (logical/physical)
      – DBA (monitoring, space management, etc.)
      – SSIS Developer (build it… run it… support it)
      – SSAS/SSRS (performance tuning, supporting)
      – Performance monitoring (is it all working?)
      – I am a geek (some people have pointed that out about me… judge for yourself)
    • What we will cover
      – Why do you care (or at least why you should)?
      – General overview
      – Basic terms (to get us on the same page)
      – A look at some of the technology (aka demo)
      – Elastic MapReduce (EMR) job flow using a HiveQL script
      – Redshift: starting a cluster
      – All of the technical parts are in a multi-part series on my blog
    • What kind of blocks do you sort through?
      – Interesting technology… might not be for you
      – Getting there… might be something interesting to start working out the details
      – You have big data… and you know it!
    • What is that Hadoop thing I keep hearing about?
      – A framework (a collection of technologies)
      – Complex processing
      – Massively parallel
      – Large amounts of data
      – Commodity hardware
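      The core idea behind the framework is the MapReduce model: a map step emits key/value pairs in parallel, and a reduce step aggregates them by key. A minimal single-process Python sketch of that model (the input lines are hypothetical, not from the talk):

      ```python
      from collections import defaultdict
      from itertools import chain

      def map_phase(line):
          # Map: emit a (word, 1) pair for every word in one input line
          return [(word, 1) for word in line.split()]

      def reduce_phase(pairs):
          # Shuffle/reduce: group pairs by key, then sum each group's values
          counts = defaultdict(int)
          for word, n in pairs:
              counts[word] += n
          return dict(counts)

      lines = ["big data big cluster", "big data"]
      result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
      print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
      ```

      In Hadoop the same two phases run across many data nodes, with the framework handling the shuffle between them.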
    • Hadoop… what it is not
      – Ad hoc analytics
      – Low latency between data arrival, analysis, and query usage
      – “Fast” (speed is a relative thing; Facebook has interactive queries on the Hadoop framework)
      – Good for small data
    • Terms
      – Cloud
      – Cluster
      – Hadoop
      – Hadoop Distributed File System (HDFS)
      – Hue (web interface for MapReduce/Oozie)
      – MapReduce
      – Job Tracker
      – Task Trackers (on data nodes)
      – Oozie (workflow management)
    • Terms
      – Pig (distributed transformation scripting)
      – Beeswax (wrapper for Hive)
      – Hive: EDW on 10's, 100's, 1000's of servers
      – HiveQL (based on ANSI SQL)
      – Reporting tools/business analytics
      – Name Node
      – Data Nodes
      – ZooKeeper (distributed configuration management)
      – Cloudera/MapR/Amazon/Hortonworks …
    • HDFS
    • Cloudera
    • Hive
    • HiveQL

      add jar s3://testing-royall-com/hive/libs/json-serde-1.1.6.jar;

      CREATE EXTERNAL TABLE log_table (
        message_in string,
        level_in int,
        ip_in string,
        type_in string,
        timestamp_in string,
        id_in string,
        pid_in string,
        src_in struct<classname:string, linenumber:int>
      )
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH SERDEPROPERTIES (
        "mapping.type_in" = "@type",
        "mapping.message_in" = "__message",
        "mapping.level_in" = "__level",
        "mapping.ip_in" = "__ip",
        "mapping.src_in" = "__src",
        "mapping.timestamp_in" = "@timestamp",
        "mapping.id_in" = "__id",
        "mapping.pid_in" = "__pid",
        "ignore.malformed.json" = "true")
      LOCATION '${INPUT}';

      CREATE TABLE output_tbl (type string, cnt int)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '${OUTPUT}';

      INSERT OVERWRITE TABLE output_tbl
      SELECT type_in, count(*) AS cnt
      FROM log_table
      GROUP BY type_in;
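      The SERDEPROPERTIES above simply rename raw JSON keys (like `@type`) to Hive-friendly column names. A small Python sketch of the same renaming, using a hypothetical log record (not from the talk):

      ```python
      import json

      # Hypothetical raw log record, in the JSON shape the serde mappings imply
      raw = '{"@type": "error", "__message": "disk full", "__level": 3}'

      # The same key-to-column renaming the SERDEPROPERTIES express
      mapping = {"@type": "type_in", "__message": "message_in", "__level": "level_in"}

      record = {mapping[k]: v for k, v in json.loads(raw).items() if k in mapping}
      print(record)  # {'type_in': 'error', 'message_in': 'disk full', 'level_in': 3}
      ```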
    • HiveQL

      ADD JAR s3://elasticmapreduce/training/lib/hive-contrib-0.8.0.jar;

      CREATE EXTERNAL TABLE wikipedia (
        edittime string,
        contributor string
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH serdeproperties (
        "input.regex" = ".*<revision>.*<timestamp>(.+)</timestamp>.*<contributor>.*<username>(.*)</username>.*</contributor>.*</revision>.*",
        "output.format.string" = "%1$s %2$s"
      )
      LOCATION '${INPUT}';
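      The RegexSerDe turns each line of raw Wikipedia revision XML into columns via the two capture groups. The same pattern can be checked in plain Python against a hypothetical revision fragment (the sample string below is invented for illustration):

      ```python
      import re

      # The input.regex from the slide, with the two capture groups
      pattern = (r".*<revision>.*<timestamp>(.+)</timestamp>.*<contributor>.*"
                 r"<username>(.*)</username>.*</contributor>.*</revision>.*")

      # Hypothetical one-line revision fragment
      sample = ("<revision><timestamp>2012-12-10T00:00:00Z</timestamp>"
                "<contributor><username>StefBauer</username></contributor></revision>")

      m = re.match(pattern, sample)
      print(m.group(1), m.group(2))  # 2012-12-10T00:00:00Z StefBauer
      ```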
    • Hiveql  Demo – Create/Run EMR  Demo – Create Redshift cluster CREATE TABLE big_contributors (contributor string, numedits int) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' STORED AS TEXTFILE LOCATION '${OUTPUT}' ; INSERT OVERWRITE TABLE big_contributors SELECT contributor, COUNT(*) AS numedits FROM wikipedia GROUP BY contributor SORT BY numedits DESC LIMIT 20 ;
    • Redshift
      – What is a column store anyway?
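      A quick sketch of the row-store vs. column-store distinction: a row store keeps each record's values together, while a column store keeps each column's values together, so an aggregate over one column touches only that column's data. Minimal Python illustration with invented rows:

      ```python
      # Hypothetical rows as a row store would lay them out
      rows = [
          {"id": 1, "type": "error", "bytes": 512},
          {"id": 2, "type": "info",  "bytes": 128},
          {"id": 3, "type": "error", "bytes": 640},
      ]

      # Pivot to a column-store layout: one contiguous list per column.
      # An aggregate like SUM(bytes) now reads only the "bytes" list.
      columns = {key: [r[key] for r in rows] for key in rows[0]}
      print(sum(columns["bytes"]))  # 1280
      ```

      Grouping a column's values together also makes them compress far better, which is the next slide's point.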
    • Compression: 8K / 64K / 1MB
    • Copy Data
      – From S3… (or DynamoDB)

      copy <table name>
      from 's3://<s3 file>'
      credentials 'aws_access_key_id=<yourkey>;aws_secret_access_key=<yourkey>'
      CSV delimiter '|';
    • Check back on the demos…
    • Questions? @stefbauer Stef_Bauer@hotmail.com Stef-Bauer.com http://spkr8.com/t/25821