SQL User Group


  1. “Big Data” for the SQL Professional (Stefan Bauer)
  2. stef-bauer.com/2012/12/10/you-need-a-zetta-what
  3. A little about me… • Data Warehouse Administrator • Author • Architect (logical/physical) • DBA (monitoring, space management, etc.) • SSIS Developer (build it… run it… support it) • SSAS/SSRS (performance tuning, supporting) • Performance monitoring (is it all working?) • I am a geek (some people have pointed that out about me… judge for yourself)
  4. What we will cover • Why do you care (or at least why you should)? • General overview • Basic terms (to get us on the same page) • A look at some of the technology (aka demo) • Elastic MapReduce (EMR) job flow using a HiveQL script • Redshift: starting a cluster • All of the technical parts are in a multi-part series on my blog
  5. What kind of blocks do you sort through? • Interesting technology… might not be for you • Getting there… might be something interesting to start working out the details • You have big data… and you know it!
  6. What is that Hadoop thing I keep hearing about? • A framework (a collection of technologies) • Complex processing • Massively parallel • Large amounts of data • Commodity hardware
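As a mental model for the SQL crowd, the map/shuffle/reduce pattern that the framework distributes across a cluster can be sketched in plain Python. This is an illustrative word count; the function names and sample records are mine, not Hadoop API calls:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (key, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key (Hadoop does this across nodes).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Hadoop's value is that each phase runs in parallel on many machines over data too large for one box; the logic itself is this simple.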
  7. Hadoop… what it is not • Ad hoc analytics • Low latency between data arrival, analysis, and query usage • “Fast” (speed is a relative thing; Facebook has interactive queries on the Hadoop framework) • Good for small data
  8. Terms • Cloud • Cluster • Hadoop • Hadoop Distributed File System (HDFS) • Hue (web interface for MapReduce/Oozie) • MapReduce • Job Tracker • Task Trackers (on data nodes) • Oozie (workflow management)
  9. Terms • Pig (distributed transformation scripting) • Beeswax (wrapper for Hive) • Hive: EDW on 10s, 100s, 1000s of servers • HiveQL (based on ANSI SQL) • Reporting tools/business analytics • Name Node • Data Nodes • ZooKeeper (distributed configuration management) • Cloudera/MapR/Amazon/Hortonworks…
  10. HDFS
  11. Cloudera
  12. Hive
  13. HiveQL

      add jar s3://testing-royall-com/hive/libs/json-serde-1.1.6.jar;

      CREATE EXTERNAL TABLE log_table (
        message_in string, level_in int, ip_in string, type_in string,
        timestamp_in string, id_in string, pid_in string,
        src_in struct<classname:string, linenumber:int>
      )
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH SERDEPROPERTIES (
        "mapping.type_in" = "@type",
        "mapping.message_in" = "__message",
        "mapping.level_in" = "__level",
        "mapping.ip_in" = "__ip",
        "mapping.src_in" = "__src",
        "mapping.timestamp_in" = "@timestamp",
        "mapping.id_in" = "__id",
        "mapping.pid_in" = "__pid",
        "ignore.malformed.json" = "true")
      LOCATION '${INPUT}';

      CREATE TABLE output_tbl (type string, cnt int)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '${OUTPUT}';

      INSERT OVERWRITE TABLE output_tbl
      SELECT type_in, count(*) AS cnt
      FROM log_table
      GROUP BY type_in;
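The SERDEPROPERTIES mappings above rename JSON fields ("@type" becomes type_in, and so on) and, via "ignore.malformed.json", skip bad rows instead of failing. A conceptual sketch of that behavior in plain Python, for a hypothetical subset of the fields — this is not the SerDe's actual code:

```python
import json

# Mirrors a subset of the "mapping.*" SERDEPROPERTIES entries above.
FIELD_MAP = {"@type": "type_in", "__message": "message_in", "__level": "level_in"}

def parse_log_line(line):
    # Like "ignore.malformed.json" = "true": unparseable rows are dropped,
    # not raised as errors.
    try:
        raw = json.loads(line)
    except ValueError:
        return None
    # Missing source fields become NULL (None), as a SerDe would surface them.
    return {col: raw.get(src) for src, col in FIELD_MAP.items()}

row = parse_log_line('{"@type": "error", "__message": "disk full", "__level": 3}')
print(row)  # {'type_in': 'error', 'message_in': 'disk full', 'level_in': 3}
```

The point: Hive never requires you to reshape the raw files; the SerDe performs this translation at query time.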
  14. HiveQL

      ADD JAR s3://elasticmapreduce/training/lib/hive-contrib-0.8.0.jar;

      CREATE EXTERNAL TABLE wikipedia (
        edittime string,
        contributor string
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = ".*<revision>.*<timestamp>(.+)</timestamp>.*<contributor>.*<username>(.*)</username>.*</contributor>.*</revision>.*",
        "output.format.string" = "%1$s %2$s"
      )
      LOCATION '${INPUT}';
  15. HiveQL • Demo: create/run an EMR job flow • Demo: create a Redshift cluster

      CREATE TABLE big_contributors (contributor string, numedits int)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '${OUTPUT}';

      INSERT OVERWRITE TABLE big_contributors
      SELECT contributor, COUNT(*) AS numedits
      FROM wikipedia
      GROUP BY contributor
      SORT BY numedits DESC
      LIMIT 20;
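For SQL readers, the INSERT … GROUP BY above is ordinary aggregation that Hive happens to execute as MapReduce jobs. The same top-contributors logic in plain Python, over made-up sample rows standing in for the wikipedia table:

```python
from collections import Counter

# Sample (edittime, contributor) rows standing in for the wikipedia table.
wikipedia = [("2012-01-01", "alice"), ("2012-01-02", "bob"),
             ("2012-01-03", "alice"), ("2012-01-04", "alice"),
             ("2012-01-05", "bob"), ("2012-01-06", "carol")]

# SELECT contributor, COUNT(*) AS numedits ... GROUP BY contributor
# SORT BY numedits DESC LIMIT 20
numedits = Counter(contributor for _, contributor in wikipedia)
big_contributors = numedits.most_common(20)
print(big_contributors)  # [('alice', 3), ('bob', 2), ('carol', 1)]
```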
  16. Redshift: what is a column store anyway?
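A minimal sketch of the idea: a row store keeps each record's values together, while a column store keeps each column contiguous, so a single-column scan reads only that column's blocks and runs of repeated values compress well. The sample table and the naive run-length encoder here are mine, purely for illustration:

```python
# Row store: values for one record sit together; scanning one column
# still has to read every record.
rows = [(1, "widget", 9.99), (2, "gadget", 4.50), (3, "widget", 9.99)]

# Column store: each column is its own contiguous array.
columns = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# An aggregate over one column touches only that column's data ...
total = sum(columns["price"])

# ... and sorted, repetitive columns compress well, e.g. naive
# run-length encoding collapses each run of equal values to (value, count):
def rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(rle(sorted(columns["name"])))  # [['gadget', 1], ['widget', 2]]
```

Redshift applies the same two ideas (per-column I/O and per-column compression encodings) at 1 MB block granularity.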
  17. Compression: 8 KB / 64 KB / 1 MB
  18. Copy Data • From S3… (or DynamoDB)

      copy <table name>
      from 's3://<s3 file>'
      credentials 'aws_access_key_id=<yourkey>;aws_secret_access_key=<yourkey>'
      csv delimiter '|';
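COPY reads delimiter-separated files from S3, so the producing side just has to write pipe-delimited text. A small Python sketch of building such a payload (the rows and columns are hypothetical; in practice the file would be uploaded to S3 before COPY runs):

```python
import csv
import io

# Build a pipe-delimited CSV payload of the shape the COPY command expects.
# The rows here are made-up sample data.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")
writer.writerow([1, "alice", "2012-12-10"])
writer.writerow([2, "bob|smith", "2012-12-11"])  # embedded delimiter gets quoted
payload = buf.getvalue()
print(payload)
```

Note the second row: the csv module quotes a field containing the delimiter, which is why loading with the CSV option (rather than a bare DELIMITER) handles such values correctly.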
  19. Check back on the demos…
  20. Questions? • @stefbauer • Stef_Bauer@hotmail.com • Stef-Bauer.com • http://spkr8.com/t/25821
