SQLBits XI - ETL with Hadoop


My SQLBits XI presentation about Hadoop, MapReduce and Hive

Published in: Technology

  1. ETL with Hadoop and MapReduce (Jan Pieter Posthuma, Inter Access)
  2. Introduction
     • Jan Pieter Posthuma
     • Technical Lead Microsoft BI and Big Data consultant
     • Inter Access, local consultancy firm in the Netherlands
     • Architect role at multiple projects
     • Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
     • http://twitter.com/jppp
     • http://linkedin.com/jpposthuma
     • jan.pieter.posthuma@interaccess.nl
  3. Expectations
     What to cover
     • Simple ETL, so simple sources
     • Different way to achieve the result
     What not to cover
     • Big Data best practices
     • Deep Hadoop internals
  4. Agenda
     • Hadoop
     • HDFS
     • Map/Reduce
       – Demo
     • Hive and Pig
       – Demo
     • Polybase
  5. Hadoop
     • Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware.
     • Widely accepted by database vendors as a solution for unstructured data
     • Microsoft partners with Hortonworks and delivers their Hadoop Data Platform as Microsoft HDInsight
     • Available on premise and as an Azure service
     • Hortonworks Data Platform (HDP) is 100% Open Source!
  6. Hadoop
     [Diagram: Microsoft big data architecture. Labels shown:]
     • Sources: source systems (ERP, CRM, LOB apps); big data sources (raw, unstructured: crawlers, bots, devices, sensors); historical data (beyond active window); Azure Market Place
     • Processing: HDInsight on Windows Azure; HDInsight on Windows Server; SQL Server StreamInsight (alerts, notifications); enterprise ETL with SSIS, DQS, MDS (integrate/enrich, fast load, summarize & load)
     • Storage: SQL Server FTDW data marts; SQL Server Parallel Data Warehouse
     • Delivery: SQL Server Reporting Services; SQL Server Analysis Services; business insights, interactive reports, performance scorecards; data & compute intensive applications

     Statement shown on the slide:
         CREATE EXTERNAL TABLE Customer
         WITH (LOCATION = 'hdfs://', FORMAT_OPTIONS (FIELDS_TERMINATOR = ','))
         AS SELECT * FROM DimCustomer
  7. Hadoop
     • HDFS – distributed, fault tolerant file system
     • MapReduce – framework for writing/executing distributed, fault tolerant algorithms
     • Hive & Pig – SQL-like declarative languages
     • Sqoop/PolyBase – package for moving data between HDFS and relational DB systems
     • + Others…
     [Diagram: Hadoop stack of HDFS, Map/Reduce, Hive & Pig, Sqoop / Polybase, Avro (Serialization), HBase and Zookeeper, connected to ETL Tools, BI Reporting and RDBMS]
  8. HDFS
     Files are composed of a set of blocks
     • Typically 64MB in size
     • Each block is stored as a separate file in the local file system (e.g. NTFS)
     [Diagram: a large file of 6440MB, with block size = 64MB, is split into Block 1 … Block 101: Blocks 1 to 100 are 64MB each and Block 101 is 40MB. The blocks are color-coded.]
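The block arithmetic on this slide can be checked with a few lines of JavaScript. This is only a sketch of the bookkeeping, not of HDFS itself: `splitIntoBlocks` and its return shape are hypothetical names chosen for the illustration, and the sizes are the ones from the slide (6440MB file, 64MB block size).

```javascript
// Toy model of how a file is cut into fixed-size blocks.
// splitIntoBlocks is a hypothetical helper, not an HDFS API.
const MB = 1024 * 1024;

function splitIntoBlocks(fileSizeBytes, blockSizeBytes) {
  if (blockSizeBytes <= 0) throw new RangeError('block size must be positive');
  const blocks = [];
  let offset = 0;
  while (offset < fileSizeBytes) {
    // Every block is full-sized except possibly the last one.
    const length = Math.min(blockSizeBytes, fileSizeBytes - offset);
    blocks.push({ index: blocks.length + 1, offset: offset, length: length });
    offset += length;
  }
  return blocks;
}

const blocks = splitIntoBlocks(6440 * MB, 64 * MB);
console.log(blocks.length);                         // 101
console.log(blocks[blocks.length - 1].length / MB); // 40
```

The last block only occupies the 40MB it needs, which matches the 64MB … 64MB, 40MB layout in the diagram.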
  9. HDFS
     [Diagram: a NameNode and a BackupNode (namespace backups) above five DataNodes (nodes write to local disk); heartbeat, balancing, replication, etc. flow between them]
     • HDFS was designed with the expectation that failures (both hardware and software) would occur frequently
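To make the "failures will occur" design point concrete, here is a toy model of what replication buys you when a DataNode disappears. It is a sketch under stated assumptions: `handleNodeFailure`, the node names and the block map are all hypothetical, the replication target of 3 is the HDFS default rather than something stated on the slide, and real HDFS placement also considers racks and disk usage.

```javascript
// Toy model of the NameNode's block map: block id -> DataNodes holding a replica.
// All names here are hypothetical; this is not the HDFS API.
const REPLICATION = 3; // assumed target (the HDFS default)

function handleNodeFailure(blockMap, liveNodes, failedNode) {
  const alive = liveNodes.filter((n) => n !== failedNode);
  const copies = []; // planned re-replications: { block, from, to }
  for (const [block, holders] of Object.entries(blockMap)) {
    const remaining = holders.filter((n) => n !== failedNode);
    for (const candidate of alive) {
      if (remaining.length >= REPLICATION) break; // target met again
      if (remaining.length === 0) break;          // all replicas lost, nothing to copy from
      if (!remaining.includes(candidate)) {
        copies.push({ block: block, from: remaining[0], to: candidate });
        remaining.push(candidate);
      }
    }
    blockMap[block] = remaining;
  }
  return copies;
}

const nodes = ['dn1', 'dn2', 'dn3', 'dn4', 'dn5'];
const blockMap = {
  block1: ['dn1', 'dn2', 'dn3'],
  block2: ['dn2', 'dn4', 'dn5'],
};
// dn2 stops sending heartbeats, so both blocks drop to two replicas.
const plan = handleNodeFailure(blockMap, nodes, 'dn2');
console.log(plan);
console.log(blockMap);
```

After the call every block is back at three replicas on live nodes, without any client noticing that dn2 is gone.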
  10. Map/Reduce
      • Programming framework (library and runtime) for analyzing data sets stored in HDFS
      • MR framework provides all the "glue" and coordinates the execution of the Map and Reduce jobs on the cluster.
        – Fault tolerant
        – Scalable
      Map function:
          var map = function(key, value, context) {}
      Reduce function:
          var reduce = function(key, values, context) {}
  11. Map/Reduce
      [Diagram: data flow of a Map/Reduce job]
      • Mappers run on the DataNodes; each Mapper reads <key_i, value_i> pairs and emits <keyA, value_a>, <keyB, value_b>, <keyC, value_c>, …
      • Sort and group by key
      • Each Reducer receives one key with all its values: <keyA, list(value_a, value_b, value_c, …)>, <keyB, list(value_a, value_b, value_c, …)>, <keyC, list(value_a, value_b, value_c, …)>
      • Reducers write the output
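The "sort and group by key" step in the middle of this diagram is done by the framework, so it never appears in your own map or reduce code. A minimal in-memory sketch of what it does, assuming made-up sample pairs and a hypothetical helper name `groupByKey`:

```javascript
// In-memory stand-in for the sort-and-group step between mappers and reducers.
// groupByKey and the sample data are hypothetical, for illustration only.
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  // Hand the keys to the reducers in sorted order.
  return [...groups.keys()].sort().map((key) => [key, groups.get(key)]);
}

// Combined output of several mappers, in arbitrary order.
const mapperOutput = [
  ['keyB', 2], ['keyA', 1],
  ['keyA', 3], ['keyC', 5],
  ['keyB', 4],
];

const reducerInput = groupByKey(mapperOutput);
// Each reducer call sees one key plus the list of all values for that key.
const sums = reducerInput.map(([key, values]) => [key, values.reduce((a, b) => a + b, 0)]);

console.log(JSON.stringify(reducerInput)); // [["keyA",[1,3]],["keyB",[2,4]],["keyC",[5]]]
console.log(JSON.stringify(sums));         // [["keyA",4],["keyB",6],["keyC",5]]
```

On a real cluster the grouped lists are partitioned over several reducers; here a single array plays that role.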
  12. Demo
      • Weather info: need daily max and min temperature per station

      Input file:
          # STN,YYYYMMDD,HH, DD,FH, FF,FX, T,T10,TD,SQ, Q,DR,RH, P,VV, N, U,WW,IX, M, R, S, O, Y
          #
          210,19510101, 1,200, , 93, ,-4, , , , , , ,9947, , 8, , 5, , , , , ,
          210,19510101, 2,190, ,108, , 1, , , , , , ,9937, , 8, , 5, , 0, 0, 0, 0, 0

      Map function:
          var map = function (key, value, context) {
              // Skip header and comment lines, which start with '#'
              if (value[0] != '#') {
                  var allValues = value.split(',');
                  // Field 7 (T) is the temperature; skip rows where it is empty
                  if (allValues[7].trim() != '') {
                      // Key: station-date, value: station,date,temperature
                      context.write(allValues[0] + '-' + allValues[1],
                          allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
                  }
              }
          };

      Output <key, value>:
          <"210-19510101", "210,19510101,-4">
          <"210-19510101", "210,19510101,1">
  13. Demo (cont.)
      Map Output <key, value>:
          <"210-19510101", "210,19510101,-4">
          <"210-19510101", "210,19510101,1">

      Reduce Input <key, values:=list(value1, …, valuen)>:
          <"210-19510101", {"210,19510101,-4", "210,19510101,1"}>

      Reduce function:
          var reduce = function (key, values, context) {
              var mMax = -9999;
              var mMin = 9999;
              var mKey = key.split('-');
              while (values.hasNext()) {
                  var mValues = values.next().split(',');
                  // Parse to a number first; comparing the raw strings would sort "10" before "9"
                  var mTemp = parseInt(mValues[2], 10);
                  mMax = mTemp > mMax ? mTemp : mMax;
                  mMin = mTemp < mMin ? mTemp : mMin;
              }
              // Tab-separated output: station, date, max, min
              context.write(key.trim(),
                  mKey[0].toString() + '\t' +
                  mKey[1].toString() + '\t' +
                  mMax.toString() + '\t' +
                  mMin.toString());
          };
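If you want to try the demo logic without a cluster, the map and reduce functions can be exercised with a small local harness. The sketch below is an assumption-heavy stand-in, not the HDInsight runtime: `runLocally` is a hypothetical helper, the `context` and `values` objects only mimic the `write`, `hasNext` and `next` calls used on the slides, and the map and reduce functions are compact re-statements of the demo (fields trimmed, temperatures parsed as integers).

```javascript
// Hypothetical local runner: map every line, group by key, reduce every group.
function runLocally(lines, map, reduce) {
  const mapped = [];
  lines.forEach((line, i) => map(i, line, { write: (k, v) => mapped.push([k, v]) }));

  const groups = new Map();
  for (const [k, v] of mapped) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }

  const output = [];
  for (const k of [...groups.keys()].sort()) {
    const list = groups.get(k);
    let pos = 0;
    // Iterator with the hasNext()/next() shape the demo's reduce function expects.
    const values = { hasNext: () => pos < list.length, next: () => list[pos++] };
    reduce(k, values, { write: (key, v) => output.push([key, v]) });
  }
  return output;
}

// Compact versions of the demo's functions.
const map = function (key, value, context) {
  if (value[0] !== '#') {
    const f = value.split(',');
    if (f.length > 7 && f[7].trim() !== '') {
      const station = f[0].trim();
      const day = f[1].trim();
      context.write(station + '-' + day, station + ',' + day + ',' + f[7].trim());
    }
  }
};

const reduce = function (key, values, context) {
  let max = -Infinity;
  let min = Infinity;
  while (values.hasNext()) {
    const t = parseInt(values.next().split(',')[2], 10);
    if (t > max) max = t;
    if (t < min) min = t;
  }
  const [station, day] = key.split('-');
  context.write(key, [station, day, max, min].join('\t'));
};

// The sample rows from the demo slide.
const lines = [
  '# STN,YYYYMMDD,HH, DD,FH, FF,FX, T,T10,TD,SQ, Q,DR,RH, P,VV, N, U,WW,IX, M, R, S, O, Y',
  '#',
  '210,19510101, 1,200, , 93, ,-4, , , , , , ,9947, , 8, , 5, , , , , ,',
  '210,19510101, 2,190, ,108, , 1, , , , , , ,9937, , 8, , 5, , 0, 0, 0, 0, 0',
];

const result = runLocally(lines, map, reduce);
console.log(result);
```

For the two sample rows this yields a single record for key 210-19510101 with max 1 and min -4, tab-separated.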
  14. Demo
  15. Hive and Pig
      Query: Find the sourceIP address that generated the most adRevenue along with its average pageRank

          Rankings (
              pageURL STRING,
              pageRank INT,
              avgDuration INT
          );

          UserVisits (
              sourceIP STRING,
              destURL STRING,
              visitDate DATE,
              adRevenue FLOAT,
              .. // fields omitted
          );

      Java MapReduce code shown on the slide:

          package edu.brown.cs.mapreduce.benchmarks;

          import java.util.*;
          import org.apache.hadoop.conf.*;
          import org.apache.hadoop.io.*;
          import org.apache.hadoop.mapred.*;
          import org.apache.hadoop.util.*;
          import org.apache.hadoop.mapred.lib.*;
          import org.apache.hadoop.fs.*;
          import edu.brown.cs.mapreduce.BenchmarkBase;

          public class Benchmark3 extends Configured implements Tool {

              public static String getTypeString(int type) {
                  if (type == 1) {
                      return ("UserVisits");
                  } else if (type == 2) {
                      return ("Rankings");
                  }
                  return ("INVALID");
              }

              /* (non-Javadoc)
               * @see org.apache.hadoop.util.Tool#run(java.lang.String[])
               */
              public int run(String[] args) throws Exception {
                  BenchmarkBase base = new BenchmarkBase(this.getConf(), this.getClass(), args);
                  Date startTime = new Date();
                  System.out.println("Job started: " + startTime);

                  // Phase #1
                  // -------------------------------------------
                  JobConf p1_job = base.getJobConf();
                  p1_job.setJobName(p1_job.getJobName() + ".Phase1");
                  Path p1_output = new Path(base.getOutputPath().toString() + "/phase1");
                  FileOutputFormat.setOutputPath(p1_job, p1_output);

                  //
                  // Make sure we have our properties
                  //
                  String required[] = { BenchmarkBase.PROPERTY_START_DATE,
                                        BenchmarkBase.PROPERTY_STOP_DATE };
                  for (String req : required) {
                      if (!base.getOptions().containsKey(req)) {
                          System.err.println("ERROR: The property " + req + " is not set");
                          System.exit(1);
                      }
                  } // FOR

                  p1_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
                  if (base.getSequenceFile()) p1_job.setOutputFormat(SequenceFileOutputFormat.class);
                  p1_job.setOutputKeyClass(Text.class);
                  p1_job.setOutputValueClass(Text.class);
                  p1_job.setMapperClass(base.getTupleData() ?
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableMap.class :
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextMap.class);
                  p1_job.setReducerClass(base.getTupleData() ?
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableReduce.class :
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextReduce.class);
                  p1_job.setCompressMapOutput(base.getCompress());

                  // Phase #2
                  // -------------------------------------------
                  JobConf p2_job = base.getJobConf();
                  p2_job.setJobName(p2_job.getJobName() + ".Phase2");
                  p2_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
                  if (base.getSequenceFile()) p2_job.setOutputFormat(SequenceFileOutputFormat.class);
                  p2_job.setOutputKeyClass(Text.class);
                  p2_job.setOutputValueClass(Text.class);
                  p2_job.setMapperClass(IdentityMapper.class);
                  p2_job.setReducerClass(base.getTupleData() ?
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TupleWritableReduce.class :
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TextReduce.class);
                  p2_job.setCompressMapOutput(base.getCompress());

                  // Phase #3
                  // -------------------------------------------
                  JobConf p3_job = base.getJobConf();
                  p3_job.setJobName(p3_job.getJobName() + ".Phase3");
                  p3_job.setNumReduceTasks(1);
                  p3_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
                  p3_job.setOutputKeyClass(Text.class);
                  p3_job.setOutputValueClass(Text.class);
                  //p3_job.setMapperClass(Phase3Map.class);
                  p3_job.setMapperClass(IdentityMapper.class);
                  p3_job.setReducerClass(base.getTupleData() ?
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TupleWritableReduce.class :
                      edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TextReduce.class);

                  //
                  // Execute #1
                  //
                  base.runJob(p1_job);

                  //
                  // Execute #2
                  //
                  Path p2_output = new Path(base.getOutputPath().toString() + "/phase2");
                  FileOutputFormat.setOutputPath(p2_job, p2_output);
                  FileInputFormat.setInputPaths(p2_job, p1_output);
                  base.runJob(p2_job);

                  //
                  // Execute #3
                  //
                  Path p3_output = new Path(base.getOutputPath().toString() + "/phase3");
                  FileOutputFormat.setOutputPath(p3_job, p3_output);
                  FileInputFormat.setInputPaths(p3_job, p2_output);
                  base.runJob(p3_job);

                  // There does need to be a combine
                  if (base.getCombine()) base.runCombine();
                  return 0;
              }
          }
  16. Hive and Pig
      • Principle is the same: easy data retrieval
      • Both use MapReduce
      • Different founders: Facebook (Hive) and Yahoo (Pig)
      • Different language: SQL-like (Hive) and more procedural (Pig)
      • Both can store data in tables, which are stored as HDFS file(s)
      • Extra language options to use benefits of Hadoop
        – Partition by statement
        – Map/Reduce statement
      'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest are HiveQL'
  17. Hive
      Query 1:
          SELECT count_big(*) FROM lineitem
      Query 2:
          SELECT max(l_quantity) FROM lineitem
          WHERE l_orderkey>1000 and l_orderkey<100000
          GROUP BY l_linestatus
      [Bar chart: run time in secs. (axis 0 to 1500) for Query 1 and Query 2, Hive versus PDW; values shown: 1318, 1397, 252, 279]
  18. Demo
      • Use the same data file as previous demo
      • But now we directly 'query' the file
  19. Demo
  20. Polybase
      • PDW v2 introduces external tables to represent HDFS data
      • PDW queries can now span HDFS and PDW data
      • Hadoop cluster is not part of the appliance
      [Diagram: unstructured data (social apps, sensor & RFID, mobile apps, web apps) in HDFS and structured data in relational databases (RDBMS), both reached with T-SQL through the enhanced PDW query engine; Sqoop / Polybase]
  21. Polybase
      [Diagram: a PDW cluster of SQL Server nodes (labeled "This is PDW!") next to a Hadoop cluster of DataNodes (DN)]
  22. PDW Hadoop
      1. Retrieve data from HDFS with a PDW query
         – Seamlessly join structured and semi-structured data

             SELECT Username FROM ClickStream c, User u
             WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

      2. Import data from HDFS to PDW
         – Parallelized CREATE TABLE AS SELECT (CTAS)
         – External tables as the source
         – PDW table, either replicated or distributed, as destination

             CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
             AS SELECT URL, EventDate, UserID FROM ClickStream;

      3. Export data from PDW to HDFS
         – Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
         – External table as the destination; creates a set of HDFS files

             CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
             WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
             AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
  23. Recap
      • Hadoop is the next big thing for DWH/BI
      • Not a replacement, but a new dimension
      • Many ways to integrate its data
      • What's next?
        – Polybase combined with (custom) Map/Reduce?
        – HDInsight appliance?
        – Polybase for SQL Server vNext?
  24. References
      • Microsoft BigData (HDInsight): http://www.microsoft.com/bigdata
      • Microsoft HDInsight Azure (3 months free trial): http://www.windowsazure.com
      • Hortonworks Data Platform sandbox (VMware): http://hortonworks.com/download/
  25. Q&A
  26. Coming up…

      Speaker            Title                                                      Room
      Alberto Ferrari    DAX Query Engine Internals                                 Theatre
      Wesley Backelant   An introduction to the wonderful world of OData            Exhibition B
      Bob Duffy          Windows Azure For SQL folk                                 Suite 3
      Dejan Sarka        Excel 2013 Analytics                                       Suite 1
      Mladen Prajdić     From SQL Traces to Extended Events. The next big switch.   Suite 2
      Sandip Pani        New Analytic Functions in SQL server 2012                  Suite 4

      #SQLBITS