Big Data Analytics Projects - Real World with Pentaho

Published in: Technology

  1. 1. Big Data Analytics Projects in the Real World Mark Kromer Pentaho Big Data Analytics Product Manager @mssqldude @kromerbigdata http://www.kromerbigdata.com
  2. 2. 1. The Big Data Technology Landscape 2. Big Data Analytics 3. Big Data Analytics Scenarios: ❯ Digital Marketing Analytics • Hadoop, Aster Data, SQL Server ❯ Sentiment Analysis • MongoDB, SQL Server ❯ Data Refinery • Hadoop, MPP, SQL Server, Pentaho 4. SQL Server in the Big Data world (Quasi-Real World) What we’ll (try to) cover today
  3. 3. Big Data 101 3 V’s ❯ Volume – Terabyte records, transactions, tables, files ❯ Velocity – Batch, near-time, real-time (analytics), streams. ❯ Variety – Structured, unstructured, semi-structured, and all of the above in a mix Text Processing ❯ Techniques for processing and analyzing unstructured (and structured) LARGE files Analytics & Insights Distributed File System & Programming
  4. 4. Big Data ≠ NoSQL ❯ NoSQL has Internet-scale Web origins similar to those of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing ❯ Facebook, for example, uses HBase from the Hadoop stack ❯ NoSQL does not have to be Big Data Big Data ≠ Real Time ❯ Big Data is primarily about batch processing huge files in a distributed manner and analyzing data that was otherwise too complex to provide value ❯ Use in-memory analytics for real-time insights Big Data ≠ Data Warehouse ❯ I still refer to large multi-TB DWs as “VLDB” ❯ Big Data is about crunching stats in text files for discovery of new patterns and insights ❯ Use the DW to aggregate and store the summaries of those calculations for reporting Mark’s Big Data Myths
  5. 5. • Batch Processing • Commodity Hardware • Data Locality, no shared storage • Scales linearly • Great for large text file processing, not so great on small files • Distributed programming paradigm Hadoop 1.x
  6. 6. © Hortonworks Inc. 2014 Hadoop 1 vs Hadoop 2 HADOOP 1.0 (Single Use System: Batch Apps): HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing). HADOOP 2.0 (Multi Purpose Platform: Batch, Interactive, Online, Streaming, …): HDFS2 (redundant, highly-available & reliable storage); YARN (cluster resource management); MapReduce (data processing); Others.
  7. 7. © Hortonworks Inc. 2014 YARN: Taking Hadoop Beyond Batch Applications Run Natively in Hadoop HDFS2 (Redundant, Reliable Storage) YARN (Cluster Resource Management) BATCH (MapReduce) INTERACTIVE (Tez) STREAMING (Storm, S4,…) GRAPH (Giraph) IN-MEMORY (Spark) HPC MPI (OpenMPI) ONLINE (HBase) OTHER (Search) (Weave…) Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
  8. 8. © Hortonworks Inc. 2014 Tez – Introduction 1. Distributed execution framework targeted towards data-processing applications. 2. Based on expressing a computation as a dataflow graph. 3. Highly customizable to meet a broad spectrum of use cases. 4. Built on top of YARN – the resource management framework for Hadoop. 5. Open source Apache incubator project and Apache licensed.
  9. 9. © Hortonworks Inc. 2014 Tez – Deep Dive – DAG API DAG dag = new DAG(); Vertex map1 = new Vertex(MapProcessor.class); Vertex map2 = new Vertex(MapProcessor.class); Vertex reduce1 = new Vertex(ReduceProcessor.class); Vertex reduce2 = new Vertex(ReduceProcessor.class); Vertex join1 = new Vertex(JoinProcessor.class); ……. Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class); Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class); Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class); Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class); ……. dag.addVertex(map1).addVertex(map2) .addVertex(reduce1).addVertex(reduce2) .addVertex(join1) .addEdge(edge1).addEdge(edge2) .addEdge(edge3).addEdge(edge4); [Diagram: map1 → reduce1 → join1 and map2 → reduce2 → join1; edges labeled Scatter_Gather, Bipartite, Sequential] Simple DAG definition API
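
To make the dataflow graph on the Tez DAG slide concrete, here is a minimal Python sketch. It is not the Tez API: it only records the same five vertices and four edges and derives one valid execution order with the standard-library `graphlib` module (Python 3.9+). The `edges` list and `predecessors` dictionary are illustrative names, not anything from Tez.

```python
from graphlib import TopologicalSorter

# Edges copied from the slide: (source vertex, destination vertex).
edges = [
    ("map1", "reduce1"),
    ("map2", "reduce2"),
    ("reduce1", "join1"),
    ("reduce2", "join1"),
]

# TopologicalSorter expects a mapping of {vertex: set of predecessors}.
predecessors = {}
for src, dst in edges:
    predecessors.setdefault(src, set())
    predecessors.setdefault(dst, set()).add(src)

# Any order returned here runs each vertex only after all of its inputs.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

The two map/reduce branches have no edges between them, so a scheduler is free to run them in parallel; only `join1` must wait for both.
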
  10. 10. © Hortonworks Inc. 2014 YARN Eco-system Applications Powered by YARN Apache Giraph – Graph Processing Apache Hama - BSP Apache Hadoop MapReduce – Batch Apache Tez – Batch/Interactive Apache S4 – Stream Processing Apache Samza – Stream Processing Apache Storm – Stream Processing Apache Spark – Iterative applications Elastic Search – Scalable Search Cloudera Llama – Impala on YARN DataTorrent – Data Analysis HOYA – HBase on YARN Frameworks Powered By YARN Apache Twill REEF by Microsoft Spring support for Hadoop 2
  11. 11. Apache Spark High-Speed In-Memory Analytics over Hadoop ● Open Source ● Alternative to MapReduce for certain applications ● A low latency cluster computing system ● For very large data sets ● May be up to 100 times faster than MapReduce for – Iterative algorithms – Interactive data mining ● Used with Hadoop / HDFS ● Originally released under the BSD License; Apache-licensed since the project moved to the Apache Software Foundation
  12. 12. Popular Hadoop Distributions
  13. 13. Popular NoSQL Distributions Transactional-based, not analytical schemas
  14. 14. Popular MPP Distributions Big Data as distributed, scale-out, sharded data stores
  15. 15. Big Data Analytics Web Platform – RA 1
  16. 16. Sentiment Analysis Reference Architecture 2 Big Data Platforms Hadoop PDW MongoDB Social Media Sources Data Orchestration Data Models Analytical Models OLAP Cubes Data Mining OLAP Analytics Tools, Reporting Tools, Dashboards
  17. 17. Big Data Analytics
  18. 18. • Distributed Data (Data Locality) ❯ HDFS / MapReduce ❯ YARN / Tez ❯ Replicated / Sharded Data • MPP Databases ❯ Vertica, Aster, PDW, Greenplum … In-database analytics that can scale out with distributed processing across nodes • Distributed Analytics ❯ SAS: “Quickly solve complex problems using big data and sophisticated analytics in a distributed, in-memory and parallel environment.” http://www.sas.com/resources/whitepaper/wp_46345.pdf • In-memory Analytics ❯ Microsoft PowerPivot (Tabular models) ❯ SAP HANA ❯ Tableau Big Data Analytics Core Tenets
  19. 19. using Microsoft.Hadoop.MapReduce; using System.Text.RegularExpressions; public class TotalHitsForPageMap : MapperBase { public override void Map(string inputLine, MapperContext context) { context.Log(inputLine); var parts = Regex.Split(inputLine, @"\s+"); if (parts.Length != expected) /* only take records with all values */ { return; } context.EmitKeyValue(parts[pagePos], hit); } } MapReduce Framework (Map)
  20. 20. using System; using System.Collections.Generic; using System.Linq; using Microsoft.Hadoop.MapReduce; public class TotalHitsForPageReducerCombiner : ReducerCombinerBase { public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context) { context.EmitKeyValue(key, values.Sum(e=>long.Parse(e)).ToString()); } } public class TotalHitsJob : HadoopJob<TotalHitsForPageMap,TotalHitsForPageReducerCombiner> { public override HadoopJobConfiguration Configure(ExecutorContext context) { var retVal = new HadoopJobConfiguration(); retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT"); retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT"); retVal.DeleteOutputFolder = true; return retVal; } } MapReduce Framework (Reduce & Job)
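
The map, shuffle, and reduce steps behind the total-hits job on the two MapReduce slides can be sketched in plain Python without a cluster. This is an illustration of the pattern only, not the Microsoft .NET SDK for Hadoop. The slides do not define `expected`, `pagePos`, or `hit`, so the values of `EXPECTED_FIELDS` and `PAGE_POS`, the emitted count of 1, and the sample log lines are all assumptions.

```python
import re
from collections import defaultdict

EXPECTED_FIELDS = 3  # assumption: number of fields in a valid log line
PAGE_POS = 1         # assumption: index of the page column


def map_line(line):
    """Emit (page, 1) for a well-formed line and nothing otherwise."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], 1)]


def reduce_hits(page, counts):
    """Sum all counts emitted for one page."""
    return page, sum(counts)


def run_job(lines):
    grouped = defaultdict(list)  # shuffle step: group values by key
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_hits(key, values) for key, values in grouped.items())


# Hypothetical log lines, including one that the mapper should skip.
sample_log = [
    "2013-05-07 /index.html 200",
    "2013-05-07 /about.html 200",
    "2013-05-07 /index.html 304",
    "malformed line",
]
totals = run_job(sample_log)
print(totals)
```

On a real cluster the grouping step is performed by the framework between the map and reduce phases; here it is the `grouped` dictionary.
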
  21. 21. Linux-style hadoop fs shell commands to access data in HDFS Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv List files in HDFS: c:\Hadoop>hadoop fs -ls /import Found 1 items -rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv View file in HDFS: c:\Hadoop>hadoop fs -cat /import/sales.csv Kromer,123,5,55 Smith,567,1,25 Jones,123,9,99 James,11,12,1 Johnson,456,2,2.5 Singh,456,1,3.25 Yu,123,1,11 Now, we can work on the data with MapReduce, Hive, Pig, etc. Get Data into Hadoop
  22. 22. create external table ext_sales ( lastname string, productid int, quantity int, sales_amount float ) row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input'; LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales; Use Hive for Data Schema and Analysis
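
To show what the `ext_sales` schema gives you once it is laid over the raw file, the following Python sketch applies the same four column names to the `sales.csv` rows listed on the "Get Data into Hadoop" slide and totals `sales_amount` per `productid`. It only imitates, in memory, what a grouped query over the table would compute; the function name `total_sales_by_product` is hypothetical and no such query appears in the slides.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the sales.csv listing on the "Get Data into Hadoop" slide.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Column names taken from the ext_sales table definition.
COLUMNS = ["lastname", "productid", "quantity", "sales_amount"]


def total_sales_by_product(text):
    """Sum sales_amount per productid, applying the schema at read time."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    for row in reader:
        totals[int(row["productid"])] += float(row["sales_amount"])
    return dict(totals)


totals = total_sales_by_product(SALES_CSV)
for productid, amount in sorted(totals.items()):
    print(productid, amount)
```

The file itself stays an untyped text file; types and names are applied only when it is read, which is the same idea as a Hive external table.
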
  23. 23. sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1 > hadoop fs -cat /user/mark/customers/part-m-00000 > 5,Bob Smith sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec) 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records. Sqoop Data transfer to & from Hadoop & SQL Server
  24. 24. Role of NoSQL in a Big Data Analytics Solution ‣ Use NoSQL to store data quickly without the overhead of an RDBMS ‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few ‣ Why NoSQL? ‣ In the world of “Big Data” ‣ “Schema later” ‣ Ignore ACID properties ‣ Drop data into key-value store quick & dirty ‣ Worry about query & read later ‣ Why NOT NoSQL? ‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, MR interface ‣ SQL Server and NoSQL ‣ Not a natural fit ‣ Use HDFS or your favorite NoSQL database ‣ Consider turning off SQL Server locking mechanisms ‣ Focus on writes, not reads (read uncommitted)
  25. 25. MongoDB and Enterprise IT Stack EDW Hadoop Management & Monitoring Security & Auditing RDBMS CRM, ERP, Collaboration, Mobile, BI OS & Virtualization, Compute, Storage, Network RDBMS Applications Infrastructure Data Management Online Data Offline Data
  26. 26. { _id : ObjectId("4e2e3f92268cdda473b628f6"), sourceIDs: { ABCSystemIDPart1: 8397897, ABCSystemIDPart2: 2937430, ABCSystemIDPart3: 932018 }, accountType: "Checking", accountOwners: [ { firstName: "John", lastName: "Smith", contactMethods: [ { type: "phone", subtype: "mobile", number: 8743927394 }, { type: "mail", address: "58 3rd St.", city: … } ], possibleMatchCriteria: { govtID: 2938932432, fullName: "johnsmith", dob: … } }, { firstName: "Anne", maidenName: "Collins", lastName: "Smith", … } ], openDate: ISODate("2013-02-15 10:00"), accountFeatures: { Overdraft: true, APR: 20, … } } General document per customer per account OR creditCardNumber: 8392384938391293 OR mortgageID: 2374389 OR policyID: 18374923
  27. 27. Text Search Example (e.g. address typo so do fuzzy match) // Text search for address filtered by first name and NY > db.ticks.runCommand( "text", { search: "vanderbilt ave. vander bilt", filter: {name: "Smith", city: "New York"} })
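
The query on the text search slide handles the typo by listing both spellings in the search string. As a separate, client-side illustration of what "fuzzy match" means, the Python sketch below ranks candidate addresses by string similarity using the standard-library `difflib` module. This is not how MongoDB text search is implemented; the address list, the helper name `best_address_match`, and the 0.6 cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical list of known, correctly spelled addresses.
known_addresses = ["vanderbilt ave", "madison ave", "lexington ave", "park ave"]


def best_address_match(query, candidates, cutoff=0.6):
    """Return the most similar candidate, or None if nothing clears the cutoff."""
    matches = get_close_matches(query.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A mistyped address with a stray space still resolves to the intended street.
match = best_address_match("Vander Bilt Ave", known_addresses)
print(match)
```
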
  28. 28. //Find total value of each customer’s accounts for a given RM (or Agent) sorted by value db.accts.aggregate( { $match: {relationshipManager: "Smith"}}, { $group : { _id : "$ssn", totalValue: {$sum: "$value"} }}, { $sort: { totalValue: -1}} ) Aggregate: Total Value of Accounts
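
The three pipeline stages on the aggregation slide map onto a filter, a grouped sum, and a descending sort. The Python sketch below reproduces that logic over an in-memory list so each stage can be followed step by step. It does not use a MongoDB driver, and the sample account documents are hypothetical; only the field names come from the slide.

```python
from collections import defaultdict

# Hypothetical account documents; field names follow the pipeline on the slide.
accounts = [
    {"ssn": "111", "relationshipManager": "Smith", "value": 1000},
    {"ssn": "111", "relationshipManager": "Smith", "value": 250},
    {"ssn": "222", "relationshipManager": "Smith", "value": 5000},
    {"ssn": "333", "relationshipManager": "Jones", "value": 9000},
]


def total_value_by_customer(docs, manager):
    totals = defaultdict(int)
    for doc in docs:
        if doc["relationshipManager"] == manager:  # $match stage
            totals[doc["ssn"]] += doc["value"]     # $group stage with $sum
    rows = [{"_id": ssn, "totalValue": total} for ssn, total in totals.items()]
    # $sort stage: totalValue descending (-1)
    return sorted(rows, key=lambda row: row["totalValue"], reverse=True)


result = total_value_by_customer(accounts, "Smith")
print(result)
```
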
  29. 29. SQL Server Big Data – Data Loading Amazon HDFS & EMR Data Loading Amazon S3 Bucket
  30. 30. SQL Server Database ❯ SQL 2012 Enterprise Edition ❯ Page Compression ❯ 2012 Columnar Compression on Fact Tables ❯ Clustered Index on all tables ❯ Auto-update Stats Async ❯ Partition Fact Tables by month and archive data with sliding window technique ❯ Drop all indexes before nightly ETL load jobs ❯ Rebuild all indexes when ETL completes SQL Server Analysis Services ❯ SSAS 2012 Enterprise Edition ❯ 2008 R2 OLAP cubes partition-aligned with DW ❯ 2012 in-memory tabular cubes ❯ All access through MSMDPUMP or SharePoint SQL Server Big Data Environment
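
The sliding window technique mentioned for the monthly fact-table partitions comes down to a small calculation each month: which partitions fall out of the retention window and should be archived, and which new one must be created. The Python sketch below shows only that bookkeeping, not the SQL Server partition switch and merge operations themselves. The three-month retention period, the `(year, month)` partition keys, and both function names are assumptions.

```python
from datetime import date


def month_partitions(current, months_to_keep):
    """Return (year, month) keys for the retained window, oldest first."""
    keys = []
    year, month = current.year, current.month
    for _ in range(months_to_keep):
        keys.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(keys))


def slide_window(existing, current, months_to_keep):
    """Split the work into partitions to archive and partitions to create."""
    keep = set(month_partitions(current, months_to_keep))
    to_archive = sorted(p for p in existing if p not in keep)
    to_create = sorted(keep - set(existing))
    return to_archive, to_create


# Hypothetical state: three monthly partitions exist and February 2014 begins.
existing = [(2013, 11), (2013, 12), (2014, 1)]
to_archive, to_create = slide_window(existing, date(2014, 2, 1), 3)
print(to_archive, to_create)
```
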
  31. 31. SQL Server Big Data Analytics Features
  32. 32. DBA ETL/BI Developer Business Users & Executives Analysts & Data Scientists OPERATIONAL DATA BIG DATA DATA STREAM PUBLIC/PRIVATE CLOUDS Enterprise & Interactive Reporting Interactive Analysis Dashboards Predictive Analytics Pentaho Business Analytics Data Integration Instaview | Visual Map Reduce DIRECT ACCESS Pentaho Big Data Analytics
  33. 33. Pentaho Big Data Analytics Accelerate the time to big data value • Full continuity from data access to decisions – complete data integration & analytics for any big data store • Faster development, faster runtime – visual development, distributed execution • Instant and interactive analysis – no coding and no ETL required
  34. 34. Product Components Pentaho Data Integration • Visual development for big data • Broad connectivity • Data quality & enrichment • Integrated scheduling • Security integration Pentaho Analyzer • Visual data exploration • Ad hoc analysis • Interactive charts & visualizations Pentaho Dashboards • Self-service dashboard builder • Content linking & drill through • Highly customized mash-ups Pentaho Data Mining & Predictive Analytics • Model construction & evaluation • Learning schemes • Integration with 3rd-party models using PMML Pentaho Enterprise & Interactive Reports • Both ad hoc & distributed reporting • Drag & drop interactive reporting • Pixel-perfect enterprise reports Pentaho for Big Data MapReduce & Instaview • Visual Interface for Developing MR • Self-service big data discovery • Big data access to Data Analysts
  35. 35. ❯ Simple, easy-to-use visual data exploration ❯ Web-based thin client; in-memory caching ❯ Rich library of interactive visualizations • Geo-mapping, heat grids, scatter plots, bubble charts, line over bar and more • Pluggable visualizations ❯ Java ROLAP engine to analyze structured and unstructured data, with SQL dialects for querying data from RDBMSs ❯ Pluggable cache integrating with leading caching architectures: Infinispan (JBoss Data Grid) & Memcached Pentaho Interactive Analysis & Data Discovery Highly Flexible Advanced Visualizations
