Big data architecture: BI and Analytics (Part 2)


All you wanted to know about big data, hadoop technologies, olap on hadoop, datawarehouse, data visualization, etc.

Published in: Technology

Big data architecture: BI and Analytics (Part 2)

  1. 1. (Big-)Data Architecture (Re-)Invented Part 2: BI and Analytics William El Kaim May 2018 – V 4.0
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex Copyright © William El Kaim 2018
  3. 3. Big Data Technologies Copyright © William El Kaim 2018 3
  4. 4. • What is BI on Hadoop? • What is Big Datawarehouse? • What is Big Data Analytics? • What is Big Data ecosystem for Science? • What is Visual Analytics?
  5. 5. What is BI on Hadoop? Copyright © William El Kaim 2018 5Source: Dremio
  6. 6. ETL to Data Warehouse • Pros • Relational databases and their BI integrations are very mature • Use your favorite tools • Tableau, Excel, R, … • Cons • Traditional ETL tools don’t work well with modern data • Changing schemas, complex or semi-structured data, … • Hand-coded scripts are a common substitute • Data freshness • How often do you replicate/synchronize? • Data resolution • Can’t store all the raw data in the RDBMS (due to scalability and/or cost) • Need to sample, aggregate or time-constrain the data Source: Dremio
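The "data resolution" trade-off above can be illustrated with a small sketch. It assumes a hypothetical list of raw web events and shows the two reductions the slide names: time-constraining (dropping events older than a retention window) and aggregating (one total per hour and page). The event layout, the function name and the 7-day window are illustrative assumptions, not part of any ETL product.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw events: (ISO timestamp, page, bytes served).
RAW_EVENTS = [
    ("2018-05-01T10:05:00", "/home", 1200),
    ("2018-05-01T10:40:00", "/home", 800),
    ("2018-05-01T11:15:00", "/cart", 500),
    ("2018-04-01T09:00:00", "/home", 300),  # older than the retention window
]


def hourly_rollup(events, now, keep_days):
    """Drop events older than keep_days, then sum bytes per (hour, page)."""
    cutoff = now - timedelta(days=keep_days)
    totals = defaultdict(int)
    for ts, page, nbytes in events:
        when = datetime.fromisoformat(ts)
        if when < cutoff:
            continue  # time-constrain: this row never reaches the warehouse
        hour = when.replace(minute=0, second=0, microsecond=0)
        totals[(hour.isoformat(), page)] += nbytes  # aggregate
    return dict(totals)


ROLLUP = hourly_rollup(RAW_EVENTS, now=datetime(2018, 5, 2), keep_days=7)
```

Four raw rows become two warehouse rows. The per-event detail is lost, which is exactly the resolution cost the slide describes.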
  7. 7. Monolithic Tools • Single piece of software on top of Big Data • Performs both data visualization (BI) and execution • Utilize sampling or manual pre-aggregation to reduce the data volume that the user is interacting with • Saagie • Jethro • Looker • Arcadia • AtScale • Datameer • Platfora • Tamr • ZoomData Source: Modified from Platfora
  8. 8. Monolithic Tools • Pros • Only one tool to learn and operate • Easier than building and maintaining an ETL-to-RDBMS pipeline • Integrated data preparation in some solutions • Cons • Can’t analyze the raw data • Rely on aggregation or sampling before primary analysis • Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …) • Can’t run arbitrary SQL queries Source: Dremio
  9. 9. Monolithic Tool example: Saagie Copyright © William El Kaim 2018 9
  10. 10. SQL-on-Hadoop • The combination of a familiar interface (SQL) along with a modern computing architecture (Hadoop) enables people to manipulate and query data in new and powerful ways. • There’s no shortage of SQL on Hadoop offerings, and each Hadoop distributor seems to have its preferred flavor. • Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge. Source: Datanami & Cloudera & Dremio 10Copyright © William El Kaim 2018
  11. 11. SQL-on-Hadoop: Encoding Formats • The different encoding standards result in different block sizes, which can impact performance. • ORC files compress smaller than Parquet files, which can be a decisive factor. • Impala, for example, accesses HDFS data encoded in the Parquet format, while Hive and others support Optimized Row Columnar (ORC) files, sequence files, or plain text. • Semi-structured data formats such as JSON are gaining traction • Previously, Hadoop users relied on MapReduce to pound unstructured data into a more structured or relational format. • Drill opened up SQL-based access directly to semi-structured data such as JSON, a common format in NoSQL and SQL databases. Cloudera also recently added support for JSON in Impala. Source: Datanami
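To make "pounding semi-structured data into a relational format" concrete, here is a minimal sketch that flattens one nested JSON document into a single flat row with dotted column names. The document, the `flatten` helper and the dotted naming convention are assumptions chosen for illustration; engines such as Drill query the nested form directly and do not need this step.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


# Hypothetical semi-structured record.
DOC = '{"id": 7, "user": {"name": "ana", "geo": {"country": "FR"}}, "tags": ["a", "b"]}'
ROW = flatten(json.loads(DOC))
```

Note that the `tags` list survives as a single column value. Deciding how to explode repeated fields into rows is the part that makes hand-written flattening jobs costly.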
  12. 12. SQL-on-Hadoop: Taxonomy • SQL on Hadoop tools could be categorized as • Interactive or Native SQL • Batch & Data-Science SQL • OLAP Cubes (In-memory) on Hadoop 12Copyright © William El Kaim 2018
  13. 13. SQL-on-Hadoop: Native SQL • When to use it? • Excels at executing ad-hoc SQL queries and self-service data exploration, often run directly by data analysts, and at executing the machine-generated SQL code from BI tools like Qlik and Tableau. • Latency is usually measured in seconds to minutes. • One of the key differentiators among the interactive SQL-on-Hadoop tools is how they were built. • Some of the tools, such as Impala and Drill, were developed from the beginning to run on Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran on vendors’ massively parallel processing (MPP) databases. Source: Datanami
  14. 14. SQL-on-Hadoop: Native SQL • Pros • Highest performance for Big Data workloads • Connect to Hadoop and also NoSQL systems • Make Hadoop “look like a database” • Cons • Queries may still be too slow for interactive analysis on many TB/PB • Can’t defeat physics • Interactive • In 2012, Cloudera rolled out the first release of Apache Impala • MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google’s Dremel. • Presto (created by Facebook, now backed by Teradata) • VectorH (backed by Actian) • Apache Hawq (backed by Pivotal) • Apache Phoenix • BigSQL (backed by IBM) • Big Data SQL (backed by Oracle) • Vertica SQL on Hadoop (backed by Hewlett-Packard) Source: Datanami & Dremio
  15. 15. SQL-on-Hadoop: Batch & Data Science SQL • When to use it? • Most often used for running big and complex jobs, including ETL and production data “pipelines,” against massive data sets. • Apache Hive is the best example of this tool category. The software essentially recreates a relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache Tez) as an intermediate processing layer. • Tools • Apache Hive, Apache Tez, Apache Spark SQL • Pros • Potentially simpler deployment (no daemons) • New YARN job (MapReduce/Spark) for each query • Check-pointing support enables very long-running queries • Days to weeks (ETL work) • Works well in tandem with machine learning (Spark) • Cons • Latency prohibitive for interactive analytics • Tableau, Qlik Sense, … • Slower than native SQL engines Source: Datanami & Dremio
  16. 16. SQL-on-Hadoop: OLAP Cubes on Hadoop • When to use it? • Data scientists doing self-service data exploration needing performance (in milliseconds to seconds). • Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with competition. • Often requires an in-memory computing architecture • Tools • Apache Kylin, Apache Lens, AtScale, Druid, Kyvos Insights • In-memory: Spark SQL, Apache Flink, Kognitio On Hadoop • Other Options To Investigate: • SnappyData (Strong SQL, In-Memory Speed, and GemfireXD history) • Apache HAWQ (Strong SQL support and Greenplum history) • Splice Machine (Now Open Source) • Hive LLAP is moving into OLAP; SQL 2011 support is growing and so is performance. • Apache Phoenix may be able to do basic OLAP with some help from the Saiku OLAP BI tool. • Most tools use Apache Calcite Source: Dremio
  17. 17. SQL-on-Hadoop: OLAP Cubes on Hadoop • Pros • Fast queries on pre-aggregated data • Can use SQL and MDX tools • Cons • Explicit cube definition/modeling phase • Not “self-service” • Frequent updates required due to dependency on business logic • Aggregation creation and maintenance can be long (and large) • User connects to and interacts with the cube • Can’t interact with the raw data
  18. 18. SQL-on-Hadoop: OLAP Cubes on Hadoop • Apache Kylin lets you query massive data sets at sub-second latency in 3 steps. 1. Identify star schema data on Hadoop. 2. Build a cube on Hadoop. 3. Query data with ANSI SQL and get results via ODBC, JDBC or RESTful API. Source: Apache Kylin
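The cube idea behind these steps can be sketched in a few lines: pre-aggregate a measure for every combination of dimensions, then answer queries by lookup instead of scanning the facts. This is a toy model under stated assumptions (an in-memory fact list, a single SUM measure, hypothetical `build_cube` and `query` helpers); it is not Kylin's API or storage layout.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "region", "product")

# Hypothetical fact table of a star schema, already joined to its dimensions.
FACTS = [
    {"year": 2017, "region": "EU", "product": "A", "sales": 10},
    {"year": 2017, "region": "US", "product": "A", "sales": 20},
    {"year": 2018, "region": "EU", "product": "B", "sales": 5},
    {"year": 2018, "region": "EU", "product": "A", "sales": 7},
]


def build_cube(facts, dimensions, measure):
    """Pre-aggregate the measure for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in facts:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, **filters):
    """Answer a SUM query with a single lookup in the matching cuboid."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube[dims].get(tuple(filters[d] for d in dims), 0)


CUBE = build_cube(FACTS, DIMENSIONS, "sales")
```

With three dimensions the cube holds 2^3 = 8 cuboids, so `query(CUBE, region="EU")` and `query(CUBE, year=2018, product="A")` are dictionary lookups. The same exponential growth is why cube creation and maintenance "can be long (and large)".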
  19. 19. SQL-on-Hadoop: OLAP Cubes on Hadoop 19Copyright © William El Kaim 2018
  20. 20. SQL-on-Hadoop: OLAP Cubes on Hadoop Source: AtScale 20Copyright © William El Kaim 2018
  21. 21. SQL-on-Hadoop Synthesis • Pros • Continue using your favorite BI tools and SQL-based clients • Tableau, Qlik, Power BI, Excel, R, SAS, … • Technical analysts can write custom SQL queries • Cons • Another layer in your data stack • May need to pre-aggregate the data depending on your scale • Need a separate data preparation tool (or custom scripts) Copyright © William El Kaim 2018 21Source: Dremio
  22. 22. BI on Hadoop: Decision Model Copyright © William El Kaim 2018 22Source: Dremio
  23. 23. Analytics & Business Intelligence Platforms Copyright © William El Kaim 2018 23
  24. 24. Analytics & Business Intelligence Platforms • Tableau, QlikView and Jethro (SQL Acceleration Engine for BI on Big Data compatible with BI tools like Tableau and Qlik). • Alteryx, Birst, Datawatch, Domo, GoodData, Looker, PyramidAnalytics, Saagie and ZoomData are increasingly encroaching on the territory once claimed by Qlik and Tableau. • At the same time, a new crop of Hadoop- and Spark-based BI tools from the likes of Platfora, Datameer, and Clearstory Data appeared on the market. • And the old guard is still there: SAP Lumira, Microsoft Power BI, SAS Visual Analytics • And open source tools like Datawrapper
  25. 25. Enterprise BI platforms Copyright © William El Kaim 2018 25 Enterprise BI platforms with majority cloud deployments (Q3 2017) Enterprise BI platforms with majority On-Premises deployments (Q3 2017)
  26. 26. • What is BI on Hadoop? • What is Big Datawarehouse? • What is Big Data Analytics? • What is Big Data ecosystem for Science? • What is Visual Analytics? 26Copyright © William El Kaim 2018
  27. 27. Today’s EDW challenges • Conventional data warehouse solutions were not designed to handle the volume, variety, and complexity of today's data. • Data sources are more numerous and varied, resulting in more diverse data structures that must co-exist in a single location to enable exhaustive and affordable analysis. • Traditional architectures inherently cause competition between users and data integration activities, making it difficult to simultaneously pipe new data into the data warehouse and provide users with adequate performance. • Scaling up a conventional data warehouse to meet today's increasing storage and workload demands, when possible, is expensive, painful, and slow. • The more recent, alternative data platforms are often complex, requiring specialized skills and lots of tuning and configuration. This struggle worsens when trying to handle the growing number and diversity of data sources, users, and queries. Copyright © William El Kaim 2018 27
  28. 28. Data Lake vs. Enterprise Data Warehouse • Hadoop data lakes and other big data systems capture a lot of attention and headlines these days, but data warehouses still have their place in most organizations, for supporting analysis of both current and historical data. Copyright © William El Kaim 2018 28Source: Zaloni
  29. 29. Data Lake vs. Enterprise Data Warehouse Source: Martin Fowler 29Copyright © William El Kaim 2018
  30. 30. Data Lake vs. Enterprise Data Warehouse Source: Platfora 30Copyright © William El Kaim 2018
  31. 31. New technologies for EDW • Cloud is a key factor driving the evolution of the modern data warehouse. • Access to near-infinite, low-cost storage and processing scalability • Outsourcing of data warehousing management and security to the cloud vendor • Pay for only the storage and computing resources actually used. • Massively parallel processing (MPP) • MPP, which emerged in the previous decade, involves dividing a single computing operation to execute simultaneously across a large number of separate computer processors. • This division of labor facilitates faster storage and analysis of data when software is built to capitalize on this approach. • Columnar storage • Traditionally, databases stored records in rows, similar to how a spreadsheet appears. • With columnar storage, each data element of a record is stored in a column. With this approach, a user can query just one data element.
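The row-versus-column distinction can be shown with plain Python lists. The sketch below assumes a tiny hypothetical table and a `to_columns` helper; real columnar engines add compression and on-disk layout on top of the same idea.

```python
# Row-oriented layout: one complete record per entry, like spreadsheet rows.
ROWS = [
    {"id": 1, "country": "FR", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "FR", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


COLUMNS = to_columns(ROWS)

# A query on one attribute touches a single list instead of every full record.
TOTAL_AMOUNT = sum(COLUMNS["amount"])
```

Summing `amount` reads three numbers in the columnar layout, whereas the row layout would have to walk through `id` and `country` as well.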
  32. 32. New technologies for EDW • Vectorized processing • This form of data processing for data analytics takes advantage of the recent and revolutionary computer chip designs. • Delivers much faster performance versus older data warehouse solutions built decades ago with older, slower hardware technology. • Solid state drives (SSDs) • Unlike hard disk drives (HDDs), SSDs store data in flash memory chips, which accelerates data storage, retrieval, and analysis. • A solution that takes advantage of SSDs can deliver significantly better performance. Copyright © William El Kaim 2018 32
  33. 33. Three types of “Big” Data Warehouses • Traditional data warehouse software deployed on cloud infrastructure • Very similar to a conventional data warehouse, as it reuses the original code base, so you still need IT expertise to build and manage the data warehouse. • While you do not have to purchase and install the hardware and software, you may still have to do significant configuration and tuning, and perform operations such as regular backups. • Traditional data warehouse hosted and managed in the cloud by a third party as a managed service • The third-party provider supplies the IT expertise, but you're still likely to experience many of the same limitations of a conventional data warehouse. • The data warehouse is hosted on hardware installed in a data center managed by the vendor. This is similar to what the industry referred to as an ASP, or application service provider. • A true SaaS data warehouse • Often referred to as data-warehousing-as-a-service (DWaaS): the vendor delivers a complete cloud data warehouse solution that includes all hardware and software and the IT and database administration (DBA) expertise required. • Clients typically pay only for the storage and computing resources they use, when they use them. This option should scale up and down on demand.
  34. 34. Big Datawarehouse Tools • Amazon Redshift • Bedrock Data Fusion • BlazingDB • Druid • ClickHouse • Google Big Query • IBM Db2 Warehouse on Cloud • Infoworks • MammothDB (part of MariaDB) • Microsoft Azure SQL Data Warehouse • Oracle Autonomous Data Warehouse Cloud Services • Panoply • Pivotal GreenPlum • Snowflake • SAP Business Warehouse (SAP BW) • Vertica Cloud Analytics Platform Copyright © William El Kaim 2018 34
  35. 35. Big Datawarehouse Tools Copyright © William El Kaim 2018 35Source: sonria
  36. 36. Big Datawarehouse Example Source: InfoworksCopyright © William El Kaim 2018 36
  37. 37. Decision Model example Copyright © William El Kaim 2018 37
  38. 38. Data Lake and EDW will coexist! Source: HortonWorks 38Copyright © William El Kaim 2018
  39. 39. • What is BI on Hadoop? • What is Big Datawarehouse? • What is Big Data Analytics? • What is Big Data ecosystem for Science? • What is Visual Analytics? 39Copyright © William El Kaim 2018
  40. 40. Data Analytics and Science Copyright © William El Kaim 2018 40
  41. 41. From Hindsight to Insight to Foresight “Big data, machine learning, business intelligence, predictive analytics and several other types of analytical activity serve this single purpose - to make more effective, efficient and timely decisions.” Source: Butler 2017 “The analytics layer is where intelligence is added to the data ... Until data enter this layer, they are effectively dumb, with no added value.”
  42. 42. Four types of Data Analytics Copyright © William El Kaim 2018 42
  43. 43. From Hindsight to Insight to Foresight Copyright © William El Kaim 2018 43
  44. 44. Predictive Analytics Source: wikipedia Source: Forrester 44Copyright © William El Kaim 2018
  45. 45. Finding the Best Approach Source: kdnuggets 45Copyright © William El Kaim 2018
  46. 46. Some Definitions • Artificial Intelligence (AI): Human Intelligence Exhibited by Machines • Intelligence exhibited by machines • Broadly defined to include any simulation of human intelligence • Expanding and branching areas of research, development, and investment • Includes robotics, rule-based reasoning, natural language processing (NLP), knowledge representation techniques (knowledge graphs) … • Machine Learning (ML): An Approach to Achieve Artificial Intelligence • Subfield of AI that aims to teach computers the ability to do tasks with data, without explicit programming • Uses numerical and statistical approaches, including artificial neural networks to encode learning in models • Models built using “training” computation runs or through usage • Deep Learning (DL): A Technique for Implementing Machine Learning • Subfield of ML that uses specialized techniques involving multi-layer (2+) artificial neural networks • Layering allows cascaded learning and abstraction levels (e.g. line -> shape -> object -> scene) • Computationally intensive enabled by clouds, GPUs, and specialized HW such as FPGAs, TPUs, etc. Copyright © William El Kaim 2018 46
  47. 47. What is Machine Learning? • Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. • Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions. • Machine learning is ideal for exploiting the opportunities hidden in big data. Source: Rubén Casado Tejedor 47Copyright © William El Kaim 2018
  48. 48. Machine Learning: Terminology • Observations • Items or entities used for learning or evaluation (e.g., emails) • Features • Attributes (typically numeric) used to represent an observation (e.g., length, date, presence of keywords) • Labels • Values / categories assigned to observations (e.g., spam, not-spam) • Training and Test Data • Observations used to train and evaluate a learning algorithm (e.g., a set of emails along with their labels) • Training data is given to the algorithm for training while Test data is withheld at train time Source: Rubén Casado Tejedor
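These terms map directly onto a few lines of code. The sketch below builds hypothetical email observations as (features, label) pairs and withholds a test set; the feature names, the 30% test fraction and the `train_test_split` helper are illustrative assumptions (libraries such as scikit-learn ship their own version of this utility).

```python
import random

# Hypothetical observations: each email is a (features, label) pair.
EMAILS = [
    ({"length": n * 10, "has_keyword": n % 2 == 0},
     "spam" if n % 2 == 0 else "not-spam")
    for n in range(10)
]


def train_test_split(observations, test_fraction, seed):
    """Shuffle a copy of the observations and withhold a fraction for testing."""
    shuffled = observations[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


TRAIN, TEST = train_test_split(EMAILS, test_fraction=0.3, seed=42)
```

The learning algorithm only ever sees `TRAIN`; `TEST` is kept aside so that evaluation measures generalization rather than memorization.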
  49. 49. Machine Learning: Types • Supervised Learning: Learning from labelled observations • Data is tagged • Tagging may require the help of an expert to prepare the training set. • Expertise is needed before machine learning. • The challenge is the generalization of the model • Algorithms: Classification - Regression / Prediction - Recommendation • Unsupervised Learning: Learning algorithm must find latent structure from features alone. • Output values (i.e., the tags and their nature) are not known • Some of the attributes might not be homogeneous amongst all the samples • Expertise is needed after machine learning, to interpret the results and name the discovered categories • The challenge is understanding the output classification • Algorithms: generally group inputs by similarities (creating clusters) • Clustering - Dimensionality Reduction - Anomaly detection
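As an illustration of the unsupervised case, here is a minimal k-means sketch on one-dimensional values: no labels are given, and the algorithm groups inputs by similarity. The data, the starting centers and the `kmeans_1d` helper are assumptions for illustration; naming the resulting clusters is left to a human, as the slide notes.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on scalars: assign to nearest center, then recompute means."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the previous center when a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


VALUES = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
CENTERS, CLUSTERS = kmeans_1d(VALUES, centers=[0.0, 5.0])
```

The two discovered groups are just lists of numbers; deciding that one means "small orders" and the other "large orders" is the post-hoc expertise the slide refers to.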
  50. 50. Machine Learning: Types The two phases of machine learning: • TRAIN a model • PREDICT with a model Source: Rubén Casado Tejedor Source: Louis Dorard
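The two phases can be sketched with a nearest-centroid classifier, one of the simplest supervised models. Everything here is an illustrative assumption: the two numeric features, the labelled examples and the `train` / `predict` function names.

```python
def train(examples):
    """TRAIN: learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """PREDICT: return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical features: (number of links, number of exclamation marks).
TRAINING = [((8, 5), "spam"), ((6, 7), "spam"),
            ((1, 0), "not-spam"), ((0, 1), "not-spam")]
MODEL = train(TRAINING)
```

After training, `predict(MODEL, (7, 4))` and `predict(MODEL, (1, 1))` classify unseen emails using only the stored centroids; the training data itself is no longer needed.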
  51. 51. Machine Learning: Use Cases • Programming computers to perform an action using example data or past experience • learn from and make predictions on data • It is used when: • Human expertise does not exist (e.g. Navigating on Mars) • Humans are unable to explain their expertise (e.g. Speech recognition) • Solution changes in time (e.g. Routing on a computer network) • Solution needs to be adapted to particular cases (e.g. User biometrics) Source: Rubén Casado Tejedor 51Copyright © William El Kaim 2018
  52. 52. Facebook’s Field Guide to Machine Learning video series Copyright © William El Kaim 2018 52Source: Facebook
  53. 53. ML Example: Wendelin Copyright © William El Kaim 2018 53
  54. 54. ML Example: Scikit-learn Copyright © William El Kaim 2018 54Source: scikit-learn
  55. 55. Machine Learning as a Service • Open Source • Accord (Dotnet), Apache Mahout, Apache Samoa, Apache Spark MLlib and Mlbase, Apache SystemML, Cloudera Oryx, GoLearn (Go), H2O, Photon ML, R Hadoop, Scikit-learn (Python), Seldon, Shogun (C++), Google TensorFlow, Weka. • Available as a Service • Algorithmia, Amazon ML, BigML, DataRobot, FICO, Google Prediction API, HPE Haven OnDemand, IBM’s Watson Analytics, Microsoft Machine Learning Studio, PurePredictive, Predicsis, Yottamine. • Examples • BVA with Microsoft Azure ML • Quick Review of Amazon Machine Learning • BigML training Series • Handling Large Data Sets with Weka: A Look at Hadoop and Predictive Models
  56. 56. ML Example: Microsoft Azure ML Cloud Source: Microsoft 56Copyright © William El Kaim 2018
  57. 57. Big Data: Azure Machine Learning Source: Microsoft 57Copyright © William El Kaim 2018
  58. 58. What is Data Science? • Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms • Also known as Predictive or Advanced Analytics • Algorithmic and computational techniques and tools for handling large data sets • Increasingly focused on preparing and modeling data for ML & DL tasks • Encompasses statistical methods, data manipulation and streaming technologies (e.g. Spark, Hadoop) • The EDISON Data Science Framework is a collection of documents that define the Data Science profession.
  59. 59. Data Science Maturity Model Copyright © William El Kaim 2018 59Source: Domino
  60. 60. Data Science Approaches: Sampling • When to Use it? • Only data exploration / data understanding • Early prototyping on prepared and clean data • Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and binary prediction target) • When NOT to use it? • Large number of columns in the data • Need to blend large data sets (e.g. large-scale joins) • Complex Machine Learning models • Looking for rare events • Pros • Simple and easy to start with • Usually works well for data exploration and early prototyping • Some ML models would not benefit from more data anyway • Cons • Many ML models would benefit from more data • Cannot be used when large scale data preparation is needed • Hadoop is used as a data repository only • Key Process • Data Movement: Pulls sample data from HDFS/Hive/Impala • Data Processing: In the analytics tool Source: RapidMiner 60Copyright © William El Kaim 2018
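One way to pull a fixed-size sample without loading the full data set is reservoir sampling. The sketch below uses the classic "Algorithm R" over any iterable; the `reservoir_sample` name, the seed and the use of `range` as a stand-in for rows streamed from HDFS/Hive are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (index + 1)
    return sample


# range() stands in for rows streamed out of the data repository.
SAMPLE = reservoir_sample(range(100_000), k=5)
```

Memory use stays at `k` items regardless of how large the source is, which fits the "Hadoop as a data repository only" pattern. Rare events, as the slide warns, are still likely to be missed by a small sample.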
  61. 61. Data Science Approaches: Grid Computing • When to Use it? • Task can be performed on smaller, independent data subsets • Compute-intensive data processing • When NOT to use it? • Data-intensive data processing • Complex Machine Learning models • Lots of interdependencies between data subsets • Pros • Hadoop is used for parallel processing in addition to serving as a data source • Cons • Only works if data subsets can be processed independently • Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations • Key Process • Data Movement: Only results are moved, data remains in Hadoop • Data Processing: Custom single-node application running on multiple Hadoop nodes Source: RapidMiner
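The grid pattern can be sketched as the same single-node function applied to independent partitions in parallel. In this illustration a thread pool stands in for the cluster nodes, and `score_partition` is a hypothetical single-node engine; the partitions and the doubling "model" are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(rows):
    """Stand-in for a single-node engine: works on one subset, needs no other data."""
    return [(row["id"], row["value"] * 2) for row in rows]


# Hypothetical data subsets that can be processed independently.
PARTITIONS = [
    [{"id": 1, "value": 10}, {"id": 2, "value": 20}],
    [{"id": 3, "value": 30}],
    [{"id": 4, "value": 40}, {"id": 5, "value": 50}],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves partition order; only the results travel back.
    RESULTS = [pair for part in pool.map(score_partition, PARTITIONS) for pair in part]
```

The approach breaks down as soon as `score_partition` would need to look at rows from another partition, which is the interdependency limitation listed above.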
  62. 62. Data Science Approaches: Native Distributed Algorithms • When to Use it? • Complex Machine Learning models needed • Lots of interdependencies inside the data (e.g. graph analytics) • Need to blend and cleanse large data sets (e.g. large-scale joins) • When NOT to use it? • Data is not that large • Sample would reveal all interesting patterns • Pros • Holistic view of all data and patterns • Highly scalable distributed processing optimized for Hadoop • Cons • Limited set of algorithms available, very hard to develop new algorithms • Key Process • Data Movement: Only results are moved, data remains in Hadoop • Data Processing: Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc. Source: RapidMiner
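A natively distributed algorithm has to be expressed so that partial results from each node can be merged into a global answer. As a minimal sketch, a global mean and variance can be computed from per-partition summaries of (count, sum, sum of squares). The partition contents and the `partial_stats`, `merge` and `finalize` helper names are assumptions, and this textbook formula is chosen for brevity rather than numerical robustness.

```python
from functools import reduce


def partial_stats(values):
    """Runs where the data lives: summarize one partition as (n, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two partial summaries; associative, so merge order does not matter."""
    return tuple(x + y for x, y in zip(a, b))


def finalize(stats):
    """Turn the merged summary into the global mean and population variance."""
    n, total, total_sq = stats
    mean = total / n
    return mean, total_sq / n - mean * mean


PARTS = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
MEAN, VARIANCE = finalize(reduce(merge, map(partial_stats, PARTS)))
```

Only three numbers per partition are moved, yet the result reflects all the data. Rewriting more complex models in this mergeable form is what makes new distributed algorithms hard to develop.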
  63. 63. Data Science Notebooks • Altair provides a way to write declarative data visualization code in Python by harnessing the power of Vega and Vega-Lite. • Anaconda: Python data science and machine-learning • Benchling LabNotebook (SaaS) • IBM Data Science Experience (RStudio, Jupyter and Python in a configured, collaborative environment) • LabNotebook is a tool that allows you to flexibly monitor, record, save, and query all your machine learning experiments. • Mode: R, Python, SQL
  64. 64. Data Science Notebooks: Apache Zeppelin Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
  65. 65. Data Science Tools • Magic Quadrant for Data Science Platforms • Magic Quadrant for Data Science and Machine-Learning Platforms
  66. 66. Data Science Integrated Platforms • Alteryx: Unified machine-learning platform • Datawatch Angoss: Analytics and Data science • Blue DME: Prescriptive analytics and Data science • Cloudera Data Science Workbench: Self-service data science for the enterprise • Coheris Data Intelligence: Analytics and Data science • Databricks: Unified analytics platform • Dataiku: Collaborative Data science and Machine-Learning • Domino: Collaborative Data science and Machine-Learning • DataRobot: Automated Machine Learning • Dremio: connects to your data sources directly, and supports all your favorite BI tools and advanced languages like Python/Pandas, R, and Apache Spark • Google Cloud Datalab, Cloud Dataprep and Google Data Studio • open-source machine-learning platform (with Auto-model) • Hypercube (from BearingPoint) • IBM SPSS (predictive and statistics models) & IBM Data Science Experience (RStudio, Jupyter and Python in a configured, collaborative environment) & IBM Cloud Private for Data • KNIME: Analytics and Data science • MathWorks: MATLAB and Simulink • Automated Machine Learning Platform • RapidInsights Analytics: Automated predictive modeling • RapidMiner: Unified Data Science Platform (with Auto-model) • Saagie: Automated Machine Learning Platform • SAP: Predictive Analytics platform and SAP Leonardo Machine Learning Foundation (PaaS) • TIMi: a user-friendly GUI tool for predictive modelling
  67. 67. RapidMiner Example Source: RapidMiner 67Copyright © William El Kaim 2018
  68. 68. Scalable Data Science with R • Hadoop: Analyze data with Hadoop through R code (RHadoop) • rhdfs to interact with HDFS systems; • rhbase to connect with HBase; • plyrmr to perform common data transformation operations over large datasets; • rmr2 that provides a map-reduce API; • and ravro that writes and reads Avro files. • Spark: with SparkR • It is possible to use Spark’s distributed computation engine to enable large-scale data analysis from the R shell. It provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc., on large data sets. • Programming with Big Data in R • The “Programming with Big Data in R” project (pbdR) is based on MPI and can be used on high-performance computing (HPC) systems, providing a true parallel programming environment in R. Source: Federico Castanedo
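For readers unfamiliar with the map-reduce model that packages such as rmr2 expose, the sketch below shows its three stages (map, shuffle by key, reduce) on a word count. It is written in Python for consistency with the other examples, runs in a single process, and the `map_reduce` helper is a hypothetical illustration rather than the rmr2 API.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process sketch of the map, shuffle and reduce stages."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group emitted values by key
    return {key: reducer(key, values) for key, values in groups.items()}


LINES = ["big data on hadoop", "sql on hadoop", "big sql"]

COUNTS = map_reduce(
    LINES,
    mapper=lambda line: ((word, 1) for word in line.split()),
    reducer=lambda word, ones: sum(ones),
)
```

On a cluster, the mapper calls run next to the data blocks and the shuffle moves the grouped values between nodes; the user-supplied functions stay the same.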
  69. 69. Scalable Data Science with R • After the data preparation step, the next common data science phase consists of training machine learning models, which can also be performed on a single machine or distributed among different machines. • In the case of distributed machine learning frameworks, the most popular approaches using R are the following: • Spark MLlib: through SparkR, some of the machine learning functionalities of Spark are exported in the R package. • H2O framework: a Java-based framework that allows building scalable machine learning models in R or Python. • Apache MADlib (incubating): Big Data Machine Learning in SQL
  70. 70. • What is BI on Hadoop? • What is Big Datawarehouse? • What is Big Data Analytics? • What is Big Data ecosystem for Science? • What is Visual Analytics? 70Copyright © William El Kaim 2018
  71. 71. Big Data Ecosystem For Science • Large-scale data management is essential for experimental science and has been for many years. Telescopes, particle accelerators and detectors, and gene sequencers, for example, generate hundreds of petabytes of data that must be processed to extract secrets and patterns in life and in the universe. • The data technologies used in these various science communities often predate those in the rapidly growing industry big data world, and, in many cases, continue to develop independently, occupying a parallel big data ecosystem for science, supported by the National Energy Research Scientific Computing Center (NERSC). • Across these projects we see a common theme: data volumes are growing, and there is an increasing need for tools that can effectively store and process data at such a scale. • In some cases, the projects could benefit from big data technologies being developed in industry, and in some other projects, the research itself will lead to new capabilities. Source: Wahid Bhimji on O’Reilly
  72. 72. Big Data Ecosystem For Science Copyright © William El Kaim 2018 72Source: Wahid Bhimji on O’Reilly
  73. 73. Big Data Ecosystem For Science • Data Format • ROOT offers a self-describing binary file format with huge flexibility for serialization of complex objects and column-wise data access. • The HDF5 format enables more efficient processing of simulation output thanks to its parallel input/output (I/O) capabilities. • Data Federation • The XrootD data access protocol allows all data to be accessed in a single global namespace and served through a mechanism that is both fault-tolerant and high-performance. • Data Management • Big PanDA runs analyses that allow thousands of collaborators to run hundreds of thousands of processing steps on exabytes of data, as well as monitor and catalog that activity.
  74. 74. • What is BI on Hadoop? • What is Big Datawarehouse? • What is Big Data Analytics? • What is Big Data ecosystem for Science? • What is Visual Analytics? 74Copyright © William El Kaim 2018
  75. 75. What is Visual Analytics? • Visual Analytics • is the act of finding meaning in data using visual artifacts such as charts, graphs, maps and dashboards. • In addition, the user interface is typically driven by drag and drop actions using wholly visual constructs. Copyright © William El Kaim 2018 75
  76. 76. What is Data Visualization? • Data visualization • is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. • With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed. Copyright © William El Kaim 2018 76
  77. 77. Data Visualization Cheat Sheet Copyright © William El Kaim 2018 77
  78. 78. Dataviz Resources The Data Visualisation Catalogue Dataviz Tools Copyright © William El Kaim 2018 78
  79. 79. Data Visualization Software • Four dominant modes of analysis: descriptive (traditional BI), discovery (looking for unknown facts), predictive (finding consistent patterns that can be used in future activities), and prescriptive (actions that can be taken to improve performance). Source: ButlerAnalytics • BeyondCore, BIME, ClearStory, DOMO, GoodData, Inetsoft, InfoCaptor, Logi Analytics, Looker, Microsoft Power BI, Microstrategy, Prognoz, Qlik Sense, SAP Lumira, SAS Visual Analytics, Sisense, Spotfire, Tableau, ThoughtSpot, Yellowfin. Source: ButlerAnalytics Copyright © William El Kaim 2018 79
  80. 80. Other Data Visualization Software • For Non Developers • Apache Superset • ChartBlocks • Datawatch Panopticon • Infogram • LinPack for Tableau • Looker • Plotly • Raw • Tom Sawyer Software • ZoomData • For Developers • D3.js, Infovis, Leaflet, NVD3, Processing.js, Recline.js, visualize.js • Chart.js, Chartist.js, Ember Charts, Google Charts, FusionCharts, Highcharts, n3-charts, Sigma JS, Polymaps • More • curated list • ProfitBricks list • Dedicated libraries are also available for Python, Java, C#, Scala, etc.
  81. 81. Other Data Visualization Software Copyright © William El Kaim 2018 81
  82. 82. Geo-Spatial-on-Hadoop • ESRI • ESRI for Big Data • Esri GIS tools for Hadoop: Toolkit allowing developers to build analytical tools leveraging both Hadoop and ArcGIS. • Esri User Defined Functions built on top of the Esri Geometry API • Pigeon: spatial extension to Pig that allows it to process spatial data. • Hive Spatial Query: adds geometric user-defined functions (UDFs) to Hive. • GeoMesa • GeoMesa is an open-source, distributed, spatio-temporal database built on Accumulo, HBase, Cassandra, and Kafka. • SpatialHadoop • Open source MapReduce extension designed specifically to handle huge datasets of spatial data on Apache Hadoop. • SpatialHadoop is shipped with a built-in spatial high-level language, spatial data types, spatial indexes and efficient spatial operations.
  83. 83. Geo-Spatial-on-Hadoop • GeoDataViz • CartoDB • Deep Insights technology is capable of handling and visualizing massive amounts of contextual and time-based location data. • Spatialytics • Standard geoBI platform • MapD • Leverages GPUs and a dedicated NoSQL database for better performance • (Uber) • WebGL-powered framework for visual exploratory data analysis of large datasets. • Data Converter • ESRI GeoJSON Utils • GDAL: Geospatial Data Abstraction Library • Redis • Open source (BSD licensed), in-memory data structure store, used as database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. • Tutorial / Examples • How To Analyze Geolocation Data with Hive and Hadoop – Uber trips • Geo spatial data support for Hive using Taxi data in NYC • ESRI Wiki
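A "radius query" of the kind mentioned for Redis can be illustrated with the haversine great-circle distance. The sketch below is a linear scan over a small hypothetical dictionary of places, not an index and not the Redis API; a real geospatial index avoids comparing against every point.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(points, lat, lon, radius_km):
    """Names of all points no farther than radius_km from (lat, lon), sorted."""
    return sorted(
        name for name, (plat, plon) in points.items()
        if haversine_km(lat, lon, plat, plon) <= radius_km
    )


# Hypothetical points of interest: name -> (latitude, longitude).
PLACES = {
    "Paris": (48.8566, 2.3522),
    "Versailles": (48.8049, 2.1204),
    "London": (51.5074, -0.1278),
}

NEARBY = within_radius(PLACES, 48.8566, 2.3522, radius_km=50)
```

The same distance function is what spatial UDFs in Hive or Pig typically wrap; the distributed tools listed above add indexing and partitioning so the scan does not touch every record.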
  84. 84. Deck.GL Copyright © William El Kaim 2018 84