
Big data architecture - Introduction

All you wanted to know about Big Data

Published in: Technology


  1. 1. (Big-)Data Architecture (Re-)Invented Part-1 William El Kaim Dec. 2016 – V3.3
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex http://www.eacodex.com/ Copyright © William El Kaim 2016 2
  3. 3. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 3
  4. 4. Taming the Data Deluge Source: DomoCopyright © William El Kaim 2016 4
  5. 5. Copyright © William El Kaim 2016 5
  6. 6. Taming the Data Deluge Copyright © William El Kaim 2016 6
  7. 7. Taming the Data Deluge Copyright © William El Kaim 2016 7
  8. 8. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 8
  9. 9. What is Big Data? • A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications • Because these challenges are technical in nature, they also arise in analytics at much lower volumes than what is traditionally considered Big Data • Big Data Analytics is the same as ‘small data’ analytics, only with the added challenges (and potential) of large datasets (~50M records or 50 GB, or more) • Challenges: • Data storage and management • De-centralized/multi-server architectures • Performance bottlenecks, poor responsiveness • Increasing hardware requirements Source: SiSense Copyright © William El Kaim 2016 9
  10. 10. Copyright © William El Kaim 2016 10
  11. 11. What is Big Data: the “Vs” to Nirvana Visualization Source: James Higginbotham Big Data: A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data: When the data cannot fit in Excel (the row limit used to be 65,536; it is now 1,048,576). Big Data: When it's cheaper to keep everything than spend the effort to decide what to throw away (David Brower @dbrower) Copyright © William El Kaim 2016 11
  12. 12. Six V to Nirvana Source: Bernard MarrCopyright © William El Kaim 2016 12
  13. 13. Six V to Nirvana Source: IBMCopyright © William El Kaim 2016 13
  14. 14. Six V to Nirvana Source: IBMCopyright © William El Kaim 2016 14
  15. 15. Six V to Nirvana Source: IBMCopyright © William El Kaim 2016 15
  16. 16. Six V to Nirvana Source: IBMCopyright © William El Kaim 2016 16
  17. 17. Six V to Nirvana Source: Bernard Marr Visualization Copyright © William El Kaim 2016 17
  18. 18. Big Data Challenges • Data access, cleansing and categorization • Data storage and management • De-centralized/multi-server architecture management • Performance bottlenecks, poor responsiveness, crashes • Hardware requirements • (Big) data management tools (cluster mgt, multi-region deployment, etc.) • Unstable software distributions in a rapidly evolving field • Missing skills Copyright © William El Kaim 2016 18
  19. 19. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 19
  20. 20. Big Data: Why Now ? Copyright © William El Kaim 2016 20
  21. 21. Datafication of our World Source: Bernard Marr Copyright © William El Kaim 2016 21
  22. 22. New Databases Source: Robin Purohit Copyright © William El Kaim 2016 22
  23. 23. New Databases Copyright © William El Kaim 2016 23
  24. 24. Data Science (R)evolution Source: Capgemini Copyright © William El Kaim 2016 24
  25. 25. Open APIs -> Platforms -> Bus. Ecosystems Copyright © William El Kaim 2016 25
  26. 26. All in one Ready To Use Solutions http://web.panoply.io/ Copyright © William El Kaim 2016 26
  27. 27. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 27
  28. 28. Hadoop Genesis • Scalability issue when running jobs processing terabytes of data • Just reading that amount of data on one computer could take dozens of days • Need lots of cheap computers • Fixes the speed problem • But leads to reliability problems • In large clusters, computers fail every day • Cluster size is not fixed • Need common infrastructure • Must be efficient and reliable Copyright © William El Kaim 2016 28
  29. 29. Hadoop Genesis Apache Nutch Doug Cutting “Map-reduce” 2004 “It is an important technique!” Extended Source: Xiaoxiao ShiCopyright © William El Kaim 2016 29
  30. 30. Hadoop Genesis “Hadoop came from the name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.” Doug Cutting, Hadoop project creator • Open Source Apache Project • Written in Java • Running on Commodity hardware and all major OS • Linux, Mac OS/X, Windows, and Solaris Copyright © William El Kaim 2016 30
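The map-reduce technique the genesis slides refer to can be illustrated with a stdlib-only Python sketch. This is a toy, single-machine simulation of the model (not Hadoop code): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group; because each phase only sees independent keys, Hadoop can spread the same logic across many cheap machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word of every document
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all pairs by key (Hadoop does this across the cluster)
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate the values collected for each key
    return {key: sum(values) for key, values in grouped}

docs = ["big data needs big clusters", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))  # e.g. counts["big"] == 3
```

The word-count example is the canonical MapReduce demonstration; any job whose reduce step is associative can be partitioned the same way.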
  31. 31. Enters Hadoop the Big Data Refinery • Hadoop is not replacing anything. • Hadoop has become another component in an organization's enterprise data platform. • Hadoop (the Big Data refinery) can ingest data from all types of different sources. • Hadoop then interacts and exchanges data flows with traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytic systems (data warehouses). Source: DBA Journey Blog Copyright © William El Kaim 2016 31
  32. 32. Hadoop Platform Source: Octo TechnologyCopyright © William El Kaim 2016 32
  33. 33. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 33
  34. 34. When to Use Hadoop? Two Main Use Cases: 1. Data Mgt. and Storage 2. Big Data Analytics and Use Copyright © William El Kaim 2016 34
  35. 35. When to Use Hadoop? Source: Octo Technology 1 2 Data Mgt and Storage Data Analytics and Use 1 2 Copyright © William El Kaim 2016 35
  36. 36. Hadoop for Data Mgt and Storage ETL Pre-processor ETL Pre-processor • Shift the pre-processing of ETL in staging data warehouse to Hadoop • Shifts high cost data warehousing to lower cost Hadoop clusters Source: Microsoft 1 Copyright © William El Kaim 2016 36
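The ETL pre-processing that the slide proposes shifting onto Hadoop can be sketched in miniature with the Python standard library. The records, table name and rejection rule below are hypothetical; the point is the shape of the work that gets offloaded: cleanse and type raw staging records, reject what cannot be parsed, and load only conformed rows into the (here in-memory) warehouse.

```python
import sqlite3

# Raw "staging" records as they might land from source systems (hypothetical data)
raw = [
    {"sale_id": "1", "amount": " 19.90 ", "country": "fr"},
    {"sale_id": "2", "amount": "bad",     "country": "US"},   # unparseable, rejected
    {"sale_id": "3", "amount": "5.00",    "country": "us"},
]

def transform(rows):
    # The "T" of ETL: cleanse types and normalize values before the load;
    # on Hadoop this rejection/conversion logic is what runs in the cluster
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # bad record never reaches the warehouse
        yield int(row["sale_id"]), amount, row["country"].strip().upper()

# Load only the cleansed rows into the warehouse
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(raw))
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

In the offload pattern, the expensive transform runs on cheap Hadoop nodes and only the final `INSERT`-ready rows reach the costly warehouse tier.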
  37. 37. Hadoop for Data Mgt and Storage Massive Storage Massive Storage • Offloading large volume of historical data into cold storage with Hadoop • Keep data warehouse for hot data to allow BI and analytics • When data from cold storage is needed, it can be moved back into the warehouse Source: Microsoft 1 Copyright © William El Kaim 2016 37
  38. 38. Hadoop for Data Analytics and Use Six V to Nirvana to Provide • Hindsight (what happened?) • Oversight (what is happening?) • Insight (why is it happening?) • Foresight (what will happen?) 2 Copyright © William El Kaim 2016 38
  39. 39. Hadoop for Data Analytics and Use From Hindsight to Insight to Foresight • Traditional analytics tools are not well suited to capturing the full value of big data. • The volume of data is too large for comprehensive analysis. • The range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data. • Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, simple averages and running SQL queries. • Online analytical processing is merely a systematized extension of these basic analytics that still relies on a human to direct activities and specify what should be calculated. 2 Copyright © William El Kaim 2016 39
  40. 40. Hadoop for Data Analytics and Use From Hindsight to Insight to Foresight 2 Copyright © William El Kaim 2016 40
  41. 41. Hadoop for Data Analytics and Use From Data Management To Data Driven Decisions • Machine Learning: Data Reliability For Big Data • In the big data world, bringing data together from multiple internal and external sources can be a challenge. A new approach moves from manual or rules-based matching to matching done via machine learning. • The initial phase of machine learning is to provide transparency into the actual rules that drive the merge; it is then up to the user to evaluate each discovered rule and persist it in the system. • Graph: Finding Relationships In The Data • Graphs help users understand and navigate many-to-many relationships across all data entities. • Cognitive Systems: Intelligent Recommendations • Building intelligent systems that guide users and provide intelligent recommendations, based on data and user behavior. 2 Source: Reltio Copyright © William El Kaim 2016 41
  42. 42. Hadoop for Data Analytics and Use From Data Management To Data Driven Decisions • Collaborative Curation: Clean & Current Data • Sharing data across all systems and functional groups helps realize the full value of data collected. Marketing, sales, services, and support should all leverage the same reliable, consolidated data. They should be able to collaborate and contribute to enriching the data. They should be able to vote on data quality or the business impact of any data entity. New data-driven applications must support this. • Data Monetization & DaaS: New Revenue Streams • Charter of a CDO is not only about data governance and data integration and management. Increasingly, companies are asking CDOs to turn this data into new revenue streams. With cloud-based data-as-a-service, companies can monetize their data and become data brokers. Businesses can now collaborate with each other to create common data resources and easily share or exchange data. 2 Source: ReltioCopyright © William El Kaim 2016 42
  43. 43. Hadoop for Data Analytics and Use Data Discovery Data Discovery • Keep data warehouse for operational BI and analytics • Allow data scientists to gain new discoveries on raw data (no format or structure) • Operationalize discoveries back into the warehouse Source: Microsoft 2 Copyright © William El Kaim 2016 43
  44. 44. Hadoop for Data Analytics and Use Skills Needed 2 Copyright © William El Kaim 2016 44
  45. 45. When Not to Use Hadoop • Real-Time Analytics (Solved in Hadoop V2) • Since Hadoop V1 cannot be used for real-time analytics, people explored and developed a new way to use the strength of Hadoop (HDFS) while making the processing real time. The industry-accepted approach is to store the Big Data in HDFS and mount Spark over it; using Spark, the processing can be done in real time. Apache Kudu is also a complementary solution to use. • To Replace Existing Infrastructure • All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop you often need to send the output to other database technologies for BI, decision support, reporting, etc. • Small Datasets • The Hadoop framework is not recommended for small structured datasets, as other tools available on the market can do this work quite easily and at a faster pace than Hadoop. For small data analytics, Hadoop can be costlier than other tools. Source: Edureka Copyright © William El Kaim 2016 45
  46. 46. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 46
  47. 47. How to Implement Hadoop? Typical Project Development Steps Dashboards, Reports, Visualization, … CRM, ERP Web, Mobile Point of sale Big Data Platform Business Transactions & Interactions Business Intelligence & Analytics Unstructured Data Log files DB data Exhaust Data Social Media Sensors, devices Classic Data Integration & ETL Capture Big Data Collect data from all sources structured &unstructured Process Transform, refine, aggregate, analyze, report Distribute Results Interoperate and share data with applications/analytics Feedback Use operational data w/in the big data platform 1 2 3 4 Source: HortonWorksCopyright © William El Kaim 2016 47
  48. 48. How to Implement Hadoop? Typical Three Steps Process Source: HortonWorks Collect Explore Enrich Copyright © William El Kaim 2016 48
  49. 49. How to Implement Hadoop? Typical Project Development Steps 1. Data Source Layer 3. Data Processing / Analysis Layer 2. Data Storage Layer 4. Data Output Layer LakeshoreRaw Data Data Lake Bus. or Usage oriented extractions Data Science Correlations, Analytics, Machine Learning Data Visualization & Bus. intelligence Internal Data External Data 1 2 3 3 4 Copyright © William El Kaim 2016 49
  50. 50. How to Implement Hadoop? Dataiku DSS Free Community Edition (mac, linux, docker, aws) http://www.dataiku.com/dss/Copyright © William El Kaim 2016 50
  51. 51. How to Implement Hadoop? Source: Cirrus ShakeriCopyright © William El Kaim 2016 51
  52. 52. How to Implement Hadoop? CRISP Methodology Source: The Modeling Agency and sv-europeCopyright © William El Kaim 2016 52
  53. 53. How to Implement Hadoop? McKinsey Seven Steps Approach Copyright © William El Kaim 2016 53Source: McKinsey
  54. 54. How to Implement Hadoop? IBM DataFirst Source: IBMCopyright © William El Kaim 2016 54
  55. 55. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 55
  56. 56. EDW Current Limitations • Organizations are realizing that traditional EDW technologies can’t meet their new business needs • Including leveraging streaming and social data from the Web or from connected devices on the Internet of things (IoT) • One major challenge with traditional EDWs is their schema-on-write architecture • Enterprises must design the data model and articulate the analytic frameworks before loading any data. • In other words, they need to know ahead of time how they plan to use that data. Source: ZaloniCopyright © William El Kaim 2016 56
  57. 57. Hadoop to the Rescue • Extract and place data into a Hadoop-based repository without first transforming the data the way you would for a traditional EDW. • All frameworks are created in an ad hoc manner, with little or no prep work required. • Labor-intensive processes of cleaning up data and developing schemas are deferred until a clear business need is identified. • Hadoop can be 10 to 100 times less expensive to deploy than a traditional data warehouse. Source: Zaloni Copyright © William El Kaim 2016 57
  58. 58. Enters the Data Lake • A data lake is: • Typically built using Hadoop • Able to store structured or unstructured data • Able to benefit from a variety of storage and processing tools to extract value quickly • Requiring little or no processing to adapt its structure to an enterprise schema • A central location in which to store all your data in its native form, regardless of its source or format Source: Zaloni Copyright © William El Kaim 2016 58
  59. 59. Who is Using Data Lakes? Source: PWCCopyright © William El Kaim 2016 59
  60. 60. Data Flow In The Data Lake Source: PWCCopyright © William El Kaim 2016 60
  61. 61. Schema On Read or Schema On Write? • The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” • But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. • And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface. • At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write. Source: AdaptiveSystemsCopyright © William El Kaim 2016 61
  62. 62. Schema On Write … • Before any data is written in the database, the structure of that data is strictly defined, and that metadata is stored and tracked. • Irrelevant data is discarded; data types, lengths and positions are all delineated. • The schema (the columns, rows, tables and relationships) is defined first for the specific purpose that the database will serve. • Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed and made to fit in that structure before it can be stored, in a process generally referred to as ETL (Extract Transform Load). • That is why it is called “schema on write”: the data structure is already defined when the data is written and stored. • For a very long time, it was believed that this was the only right way to manage data. Source: AdaptiveSystems Copyright © William El Kaim 2016 77
  63. 63. Schema On Write … • Benefits: Quality and Query Speed. • Because the data structure is defined ahead of time, when you query, you know exactly where your data is. • The structure is generally optimized for the fastest possible return of data for the types of questions the data store was designed to answer (write very simple SQL and get back very fast answers). • The answers received from querying data are sharply defined, precise and trustworthy, with little margin for error. • Drawbacks: Data Alteration & Query Limitations • Data has been altered and structured specifically to serve a specific purpose. Chances are high that, if another purpose is found for that data, the data store will not suit it well. • ETL processes and validation rules are then needed to clean, de-dupe, check and transform that data to match pre-defined format. Those processes take time to build, time to execute, and time to alter if you need to change it to suit a different purpose. Source: AdaptiveSystemsCopyright © William El Kaim 2016 63
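Schema on write can be made concrete with a small stdlib sketch using sqlite3 (the table and validation rule are illustrative, not from the deck): the structure is declared before any row is stored, non-conforming records are rejected at load time, and the query side is then trivial and precise, exactly the quality/speed trade described above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Schema on write: the structure is declared before a single row is stored
db.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        total    REAL NOT NULL
    )
""")

def load(record):
    # ETL-style validation: records that don't fit the schema are rejected up front
    if not isinstance(record.get("total"), (int, float)):
        raise ValueError("rejected by schema: total must be numeric")
    db.execute("INSERT INTO orders (customer, total) VALUES (?, ?)",
               (record["customer"], record["total"]))

load({"customer": "acme", "total": 120.0})
try:
    load({"customer": "globex", "total": "n/a"})   # fails validation, never stored
except ValueError:
    pass

# Because the structure is known in advance, the query is simple and the answer precise
row = db.execute("SELECT COUNT(*), SUM(total) FROM orders").fetchone()
```

The drawback noted on the slide is visible too: repurposing this data for a question the `orders` schema was not designed for means new ETL, not just a new query.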
  64. 64. Schema On Read … • Revolutionary concept: “You don’t have to know what you’re going to do with your data before you store it.” • Data of many types, sizes, shapes and structures can all be thrown into the Hadoop Distributed File System, and other Hadoop data storage systems. • While some metadata needs to be stored, to know what’s in there, no need yet to know how it will be structured! • Therefore, the data is stored in its original granular form, with nothing thrown away • In fact, no structural information is defined at all when the data is stored. • So “schema on read” implies that the schema is defined at the time the data is read and used, not at the time that it is written and stored. • When someone is ready to use that data, then, at that time, they define what pieces are essential to their purpose, where to find those pieces of information that matter for that purpose, and which pieces of the data set to ignore. Source: AdaptiveSystemsCopyright © William El Kaim 2016 64
  65. 65. Schema On Read … • Benefits: Flexibility and Query Power • Because data is stored in its original form, nothing is discarded or altered for a specific purpose. • Different types of data generated by different sources can be stored in the same place. This allows you to query multiple data stores and types at once. • This is what makes the heart of the Hadoop data lake concept, which puts all available data sets in their original form in a single location, such a potent one. • Drawbacks: Inaccuracies and Slow Query Speed • Since the data is not subjected to rigorous ETL and data cleansing processes, nor does it pass through any validation, it may be riddled with missing or invalid values, duplicates and a bunch of other problems that may lead to inaccurate or incomplete query results. • In addition, since the structure must be defined when the data is queried, the SQL queries tend to be very complex. They take time to write, and even more time to execute. Source: AdaptiveSystems Copyright © William El Kaim 2016 65
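Schema on read can be sketched with nothing but the json module (the events and the "revenue per user" question are hypothetical): heterogeneous raw records are kept exactly as they arrived, and a structure is imposed only at query time, for one purpose, ignoring every field that purpose does not need.

```python
import json

# The "lake": raw events of different shapes, stored exactly as they arrived
lake = [
    '{"type": "click", "user": "u1", "page": "/home"}',
    '{"type": "purchase", "user": "u2", "amount": 30.0}',
    '{"type": "purchase", "user": "u1", "amount": 12.5, "coupon": "XMAS"}',
]

def revenue_per_user(raw_events):
    # Schema on read: the structure is imposed only now, for this one question;
    # fields that don't matter for it (page, coupon, ...) are simply ignored
    totals = {}
    for line in raw_events:
        event = json.loads(line)
        if event.get("type") == "purchase":        # projection chosen at read time
            user = event.get("user", "unknown")
            totals[user] = totals.get(user, 0.0) + event.get("amount", 0.0)
    return totals

totals = revenue_per_user(lake)
```

A different question tomorrow (say, clicks per page) needs no reload and no ALTER TABLE, only a different read-time projection over the same raw records; the cost, as the slide notes, is that every reader must handle missing and invalid fields itself.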
  66. 66. Some Examples in the Hadoop Ecosystem • Drill is probably the best example of a pure schema-on-read SQL engine in the Hadoop ecosystem today. • It gives you the power to query a broad set of data, from a wide variety of different data stores, including hierarchical data such as JSON and XML. • Hive is the original schema-on-read technology, but is, in fact, a marvelous hybrid of the two approaches. In order for Hive to gain the advantages of a schema-on-write data store, the ORC file format was created. This is a pre-structured format optimized for Hive queries. By combining strategies, Hive has gained many of the advantages of both camps. • Spark SQL is entirely a schema-on-write technology, leveraging a data processing engine that can do high-speed ETL processes entirely in memory. • Other SQL-on-write options: Impala with Parquet, Actian Vortex with Vector in Hadoop, IBM Big SQL with BigInsights, HAWQ with Greenplum, etc. Source: AdaptiveSystems Copyright © William El Kaim 2016 66
  67. 67. Schema On Read or Schema On Write? • Schema on read options tend to be a better choice: • for exploration, for “unknown unknowns,” when you don’t know what kind of questions you might want to ask, or the kinds of questions might change over time. • when you don’t have a strong need for immediate responses. They’re ideal for data exploration projects, and looking for new insights with no specific goal in mind. • Schema on write options tend to be very efficient for “known unknowns.” • When you know what questions you’re going to need to ask, especially if you will need the answers fast, schema on write is the only sensible way to go. • This strategy works best for old school BI types of scenarios on new school big data sets. Source: AdaptiveSystemsCopyright © William El Kaim 2016 67
  68. 68. Data Lake vs. Enterprise Data Warehouse Source: Martin FowlerCopyright © William El Kaim 2016 68
  69. 69. Data Lake vs. Enterprise Data Warehouse Source: ZaloniCopyright © William El Kaim 2016 69
  70. 70. Source: PWCCopyright © William El Kaim 2016 70
  71. 71. The Lakeshore Concept • The data lake shouldn't be accessed directly very much. • Because the data is raw, you need a lot of skill to make any sense of it. • Lakeshore • Create a number of data marts each of which has a specific model for a single bounded context. • A larger number of downstream users can then treat these lakeshore marts as an authoritative source for that context. Source: Martin FowlerCopyright © William El Kaim 2016 71
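The lakeshore mart idea above can be sketched in a few lines of stdlib Python (records and the "sales by country" context are invented for illustration): raw lake records stay untouched, and a small, pre-modeled extract for one bounded context is what downstream users actually query.

```python
# Raw lake records (hypothetical): every event, every shape, nothing discarded
lake = [
    {"kind": "order", "country": "FR", "amount": 10.0},
    {"kind": "click", "page": "/pricing"},
    {"kind": "order", "country": "FR", "amount": 25.0},
    {"kind": "order", "country": "US", "amount": 40.0},
]

def build_sales_mart(raw):
    # A lakeshore mart: one bounded context ("sales by country"), modeled once
    # by someone with the skill to interpret the raw data, so downstream users
    # never have to touch the lake directly
    mart = {}
    for record in raw:
        if record.get("kind") == "order":
            country = record["country"]
            mart[country] = mart.get(country, 0.0) + record["amount"]
    return mart

sales_mart = build_sales_mart(lake)
```

Each bounded context gets its own mart built the same way; the lake remains the single raw source of truth, and the marts are the authoritative, consumable views.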
  72. 72. Data Lake vs. EDW Source: PlatforaCopyright © William El Kaim 2016 72
  73. 73. Data Lake vs. EDW Source: HortonWorksCopyright © William El Kaim 2016 73
  74. 74. Data Lake Tools: Bigstep Source: Bigstep Copyright © William El Kaim 2016 74
  75. 75. Data Lake Tools Informatica Data Lake Management • Informatica Data Lake Management now includes the following innovations brought together with a single metadata driven platform: • Informatica Enterprise Information Catalog enables business users to quickly discover and understand all enterprise data using intelligent self-service tools powered by machine learning and AI. • Informatica Intelligent Streaming helps organizations more efficiently capture and process big data (machine, social media feeds, website click stream, etc.) and real-time events to gain timely insight for business initiatives, such as IoT, marketing and fraud detection. • Informatica Blaze dramatically increases Hadoop processing performance with intelligent data pipelining, job partitioning, job recovery and scaling powered by a unique cluster aware data processing engine integrated with YARN. • Informatica Secure@Source® reduces the risk of proliferation of sensitive and private data with enhanced monitoring and alerting of sensitive data based on both risk conditions and user access/activity. • Informatica Big Data Management™ is easily and rapidly deployed in the cloud with the click of a button through the Microsoft Azure Marketplace to integrate, govern, and secure big data at scale in Hadoop. • Informatica Cloud Microsoft Azure Data Lake Store Connector helps customers achieve faster business insights by providing self-service connectivity to integrate and synchronize diverse data sets into Microsoft Azure Data Lake Store. Source: InformaticaCopyright © William El Kaim 2016 75
  76. 76. Data Lake Tools: Microsoft Azure Source: MicrosoftCopyright © William El Kaim 2016 76
  77. 77. Data Lake Tools: Koverse Source: KoverseCopyright © William El Kaim 2016 77
  78. 78. Data Lake Tools: Platfora Source: PlatforaCopyright © William El Kaim 2016 78
  79. 79. Big Data: Discovery • Waterline Data Inventory: Automated data discovery platform. Foundation of self service data preparation, analytics, and data governance in Hadoop. Copyright © William El Kaim 2016 79
  80. 80. Big Data: Discovery https://cloud.oracle.com/bigdatadiscoveryCopyright © William El Kaim 2016 80
  81. 81. Data Lake Tool: Zaloni Source: ZaloniCopyright © William El Kaim 2016 81
  82. 82. Data Lake Tool: Zdata Source: ZdataCopyright © William El Kaim 2016 82
  83. 83. Data Lake Resources • Introduction • PWC: Data lakes and the promise of unsiloed data • Zaloni: Resources • Tools • Bigstep • Informatica Data Lake Mgt & Intelligent Data Lake • Microsoft Azure Data Lake and Azure Data Catalog • Koverse • Oracle BigData Discovery • Platfora • Podium Data • Waterline Data Inventory • Zaloni Bedrock • Zdata Data Lake Copyright © William El Kaim 2016 83
  84. 84. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 84
  85. 85. What is BI on Hadoop? Three Options Source: DremioCopyright © William El Kaim 2016 85
  86. 86. BI on Hadoop Three Options: ETL to Data Warehouse • Pros • Relational databases and their BI integrations are very mature • Use your favorite tools • Tableau, Excel, R, … • Cons • Traditional ETL tools don’t work well with modern data • Changing schemas, complex or semi-structured data, … • Hand-coded scripts are a common substitute • Data freshness • How often do you replicate/synchronize? • Data resolution • Can’t store all the raw data in the RDBMS (due to scalability and/or cost) • Need to sample, aggregate or time-constrain the data Source: DremioCopyright © William El Kaim 2016 86
  87. 87. What is BI on Hadoop? Three Options: Monolithic Tools • A single piece of software on top of Big Data • Performs both data visualization (BI) and execution • Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with • Examples: Indexima, Jethro, Looker, Arcadia, Atscale, Datameer, Platfora, Tamr, ZoomData Source: Modified from Platfora Copyright © William El Kaim 2016 87
  88. 88. What is BI on Hadoop? Three Options: Monolithic Tools • Pros • Only one tool to learn and operate • Easier than building and maintaining an ETL-to-RDBMS pipeline • Integrated data preparation in some solutions • Cons • Can't analyze the raw data • Rely on aggregation or sampling before primary analysis • Can't use your existing BI or analytics tools (Tableau, Qlik, R, …) • Can't run arbitrary SQL queries Source: Dremio Copyright © William El Kaim 2016 88
  89. 89. What is BI on Hadoop? Three Options: SQL-on-Hadoop • The combination of a familiar interface (SQL) along with a modern computing architecture (Hadoop) enables people to manipulate and query data in new and powerful ways. • There’s no shortage of SQL on Hadoop offerings, and each Hadoop distributor seems to have its preferred flavor. • Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge. Source: Datanami & Cloudera & DremioCopyright © William El Kaim 2016 89
  90. 90. What is BI on Hadoop? Three Options: SQL-on-Hadoop • SQL on Hadoop tools could be categorized as • Interactive or Native SQL • Batch & Data-Science SQL • OLAP Cubes (In-memory) on Hadoop Copyright © William El Kaim 2016 90
  91. 91. What is BI on Hadoop? SQL-on-Hadoop: Native SQL • When to use it? • These tools excel at executing ad-hoc SQL queries and performing self-service data exploration, often used directly by data analysts, or at executing the machine-generated SQL code from BI tools like Qlik and Tableau. • Latency is usually measured in seconds to minutes. • One of the key differentiators among the interactive SQL-on-Hadoop tools is how they were built. • Some of the tools, such as Impala and Drill, were developed from the beginning to run on Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran on vendors' massively parallel processing (MPP) databases. Source: Datanami Copyright © William El Kaim 2016 91
  92. 92. What is BI on Hadoop? SQL-on-Hadoop: Native SQL • Pros • Highest performance for Big Data workloads • Connect to Hadoop and also NoSQL systems • Make Hadoop “look like a database” • Cons • Queries may still be too slow for interactive analysis on many TB/PB • Can't defeat physics Source: Datanami & Dremio • Interactive • In 2012, Cloudera rolled out the first release of Apache Impala • MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google's Dremel • Presto (created by Facebook, now backed by Teradata) • VectorH (backed by Actian) • Apache Hawq (backed by Pivotal) • Apache Phoenix • BigSQL (backed by IBM) • Big Data SQL (backed by Oracle) • Vertica SQL on Hadoop (backed by Hewlett-Packard) Copyright © William El Kaim 2016 92
  93. 93. What is BI on Hadoop? SQL-on-Hadoop: Batch & Data Science SQL • When to use it? • Most often used for running big and complex jobs, including ETL and production data “pipelines,” against massive data sets. • Apache Hive is the best example of this tool category. The software essentially recreates a relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache Tez) as an intermediate processing layer. • Tools • Apache Hive, Apache Tez, Apache Spark SQL • Pros • Potentially simpler deployment (no daemons) • New YARN job (MapReduce/Spark) for each query • Check-pointing support enables very long-running queries • Days to weeks (ETL work) • Works well in tandem with machine learning (Spark) • Cons • Latency prohibitive for interactive analytics • Tableau, Qlik Sense, … • Slower than native SQL engines Source: Datanami & Dremio Copyright © William El Kaim 2016 93
  94. 94. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop • When to use it? • Data scientists doing self-service data exploration who need performance (milliseconds to seconds). • Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with competition. • These tools often require an in-memory computing architecture. • Tools • Apache Kylin, Apache Lens, AtScale, Druid, Kyvos Insights • In-memory: Spark SQL, Apache Flink, Kognitio On Hadoop • Other Options To Investigate: • SnappyData (strong SQL, in-memory speed, and GemFire XD history) • Apache HAWQ (strong SQL support and Greenplum history) • Splice Machine (now open source) • Hive LLAP is moving into OLAP; SQL:2011 support is growing and so is performance. • Apache Phoenix may be able to do basic OLAP with some help from the Saiku OLAP BI tool. • Most tools use Apache Calcite. Source: Dremio Copyright © William El Kaim 2016 94
  95. 95. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop • Pros • Fast queries on pre-aggregated data • Can use SQL and MDX tools • Cons • Explicit cube definition/modeling phase • Not “self-service” • Frequent updates required due to dependency on business logic • Aggregate creation and maintenance can be long (and large) • User connects to and interacts with the cube • Can’t interact with the raw data Copyright © William El Kaim 2016 95
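The pros and cons above hinge on pre-aggregation: the cube engine computes measures for combinations of dimensions ahead of time, so queries hit the aggregates rather than the raw facts. A minimal pure-Python sketch of that idea (the toy fact rows and the `build_cube` helper are illustrative, not any engine's actual API; in Kylin this step runs as a Hadoop/Spark job):

```python
from collections import defaultdict

# Raw fact rows: (region, product, year, sales) -- hypothetical toy data.
facts = [
    ("EU", "bike", 2016, 100),
    ("EU", "bike", 2016, 50),
    ("EU", "car",  2016, 200),
    ("US", "bike", 2016, 80),
]

def build_cube(rows, dims):
    """Pre-aggregate SUM(sales) for one combination of dimension columns."""
    cube = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in dims)
        cube[key] += row[3]  # measure: sales
    return dict(cube)

# Cuboid on (region, product): queries on these dimensions answer from
# the pre-aggregated result instead of scanning the raw facts.
cube = build_cube(facts, dims=(0, 1))
print(cube[("EU", "bike")])  # 150
```

The cons listed above follow directly: the cuboid must be rebuilt when facts change, and a query on a dimension that was not pre-aggregated cannot be served from the cube.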
  96. 96. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop • Apache Kylin lets you query massive data sets at sub-second latency in 3 steps. 1. Identify a Star Schema on Hadoop. 2. Build a Cube on Hadoop. 3. Query data with ANSI-SQL and get results via ODBC, JDBC or RESTful API. Copyright © William El Kaim 2016 96 Source: Apache Kylin
  97. 97. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop Copyright © William El Kaim 2016 97 http://kognitio.com/on-hadoop/
  98. 98. What is BI on Hadoop? SQL-on-Hadoop: Synthesis • Pros • Continue using your favorite BI tools and SQL-based clients • Tableau, Qlik, Power BI, Excel, R, SAS, … • Technical analysts can write custom SQL queries • Cons • Another layer in your data stack • May need to pre-aggregate the data depending on your scale • Need a separate data preparation tool (or custom scripts) Source: DremioCopyright © William El Kaim 2016 98
  99. 99. What is BI on Hadoop? SQL-on-Hadoop: Synthesis Source: DremioCopyright © William El Kaim 2016 99
  100. 100. What is BI on Hadoop? SQL-on-Hadoop: Encoding Formats • The different encoding standards result in different block sizes, and that can impact performance. • ORC files compress smaller than Parquet files, which can be a decisive factor. • Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while Hive and others support optimized row column (ORC) files, sequence files, or plain text. • Semi-structured data formats like JSON are gaining traction • Previously, Hadoop users used MapReduce to pound unstructured data into a more structured or relational format. • Drill opened up SQL-based access directly to semi-structured data, such as JSON, which is a common format found on NoSQL and SQL databases. Cloudera also recently added support for JSON in Impala. Source: Datanami Copyright © William El Kaim 2016 100
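Engines like Drill can query semi-structured JSON directly because they flatten nested objects into column-like paths at read time. A hedged pure-Python sketch of that flattening (the records and the `flatten` helper are invented for illustration, not Drill's internals):

```python
import json

# Semi-structured records: fields vary per record, as is common in
# NoSQL exports -- hypothetical sample data.
raw = '''
[{"user": {"name": "ana", "city": "Paris"}, "clicks": 3},
 {"user": {"name": "bob"}, "clicks": 7, "referrer": "ad"}]
'''

def flatten(record, prefix=""):
    """Turn nested JSON objects into flat, dotted column names (user.name, ...)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

rows = [flatten(r) for r in json.loads(raw)]
print(rows[0]["user.name"])      # ana
print(rows[1].get("user.city"))  # None -- missing fields behave like SQL NULLs
```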
  101. 101. What is BI on Hadoop? SQL-on-Hadoop: Decision Tree Source: DremioCopyright © William El Kaim 2016 101
  102. 102. What is BI on Hadoop? Example: AtScale Source: AtScaleCopyright © William El Kaim 2016 102
  103. 103. Plan • Taming The Data Deluge • What is Big Data? • Why Now? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Big Data Analytics? • Big Data Technologies • Hadoop Distributions & Tools • Hadoop Architecture Examples Copyright © William El Kaim 2016 103
  104. 104. What is Big Data Analytics? From Hindsight to Insight to Foresight Copyright © William El Kaim 2016 104
  105. 105. What is Big Data Analytics? Data Science • Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms Copyright © William El Kaim 2016 105
  106. 106. What is Big Data Analytics? Data Science Maturity Model Copyright © William El Kaim 2016 106 Source: Domino
  107. 107. What is Big Data Analytics? Data Science Platform: Domino Copyright © William El Kaim 2016 107 https://www.dominodatalab.com/
  108. 108. What is Big Data Analytics? Predictive Analytics Source: wikipedia Source: Forrester Copyright © William El Kaim 2016 108
  109. 109. What is Big Data Analytics? Finding the Best Approach Source: kdnuggetsCopyright © William El Kaim 2016 109
  110. 110. What is Big Data Analytics? What is Machine Learning? • Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. • Such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following strictly static program instructions. • Machine learning is ideal for exploiting the opportunities hidden in big data. Source: Rubén Casado Tejedor Copyright © William El Kaim 2016 110
  111. 111. What is Big Data Analytics? Machine Learning Use Cases • Programming computers to perform an action using example data or past experience • learn from and make predictions on data • It is used when: • Human expertise does not exist (e.g. Navigating on Mars) • Humans are unable to explain their expertise (e.g. Speech recognition) • Solution changes in time (e.g. Routing on a computer network) • Solution needs to be adapted to particular cases (e.g. User biometrics) Source: Rubén Casado Tejedor Copyright © William El Kaim 2016 111
  112. 112. What is Big Data Analytics? Machine Learning Terminology • Observations • Items or entities used for learning or evaluation (e.g., emails) • Features • Attributes (typically numeric) used to represent an observation (e.g. length, date, presence of keywords) • Labels – Values / categories assigned to observations (e.g., spam, not-spam) • Training and Test Data – Observations used to train and evaluate a learning algorithm (e.g., a set of emails along with their labels) – Training data is given to the algorithm for training while Test data is withheld at train time Source: Rubén Casado Tejedor Copyright © William El Kaim 2016 112
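The terminology above can be made concrete in a few lines of Python; the email data and the split helper here are hypothetical toy illustrations, not a real pipeline:

```python
import random

# Observations: emails, each represented by numeric features and a label.
emails = [
    {"features": (120, 1), "label": "spam"},      # (length, has_keyword)
    {"features": (300, 0), "label": "not-spam"},
    {"features": (80, 1),  "label": "spam"},
    {"features": (250, 0), "label": "not-spam"},
]

def train_test_split(data, test_fraction=0.25, seed=42):
    """Withhold a fraction of the observations as test data."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Training data is given to the learning algorithm; test data is withheld
# until evaluation time.
train, test = train_test_split(emails)
print(len(train), len(test))  # 3 1
```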
  113. 113. What is Big Data Analytics? Machine Learning Types • Supervised Learning: Learning from labelled observations • Data is tagged • Tagging may require the help of an expert to prepare the training set. • Expertise is needed before machine learning. • The challenge is about the generalization of the model • Algorithms: Classification - Regression / Prediction - Recommendation • Unsupervised Learning: Learning algorithm must find latent structure from features alone. • Output values are not known (i.e., the tags and their nature) • Some of the attributes might not be homogeneous across all the samples • Expertise is needed after machine learning, to interpret the results and name the discovered categories • The challenge is about understanding the output classification • Algorithms: generally group inputs by similarities (creating clusters) • Clustering - Dimensionality Reduction - Anomaly detection Copyright © William El Kaim 2016 113
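To illustrate the unsupervised case above, here is a deliberately tiny sketch: grouping one-dimensional values by proximity, with no labels involved. The `cluster_1d` helper and its gap threshold are invented for illustration; real algorithms such as k-means apply the same idea in higher dimensions, and naming the resulting groups is still left to a human expert, as noted above:

```python
def cluster_1d(values, gap=5.0):
    """Greedy 1-D clustering: start a new cluster whenever the gap to the
    previous (sorted) value exceeds the threshold. No labels are used."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # too far: new cluster
    return clusters

points = [1.0, 2.5, 3.0, 40.0, 42.0, 90.0]
print(cluster_1d(points))  # [[1.0, 2.5, 3.0], [40.0, 42.0], [90.0]]
```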
  114. 114. What is Big Data Analytics? Machine Learning Types Source: Rubén Casado Tejedor Source: Louis Dorard The two phases of machine learning: • TRAIN a model • PREDICT with a model Copyright © William El Kaim 2016 114
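The two phases can be shown end to end with the simplest possible model, a one-variable linear regression fitted by ordinary least squares (toy data; `train` and `predict` are illustrative names mirroring the slide, not a library API):

```python
def train(xs, ys):
    """TRAIN phase: fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx            # the "model" is just (slope, intercept)

def predict(model, x):
    """PREDICT phase: apply the trained model to a new input."""
    a, b = model
    return a * x + b

model = train([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear toy data
print(predict(model, 10))  # 20.0
```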
  115. 115. What is Big Data Analytics? Machine Learning Example: Scikit-learn Algorithms Cheat-sheet Source: scikit-learn Copyright © William El Kaim 2016 115
  116. 116. What is Big Data Analytics? Machine Learning Example: Wendelin http://www.wendelin.io/ Copyright © William El Kaim 2016 116
  117. 117. What is Big Data Analytics? Machine Learning Example: Microsoft Azure ML Cloud Source: Microsoft Copyright © William El Kaim 2016 117
  118. 118. Big Data: Azure Machine Learning Source: MicrosoftCopyright © William El Kaim 2016 118
  119. 119. What is Big Data Analytics? Approaches to Big Data Analytics: Sampling • When to Use it? • Only data exploration / data understanding • Early prototyping on prepared and clean data • Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and binary prediction target) • When NOT to use it? • Large number of columns in the data • Need to blend large data sets (e.g. large-scale joins) • Complex Machine Learning models • Looking for rare events • Pros • Simple and easy to start with • Usually works well for data exploration and early prototyping • Some ML models would not benefit from more data anyway • Cons • Many ML models would benefit from more data • Cannot be used when large scale data preparation is needed • Hadoop is used as a data repository only • Key Process • Data Movement: Pulls sample data from HDFS/Hive/Impala • Data Processing: In the analytics tool Source: RapidMinerCopyright © William El Kaim 2016 119
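The sampling approach itself is a one-liner once the data has been pulled out of HDFS/Hive/Impala. The sketch below fakes the "large" dataset with a range (a simplification; real rows would come from the cluster) and draws a reproducible 1% sample:

```python
import random

# Stand-in for a large dataset; in practice these rows would be pulled
# from HDFS/Hive/Impala and only the sample moved into the analytics tool.
population = list(range(1_000_000))

rng = random.Random(7)                     # fixed seed: reproducible exploration
sample = rng.sample(population, k=10_000)  # 1% sample, without replacement

print(len(sample))                       # 10000
print(len(set(sample)) == len(sample))   # True -- no duplicates
```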
  120. 120. What is Big Data Analytics? Approaches to Big Data Analytics: Grid Computing • When to Use it? • Task can be performed on smaller, independent data subsets • Compute-intensive data processing • When NOT to use it? • Data-intensive data processing • Complex Machine Learning models • Lots of interdependencies between data subsets • Pros • Hadoop is used for parallel processing in addition to using as a data source • Cons • Only works if data subsets can be processed independently • Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations • Key Process • Data Movement: Only results are moved, data remains in Hadoop • Data Processing: Custom single-node application running on multiple Hadoop nodes Source: RapidMinerCopyright © William El Kaim 2016 120
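The key assumption of the grid approach, that subsets can be processed independently and only results are moved, can be sketched with a word count: each chunk is counted on its own "node" (threads stand in for Hadoop nodes here, and the data and helpers are toy illustrations), and only the partial counts are merged:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical documents, split into independent subsets.
data = ["big data", "data lake", "big big data"]

def process_chunk(chunk):
    """Count words in one subset; needs no other subset's data."""
    counts = {}
    for doc in chunk:
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(results):
    """Combine the per-subset results; only these move off the 'nodes'."""
    total = {}
    for partial in results:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

chunks = [data[:2], data[2:]]  # independent subsets
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = merge(pool.map(process_chunk, chunks))
print(totals["big"], totals["data"])  # 3 3
```

This only works because the chunks share no state, which is exactly the "When NOT to use it" condition on the slide: interdependent subsets break the approach.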
  121. 121. What is Big Data Analytics? Approaches to Big Data Analytics: Native Distributed Algorithms • When to Use it? • Complex Machine Learning models needed • Lots of interdependencies inside the data (e.g. graph analytics) • Need to blend and cleanse large data sets (e.g. large-scale joins) • When NOT to use it? • Data is not that large • Sample would reveal all interesting patterns • Pros • Holistic view of all data and patterns • Highly scalable distributed processing optimized for Hadoop • Cons • Limited set of algorithms available, very hard to develop new algorithms • Key Process • Data Movement: Only results are moved, data remains in Hadoop • Data Processing: Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc. Source: RapidMiner Copyright © William El Kaim 2016 121
  122. 122. What is Big Data Analytics? Approaches to Big Data Analytics: RapidMiner Example Source: RapidMinerCopyright © William El Kaim 2016 122
  123. 123. Twitter http://www.twitter.com/welkaim SlideShare http://www.slideshare.net/welkaim EA Digital Codex http://www.eacodex.com/ Linkedin http://fr.linkedin.com/in/williamelkaim Copyright © William El Kaim 2016 123
