  1. 1. Introduction to Big Data William El Kaim Oct. 2016 – V3.0
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex http://www.eacodex.com/Copyright © William El Kaim 2016 2
  3. 3. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 3
  4. 4. Taming the Data Deluge Copyright © William El Kaim 2016 4Source: Domo
  5. 5. Copyright © William El Kaim 2016 5
  6. 6. Taming the Data Deluge Copyright © William El Kaim 2016 6
  7. 7. Taming the Data Deluge Copyright © William El Kaim 2016 7
  8. 8. At The Same Time … Copyright © William El Kaim 2016 8
  9. 9. Data Science (R)evolution Copyright © William El Kaim 2016 9 Source: Capgemini
  10. 10. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 10
  11. 11. What is Big Data? • A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications • Due to its technical nature, the same challenges arise in Analytics at much lower volumes than what is traditionally considered Big Data. • Big Data Analytics is: • The same as ‘Small Data’ Analytics, only with the added challenges (and potential) of large datasets (~50M records or 50GB size, or more) • Challenges : • Data storage and management • De-centralized/multi-server architectures • Performance bottlenecks, poor responsiveness • Increasing hardware requirements Copyright © William El Kaim 2015 11Source: SiSense
  12. 12. 12Copyright © William El Kaim 2015
  13. 13. What is Big Data: the “Vs” to Nirvana Copyright © William El Kaim 2016 13 Visualization Source: James Higginbotham Big Data: A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications Big Data: When the data could not fit in Excel. Used to be 65,536 rows; now 1,048,576. Big Data: When it's cheaper to keep everything than to spend the effort deciding what to throw away (David Brower @dbrower)
  14. 14. Six V to Nirvana Copyright © William El Kaim 2016 14Source: Bernard Marr
  15. 15. Six V to Nirvana Copyright © William El Kaim 2016 15Source: IBM
  16. 16. Six V to Nirvana Copyright © William El Kaim 2016 16Source: IBM
  17. 17. Six V to Nirvana Copyright © William El Kaim 2016 17Source: IBM
  18. 18. Six V to Nirvana Copyright © William El Kaim 2016 18Source: IBM
  19. 19. Six V to Nirvana Copyright © William El Kaim 2016 19Source: Bernard Marr Visualization
  20. 20. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 20
  21. 21. Hadoop Genesis • Scalability issue when running jobs processing terabytes of data • Could take dozens of days just to read that amount of data on one computer • Need lots of cheap computers • To fix the speed problem • But this leads to reliability problems • In large clusters, computers fail every day • Cluster size is not fixed • Need common infrastructure • Must be efficient and reliable Copyright © William El Kaim 2016 21
  22. 22. Hadoop Genesis Copyright © William El Kaim 2016 22 Apache Nutch Doug Cutting “Map-reduce” 2004 “It is an important technique!” Extended Source: Xiaoxiao Shi
  23. 23. Hadoop Genesis Copyright © William El Kaim 2016 23 “Hadoop came from the name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.” Doug Cutting, Hadoop project creator • Open Source Apache Project • Written in Java • Running on Commodity hardware and all major OS • Linux, Mac OS/X, Windows, and Solaris
  24. 24. Enter Hadoop, the Big Data Refinery • Hadoop is not replacing anything. • Hadoop has become another component in an organization's enterprise data platform. • Hadoop (Big Data Refinery) can ingest data from all types of different sources. • Hadoop then interacts and has data flows with traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytic systems (data warehouses). Source: DBA Journey Blog 24Copyright © William El Kaim 2015
  25. 25. Hadoop Platform Copyright © William El Kaim 2016 25Source: Octo Technology
  26. 26. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 26
  27. 27. When to Use Hadoop? Two Main Use Cases 2 1 Data Mgt. And Storage Big Data Analytics and Use Copyright © William El Kaim 2016 27
  28. 28. When to Use Hadoop? Copyright © William El Kaim 2016 28Source: Octo Technology 1 2 Data Mgt and Storage Data Analytics and Use 1 2
  29. 29. Hadoop for Data Mgt and Storage ETL Pre-processor Copyright © William El Kaim 2016 29 ETL Pre-processor • Shift the ETL pre-processing done in the data warehouse staging area to Hadoop • Shifts high-cost data warehousing to lower-cost Hadoop clusters Source: Microsoft 1
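A minimal sketch of what the “ETL pre-processor” pattern can look like in practice, assuming PySpark; the input path, column names and output location are illustrative, not from the slides:

```python
# Hypothetical PySpark job: pre-process raw files on the Hadoop cluster
# before loading the smaller, cleaned result into the data warehouse.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-preprocessor").getOrCreate()

raw = (spark.read
       .option("header", True)
       .csv("hdfs:///landing/clickstream/2016-10-*.csv"))   # raw, unvalidated files

cleaned = (raw
           .dropDuplicates(["event_id"])                    # de-duplicate
           .filter(F.col("user_id").isNotNull())            # drop malformed rows
           .withColumn("event_ts", F.to_timestamp("event_ts"))
           .withColumn("dt", F.to_date("event_ts")))        # partition key

# Write a compact, columnar extract that the warehouse load job picks up later.
(cleaned.write
 .mode("overwrite")
 .partitionBy("dt")
 .parquet("hdfs:///staging/clickstream_clean/"))
```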
  30. 30. Hadoop for Data Mgt and Storage Massive Storage Copyright © William El Kaim 2016 30 Massive Storage • Offloading large volumes of historical data into cold storage with Hadoop • Keep the data warehouse for hot data to allow BI and analytics • When data from cold storage is needed, it can be moved back into the warehouse Source: Microsoft 1
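A hedged sketch of the offload idea, assuming PySpark with a JDBC driver available; the JDBC URL, credentials, table name and partition predicate are placeholders:

```python
# Hypothetical offload job: copy a closed-out historical partition from the
# data warehouse into cheap HDFS storage, so the warehouse keeps only hot data.
# JDBC URL, credentials, table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dw-cold-offload").getOrCreate()

cold = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dw-host:5432/warehouse")
        .option("dbtable", "(SELECT * FROM sales WHERE sale_year = 2012) AS cold_sales")
        .option("user", "etl_user")
        .option("password", "***")
        .load())

# Land it in the lake as Parquet; it can still be queried with Hive/Impala/Spark SQL,
# or copied back into the warehouse if it ever becomes hot again.
cold.write.mode("overwrite").parquet("hdfs:///archive/sales/year=2012/")
```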
  31. 31. Hadoop for Data Analytics and Use Data Discovery Copyright © William El Kaim 2016 31 Data Discovery • Keep data warehouse for operational BI and analytics • Allow data scientists to gain new discoveries on raw data (no format or structure) • Operationalize discoveries back into the warehouse Source: Microsoft 2
  32. 32. Hadoop for Data Analytics and Use From Hindsight to Insight to Foresight • Traditional analytics tools are not well suited to capturing the full value of big data. • The volume of data is too large for comprehensive analysis. • The range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data. • Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, simple averages and running SQL queries. • Online analytical processing is merely a systematized extension of these basic analytics that still relies on a human to direct activities and specify what should be calculated. Copyright © William El Kaim 2016 32 2
  33. 33. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 33
  34. 34. How to Implement Hadoop? The Rise of Big Data Fabric • Definition: • Bringing together disparate big data sources automatically, intelligently, and securely, and processing them in a big data platform technology, such as Hadoop and Apache Spark, to deliver a unified, trusted, and comprehensive view of customer and business data. • Big data fabric focuses on automating the process of ingestion, curation, and integrating big data sources to deliver intelligent insights that are critical for businesses to succeed. • The platform minimizes complexity by automating processes, generating big data technology and platform code automatically, and integrating workflows to simplify the deployment. • Big data fabric is not just about Hadoop or Spark • It comprises several components, all of which must work in tandem to deliver a flexible, integrated, secure, and scalable platform. Copyright © William El Kaim 2016 34Source: Forrester
  35. 35. How to Implement Hadoop? Make Big Data Fabric Part Of Your Big Data Strategy! • Enterprise architects whose companies are pursuing a big data strategy can benefit from a big data fabric implementation that automates, secures, integrates, and curates big data sources intelligently. • Most enterprises that have a big data fabric platform are building it themselves by integrating various core open source technologies; however, this requires significant time and effort. • Your big data fabric strategy should: • Integrate only a few big data sources at first. • Start top-down rather than bottom-up, keeping the end in mind. • Separate analytics from data management. Analytics tools should focus primarily on data visualization and advanced statistical/data mining algorithms with limited dependence on data management functions. Decoupling data management from data analytics reduces the time and effort needed to deliver trusted analytics. • Create a team of experts to ensure success. • Use automation and machine learning to accelerate deployment. Copyright © William El Kaim 2016 35Source: Forrester
  36. 36. How to Implement Hadoop? Typical Project Development Steps Copyright © William El Kaim 2016 36 Dashboards, Reports, Visualization, … CRM, ERP Web, Mobile Point of sale Big Data Platform Business Transactions & Interactions Business Intelligence & Analytics Unstructured Data Log files DB data Exhaust Data Social Media Sensors, devices Classic Data Integration & ETL Capture Big Data Collect data from all sources structured &unstructured Process Transform, refine, aggregate, analyze, report Distribute Results Interoperate and share data with applications/analytics Feedback Use operational data w/in the big data platform 1 2 3 4 Source: HortonWorks
  37. 37. How to Implement Hadoop? Typical Three Steps Process Copyright © William El Kaim 2016 37Source: HortonWorks Collect Explore Enrich
  38. 38. How to Implement Hadoop? Typical Project Development Steps Copyright © William El Kaim 2016 38 1. Data Source Layer 3. Data Processing / Analysis Layer 2. Data Storage Layer 4. Data Output Layer LakeshoreRaw Data Data Lake Bus. or Usage oriented extractions Data Science Correlations, Analytics, Machine Learning Data Visualization & Bus. intelligence Internal Data External Data 1 2 3 3 4
  39. 39. How to Implement Hadoop? CRISP Methodology Copyright © William El Kaim 2016 39Source: The Modeling Agency and sv-europe
  40. 40. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 40
  41. 41. What is a Data Lake? • A data lake is: • Typically built using Hadoop. • Supports structured or unstructured data. • Benefits from a variety of storage and processing tools to extract value quickly. • Requires little or no processing to adapt the structure to an enterprise schema. • A central location in which to store all your data in its native form, regardless of its source or format. Copyright © William El Kaim 2016 41Source: Zaloni
  42. 42. Who is Using Data Lakes? Copyright © William El Kaim 2016 42Source: PWC
  43. 43. Data Flow In The Data Lake Copyright © William El Kaim 2016 43Source: PWC
  44. 44. Schema On Read or Schema On Write? • The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” • But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. • And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface. • At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write. Copyright © William El Kaim 2016 44Source: AdaptiveSystems
  45. 45. Schema On Write … • Before any data is written in the database, the structure of that data is strictly defined, and that metadata is stored and tracked. • Irrelevant data is discarded; data types, lengths and positions are all delineated. • The schema (the columns, rows, tables and relationships) is defined first, for the specific purpose that database will serve. • Then the data is filled into its pre-defined positions. The data must all be cleansed, transformed and made to fit in that structure before it can be stored, in a process generally referred to as ETL (Extract, Transform, Load). • That is why it is called “schema on write”: the data structure is already defined when the data is written and stored. • For a very long time, it was believed that this was the only right way to manage data. Copyright © William El Kaim 2016 45Source: AdaptiveSystems
  46. 46. Schema On Read … • Revolutionary concept: “You don’t have to know what you’re going to do with your data before you store it.” • Data of many types, sizes, shapes and structures can all be thrown into the Hadoop Distributed File System, and other Hadoop data storage systems. • While some metadata needs to be stored, to know what’s in there, no need yet to know how it will be structured! • Therefore, the data is stored in its original granular form, with nothing thrown away • In fact, no structural information is defined at all when the data is stored. • So “schema on read” implies that the schema is defined at the time the data is read and used, not at the time that it is written and stored. • When someone is ready to use that data, then, at that time, they define what pieces are essential to their purpose, where to find those pieces of information that matter for that purpose, and which pieces of the data set to ignore. Copyright © William El Kaim 2016 46Source: AdaptiveSystems
  47. 47. Schema On Read or Schema On Write? • Schema on read options tend to be a better choice: • for exploration, for “unknown unknowns,” when you don’t know what kind of questions you might want to ask, or the kinds of questions might change over time. • when you don’t have a strong need for immediate responses. They’re ideal for data exploration projects, and looking for new insights with no specific goal in mind. • Schema on write options tend to be very efficient for “known unknowns.” • When you know what questions you’re going to need to ask, especially if you will need the answers fast, schema on write is the only sensible way to go. • This strategy works best for old school BI types of scenarios on new school big data sets. Copyright © William El Kaim 2016 47Source: AdaptiveSystems
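A small, standard-library-only contrast of the two approaches; the table, file and field names are made up for illustration:

```python
# Contrast sketch (standard library only): schema on write vs. schema on read.
# Table, file and field names are illustrative.
import json
import sqlite3

events = [
    {"user": "alice", "action": "login", "device": {"os": "ios"}},
    {"user": "bob", "action": "purchase", "amount": 42.0},
]

# --- Schema on write: structure is fixed before anything is stored. ----------
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT NOT NULL, action TEXT NOT NULL)")
for e in events:
    # Anything outside the predefined columns (device, amount) is dropped here.
    db.execute("INSERT INTO events VALUES (?, ?)", (e["user"], e["action"]))
print(db.execute("SELECT action, COUNT(*) FROM events GROUP BY action").fetchall())

# --- Schema on read: store the raw records, interpret them when queried. -----
with open("/tmp/events.jsonl", "w") as f:          # stand-in for a data lake file
    for e in events:
        f.write(json.dumps(e) + "\n")

with open("/tmp/events.jsonl") as f:
    # The "schema" (which fields matter) is decided only now, at read time.
    records = [json.loads(line) for line in f]
    total = sum(r.get("amount", 0.0) for r in records if r.get("action") == "purchase")
print(total)
```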
  48. 48. Data Lake Resources • Introduction • PWC: Data lakes and the promise of unsiloed data • Zaloni: Resources • Tools • Bigstep • Informatica Data Lake Mgt & Intelligent Data Lake • Microsoft Azure Data Lake and Azure Data Catalog • Koverse • Oracle BigData Discovery • Platfora • Podium Data • Waterline Data Inventory • Zaloni Bedrock • Zdata Data Lake Copyright © William El Kaim 2016 48
  49. 49. Data Lake Tools: Platfora Copyright © William El Kaim 2016 49Source: Platfora
  50. 50. Data Lake Tool: Waterline Copyright © William El Kaim 2016 50Source: Waterline
  51. 51. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 51
  52. 52. What is BI on Hadoop? Three Options Source: DremioCopyright © William El Kaim 2016 52
  53. 53. What is BI on Hadoop? Three Options: ETL to Data Warehouse • Pros • Relational databases and their BI integrations are very mature • Use your favorite tools • Tableau, Excel, R, … • Cons • Traditional ETL tools don’t work well with modern data • Changing schemas, complex or semi-structured data, … • Hand-coded scripts are a common substitute • Data freshness • How often do you replicate/synchronize? • Data resolution • Can’t store all the raw data in the RDBMS (due to scalability and/or cost) • Need to sample, aggregate or time-constrain the data Source: DremioCopyright © William El Kaim 2016 53
  54. 54. What is BI on Hadoop? Three Options: Monolithic Tools • Indexima • Jethro • Looker • Arcadia • Atscale • Datameer • Platfora • Tamr • ZoomData Source: Modified from Platfora • Single piece of software on top of Big Data • Performs both data visualization (BI) and execution • Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with Copyright © William El Kaim 2016 54
  55. 55. What is BI on Hadoop? Three Options: Monolithic Tools • Pros • Only one tool to learn and operate • Easier than building and maintaining an ETL-to-RDBMS pipeline • Integrated data preparation in some solutions • Cons • Can’t analyze the raw data • Relies on aggregation or sampling before primary analysis • Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …) • Can’t run arbitrary SQL queries Source: DremioCopyright © William El Kaim 2016 55
  56. 56. What is BI on Hadoop? Three Options: SQL-on-Hadoop • The combination of a familiar interface (SQL) along with a modern computing architecture (Hadoop) enables people to manipulate and query data in new and powerful ways. • There’s no shortage of SQL on Hadoop offerings, and each Hadoop distributor seems to have its preferred flavor. • Not all SQL-on-Hadoop tools are equal, so picking the right tool is a challenge. Source: Datanami & Cloudera & DremioCopyright © William El Kaim 2016 56
  57. 57. What is BI on Hadoop? Three Options: SQL-on-Hadoop • SQL on Hadoop tools could be categorized as • Interactive or Native SQL • Batch & Data-Science SQL • OLAP Cubes (In-memory) on Hadoop Copyright © William El Kaim 2016 57
  58. 58. What is BI on Hadoop? SQL-on-Hadoop: Native SQL • When to use it? • These engines excel at executing ad-hoc SQL queries and performing self-service data exploration, often driven directly by data analysts, and at executing the machine-generated SQL code from BI tools like Qlik and Tableau. • Latency is usually measured in seconds to minutes. • One of the key differentiators among the interactive SQL-on-Hadoop tools is how they were built. • Some of the tools, such as Impala and Drill, were developed from the beginning to run on Hadoop clusters, while others are essentially ports of existing SQL engines that previously ran on vendors’ massively parallel processing (MPP) databases. Source: DatanamiCopyright © William El Kaim 2016 58
  59. 59. What is BI on Hadoop? SQL-on-Hadoop: Native SQL • Pros • Highest performance for Big Data workloads • Connect to Hadoop and also NoSQL systems • Make Hadoop “look like a database” • Cons • Queries may still be too slow for interactive analysis on many TB/PB • Can’t defeat physics Source: Datanami & Dremio • Interactive • In 2012, Cloudera rolled out the first release of Apache Impala • MapR has been pushing the schema-less bounds of SQL querying with Apache Drill, which is based on Google’s Dremel. • Presto (created by Facebook, now backed by Teradata) • VectorH (backed by Actian) • Apache Hawq (backed by Pivotal) • Apache Phoenix • BigSQL (backed by IBM) • Big Data SQL (backed by Oracle) • Vertica SQL on Hadoop (backed by Hewlett-Packard). Copyright © William El Kaim 2016 59
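As an illustration of “making Hadoop look like a database,” here is a hedged client-side sketch using the impyla package against Impala; the host, port and table name are assumptions, and the same pattern applies to Presto, Drill or Hive through their own drivers:

```python
# Hypothetical client-side view of a native SQL-on-Hadoop engine (Impala here,
# via the impyla package); host, port and table name are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# The engine plans and runs this directly against files in HDFS,
# so Hadoop "looks like a database" to BI tools and scripts.
cur.execute("""
    SELECT dt, COUNT(*) AS events
    FROM clickstream_clean
    WHERE dt >= '2016-10-01'
    GROUP BY dt
    ORDER BY dt
""")
for row in cur.fetchall():
    print(row)
```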
  60. 60. What is BI on Hadoop? SQL-on-Hadoop: Batch & Data Science SQL • When to use it? • Most often used for running big and complex jobs, including ETL and production data “pipelines,” against massive data sets. • Apache Hive is the best example of this tool category. The software essentially recreates a relational-style database atop HDFS, and then uses MapReduce (or more recently, Apache Tez) as an intermediate processing layer. • Tools • Apache Hive, Apache Tez, Apache Spark SQL • Pros • Potentially simpler deployment (no daemons) • New YARN job (MapReduce/Spark) for each query • Check-pointing support enables very long-running queries • Days to weeks (ETL work) • Works well in tandem with machine learning (Spark) • Cons • Latency prohibitive for interactive analytics • Tableau, Qlik Sense, … • Slower than native SQL engines Source: Datanami & DremioCopyright © William El Kaim 2016 60
  61. 61. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop • When to use it? • Data scientists doing self-service data exploration needing performance (in milliseconds to seconds). • Apache Spark SQL pretty much owns this category, although Apache Flink could provide Spark SQL with competition in this category. • Often requires an in-memory computing architecture • Tools • Apache Kylin, AtScale, Kyvos Insights • In-memory: Spark SQL, Apache Flink • Pros • Fast queries on pre-aggregated data • Can use SQL and MDX tools • Cons • Explicit cube definition/modeling phase • Not “self-service” • Frequent updates required due to dependency on business logic • Aggregate creation and maintenance can be slow (and the aggregates large) • User connects to and interacts with the cube • Can’t interact with the raw data Source: DremioCopyright © William El Kaim 2016 61
  62. 62. What is BI on Hadoop? SQL-on-Hadoop: OLAP Cubes (In-memory) on Hadoop • Apache Kylin lets you query massive data sets at sub-second latency in 3 steps. 1. Identify a star schema on Hadoop. 2. Build a cube on Hadoop. 3. Query the data with ANSI SQL and get results via ODBC, JDBC or a RESTful API. Copyright © William El Kaim 2016 62Source: Apache Kylin
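A hedged sketch of step 3, querying a built cube with ANSI SQL through Kylin’s REST query endpoint using the requests package; the host, project, table and credentials are assumptions based on Kylin’s sample/demo setup, and the JDBC/ODBC drivers expose the same SQL to BI tools:

```python
# Hypothetical step 3: query a built Kylin cube with ANSI SQL over the REST API.
# URL, project name, table and credentials are assumptions (Kylin sample project).
import requests

resp = requests.post(
    "http://kylin-host:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),                      # default demo credentials
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
        "limit": 100,
    },
)
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row)
```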
  63. 63. What is BI on Hadoop? SQL-on-Hadoop: Synthesis • Pros • Continue using your favorite BI tools and SQL-based clients • Tableau, Qlik, Power BI, Excel, R, SAS, … • Technical analysts can write custom SQL queries • Cons • Another layer in your data stack • May need to pre-aggregate the data depending on your scale • Need a separate data preparation tool (or custom scripts) Source: DremioCopyright © William El Kaim 2016 63
  64. 64. What is BI on Hadoop? SQL-on-Hadoop: Synthesis Source: DremioCopyright © William El Kaim 2016 64
  65. 65. What is BI on Hadoop? SQL-on-Hadoop: Encoding Formats • The different encoding standards result in different block sizes, and that can impact performance. • ORC files compress smaller than Parquet files, which can be a decisive choice factor. • Impala, for example, accesses HDFS data that’s encoded in the Parquet format, while Hive and others support optimized row columnar (ORC) files, sequence files, or plain text. • Semi-structured data formats like JSON are gaining traction • Previously, Hadoop users were using MapReduce to pound unstructured data into a more structured or relational format. • Drill opened up SQL-based access directly to semi-structured data, such as JSON, which is a common format found on NoSQL and SQL databases. Cloudera also recently added support for JSON in Impala. Source: DatanamiCopyright © William El Kaim 2016 65
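A small PySpark sketch for measuring the Parquet-versus-ORC trade-off on your own data rather than relying on general claims; the paths and query column are illustrative:

```python
# Minimal sketch: write the same data as Parquet and ORC, then compare on-disk
# size and scan time for your own workload. Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

df = spark.read.json("hdfs:///landing/events/*.json")   # semi-structured input

df.write.mode("overwrite").parquet("hdfs:///bench/events_parquet/")
df.write.mode("overwrite").orc("hdfs:///bench/events_orc/")

# Compare `hdfs dfs -du -s -h` on both output paths, then time a typical query:
spark.read.parquet("hdfs:///bench/events_parquet/").groupBy("action").count().show()
spark.read.orc("hdfs:///bench/events_orc/").groupBy("action").count().show()
```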
  66. 66. What is BI on Hadoop? SQL-on-Hadoop: Decision Tree Source: DremioCopyright © William El Kaim 2016 66
  67. 67. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 67
  68. 68. What is Data Science? • Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms Copyright © William El Kaim 2016 68
  69. 69. From Hindsight to Insight to Foresight Copyright © William El Kaim 2016 69
  70. 70. What is Data Science? Copyright © William El Kaim 2016 70Source: wikipedia Source: Forrester
  71. 71. What is Data Science? Algorithms In Decision Making Copyright © William El Kaim 2016 71
  72. 72. What is Data Science? Tools: Anaconda Copyright © William El Kaim 2016 72https://www.continuum.io/
  73. 73. What is Data Science? Tools: Dataiku DSS http://www.dataiku.com/dss/ Create Machine Learning ModelsCombine and Join Datasets Copyright © William El Kaim 2016 73
  74. 74. What is Data Science? Tools: IBM Data Science Experience • Cloud-based development environment for near real-time, high performance analytics • Available on IBM Cloud Bluemix platform • Provides • 250 curated data sets • Open source tools and a collaborative workspace like H2O, RStudio, Jupyter Notebooks on Apache Spark Copyright © William El Kaim 2016 74http://datascience.ibm.com/
  75. 75. What is Data Science? Tools: Tamr http://www.tamr.com/Copyright © William El Kaim 2016 75
  76. 76. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 76
  77. 77. Hadoop Processing Paradigms Batch processing • Large amount of static data • Generally incurs high latency / Volume Real-time processing • Compute streaming data • Low latency • Velocity Hybrid computation • Lambda Architecture • Volume + Velocity Copyright © William El Kaim 2016 77Source: Rubén Casado & Cloudera
  78. 78. Storage Technologies: Cost & Speed Copyright © William El Kaim 2016 78
  79. 79. Storage Technology: Encoding Format • Apache Parquet vs. Apache Orc Copyright © William El Kaim 2016 79
  80. 80. Hadoop Batch processing Copyright © William El Kaim 2016 80 • Scalable • Large amount of static data • Distributed • Parallel • Fault tolerant • High latency Volume Source: Rubén Casado
  81. 81. • MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. • Key Terminology • Job: A “full program” - an execution of a Mapper and Reducer across a data set • Task: An execution of a Mapper or a Reducer on a slice of data – a.k.a. Task-In-Progress (TIP) • Task Attempt: A particular instance of an attempt to execute a task on a machine Hadoop – Batch Processing - Map Reduce 81Copyright © William El Kaim 2016
  82. 82. • Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). • MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to reduce the distance over which it must be transmitted. • "Map" step • Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. • A master node ensures that only one copy of redundant input data is processed. • "Shuffle" step • Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node. • "Reduce" step • Worker nodes now process each group of output data, per key, in parallel. Hadoop – Batch Processing - Map Reduce 82Copyright © William El Kaim 2016
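The classic word-count example makes the three steps concrete. Below is a hedged Hadoop Streaming version in Python: the mapper emits (word, 1) pairs, the framework’s shuffle groups and sorts them by key, and the reducer sums each group. The launch command in the comment is illustrative (jar path and HDFS locations are assumptions):

```python
# mapper.py -- "Map" step: emit (word, 1) for every word on stdin.
# Launched via Hadoop Streaming, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/books -output /out/wordcount \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- "Reduce" step: stdin arrives sorted by key (the shuffle),
# so counts for the same word are adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```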
  83. 83. Hadoop – Batch Processing - Map Reduce Copyright © William El Kaim 2016 83
  84. 84. Real-time Processing Copyright © William El Kaim 2016 84 • Low latency • Continuous unbounded streams of data • Distributed • Parallel • Fault-tolerant Velocity Source: Rubén Casado
  85. 85. • Computational model and Infrastructure for continuous data processing, with the ability to produce low-latency results • Data collected continuously is naturally processed continuously (Event Processing or Complex Event Processing -CEP) Real-time (Stream) Processing 85Copyright © William El Kaim 2016 Source: Trivadis
  86. 86. Real-time (Stream) Processing • (Event-) Stream Processing • A one-at-a-time processing model • A datum is processed as it arrives • Sub-second latency • Difficult to process state data efficiently • Micro-Batching • A special case of batch processing with very small (tiny) batch sizes • A nice mix between batching and streaming • At the cost of latency • Gives stateful computation, making windowing an easy task Copyright © William El Kaim 2016 86Source: Trivadis
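A toy, framework-free sketch contrasting the two models; the event source, fields and batch size are made up:

```python
# Toy contrast of the two streaming models (no framework; names are made up).
import time
from collections import Counter

def event_source():
    """Stand-in for an unbounded stream (e.g. a Kafka topic)."""
    for i in range(10):
        yield {"user": f"u{i % 3}", "action": "click", "ts": time.time()}

# --- One-at-a-time (event stream processing): handle each datum on arrival. --
counts = Counter()
for event in event_source():
    counts[event["user"]] += 1     # sub-second latency, but state handling is on you

# --- Micro-batching: group arrivals into tiny batches, process batch-wise. ---
def micro_batches(stream, batch_size=4):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in micro_batches(event_source()):
    # Slightly higher latency, but windowed/stateful computation per batch is easy.
    print(Counter(e["user"] for e in batch))
```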
  87. 87. Real-time (Stream) Processing Copyright © William El Kaim 2016 87Source: Trivadis
  88. 88. Real-time (Stream) Processing Arch. Pattern Copyright © William El Kaim 2016 88Source: Cloudera
  89. 89. Hybrid Computation Model Copyright © William El Kaim 2016 89 • Low latency • Massive data + Streaming data • Scalable • Combine batch and real-time results Volume Velocity Source: Rubén Casado
  90. 90. Hybrid Computation: Lambda Architecture • Data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. • A system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. • This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. • The two view outputs may be joined before presentation. • Lambda Architecture case stories via lambda-architecture.net Copyright © William El Kaim 2016 90Source: Kreps
  91. 91. • Batch layer • Receives arriving data, combines it with historical data and recomputes results by iterating over the entire combined data set. • The batch layer has two major tasks: • managing historical data; and recomputing results such as machine learning models. • Operates on the full data and thus allows the system to produce the most accurate results. However, the results come at the cost of high latency due to high computation time. • The speed layer • Is used in order to provide results in a low-latency, near real-time fashion. • Receives the arriving data and performs incremental updates to the batch layer results. • Thanks to the incremental algorithms implemented at the speed layer, computation cost is significantly reduced. • The serving layer enables various queries of the results sent from the batch and speed layers. Hybrid Computation: Lambda Architecture 91Copyright © William El Kaim 2016 Source: Kreps
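A toy sketch of how the three layers fit together at query time; the data and function names are made up, and a real system would use Hadoop for the batch view and a stream processor for the speed layer:

```python
# Toy Lambda sketch: batch view + speed-layer increments, merged at query time.
# All data and names are made up.
from collections import Counter

all_events = [("alice", 1), ("bob", 1), ("alice", 1)]   # master data set
recent_events = [("alice", 1), ("carol", 1)]            # arrived since the last batch run

# Batch layer: periodically recompute the view over the *entire* data set (accurate, slow).
batch_view = Counter()
for user, n in all_events:
    batch_view[user] += n

# Speed layer: incrementally update a small real-time view (fast, low latency).
realtime_view = Counter()
for user, n in recent_events:
    realtime_view[user] += n

# Serving layer: answer queries by merging both views.
def query(user):
    return batch_view[user] + realtime_view[user]

print(query("alice"))   # 3 = 2 from the batch view + 1 from the speed layer
```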
  92. 92. Hybrid computation: Lambda Architecture Copyright © William El Kaim 2016 92Source: Mapr
  93. 93. Hybrid computation: Kappa Architecture • Proposed by Jay Kreps (LinkedIn) in his article “Questioning the Lambda Architecture”. • Followed by the talk “Turning the database inside out with Apache Samza” by Martin Kleppmann • Main objective • Avoid maintaining two separate code bases for the batch and speed layers (lambda). • Key benefits • Handle both real-time data processing and continuous data reprocessing using a single stream processing engine. • Data reprocessing is an important requirement for making visible the effects of code changes on the results. Copyright © William El Kaim 2016 93Source: Kreps
  94. 94. Hybrid computation: Kappa Architecture • The architecture is composed of only two layers: • The stream processing layer runs the stream processing jobs. • Normally, a single stream processing job is run to enable real-time data processing. • Data reprocessing is only done when some code of the stream processing job needs to be modified. • This is achieved by running another, modified stream processing job and replaying all previous data. • The serving layer is used to query the results (like the Lambda architecture). Copyright © William El Kaim 2016 94Source: O’Reilly
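A toy sketch of Kappa-style reprocessing: the retained log is replayed through a new version of the stream job into a new serving table, and readers are switched over; all names and the business-logic change are made up (a real log would be something like a Kafka topic):

```python
# Toy Kappa sketch: reprocessing = replay the retained log through a new job
# version into a new serving table, then switch readers over. Names are made up.
event_log = [                       # durable, replayable log (e.g. a Kafka topic)
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 5},
    {"user": "alice", "amount": 7},
]

def stream_job_v1(event, table):
    table[event["user"]] = table.get(event["user"], 0) + event["amount"]

def stream_job_v2(event, table):
    # Changed business logic: ignore small purchases.
    if event["amount"] >= 7:
        table[event["user"]] = table.get(event["user"], 0) + event["amount"]

serving_v1 = {}
for e in event_log:                 # normal real-time processing with v1
    stream_job_v1(e, serving_v1)

serving_v2 = {}
for e in event_log:                 # reprocessing: replay the whole log with v2
    stream_job_v2(e, serving_v2)

current = serving_v2                # the serving layer now points at the new table
print(serving_v1, current)
```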
  95. 95. Hybrid computation: Lambda vs. Kappa Copyright © William El Kaim 2016 95 Source: Kreps. Lambda: used to derive value from all the data. Kappa: used to provide the freshest data to customers through a single processing chain.
  96. 96. Hadoop Processing Paradigms Evolutions Copyright © William El Kaim 2016 96 Source: Rubén Casado Tejedor
  97. 97. Hadoop Architecture Copyright © William El Kaim 2016 97 (architecture diagram, by layer) • Data Sources: Open Data, Operational Systems (ODS, IoT), Existing Sources of Data (Databases, DW, DataMart) • Data Sourcing and Data Ingestion: Batch, Streaming • Data Lake: Source Data, Computed Data, Metadata, Lakeshore (SQL & NoSQL Database), on Hadoop and Spark • Data Preparation and Feature Preparation • Data Science Tools & Platforms: Dataiku, Tableau, Python, R, etc. • BI Tools & Platforms: Qlik, Tibco, IBM, SAP, BIME, etc. • App. Services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop • Data Driven Business Process, Applications and Services
  98. 98. Hadoop Technologies Copyright © William El Kaim 2016 98 (technology map, by layer) • Data Sources: Open Data, Operational Systems (ODS, IoT), Existing Sources of Data (Databases, DW, DataMart) • Data Ingestion (Batch, Streaming): Apex, Flink, Flume, Kafka, Amazon Kinesis, Nifi, Samza, Spark, Sqoop, Scribe, Storm, NFS Gateway, etc. • Data Lake storage (Distributed File Systems): GlusterFS, HDFS, Amazon S3, MapRFS, ElasticSearch • Encoding Formats: JSON, RCFile, Parquet, ORCFile • Processing: MapReduce, Event Stream & Micro Batch • NoSQL Databases: Cassandra, Ceph, DynamoDB, HBase, Hive, Impala, Ring, OpenStack Swift, etc. • Distributions: Cloudera, HortonWorks, MapR, SyncFusion, Amazon EMR, Azure HDInsight, Altiscale, Pachyderm, Qubole, etc. • Data Preparation / Feature Preparation / Data Science: Dataiku, Datameer, Tamr, R, SAS, Python, RapidMiner, etc. • Machine Learning: BigML, Mahout, Predicsys, Azure ML, TensorFlow, H2O, etc. • Data Warehouse / Lakeshore & Analytics: Cassandra, Druid, DynamoDB, MongoDB, Redshift, Google BigQuery, etc. • BI Tools & Platforms: Qlik, Tableau, Tibco, Jethro, Looker, IBM, SAP, BIME, etc. • App. Services: Cascading, Crunch, Hfactory, Hunk, Spring for Hadoop, D3.js, Leaflet • Analytics Apps and Services
  99. 99. Plan • Taming The Data Deluge • What is Big Data? • What is Hadoop? • When to use Hadoop? • How to Implement Hadoop? • What is a Data Lake? • What is BI on Hadoop? • What is Data Science? • Hadoop Processing Paradigms • Hadoop Tools Landscape Copyright © William El Kaim 2016 99
  100. 100. Big Data Landscape Hadoop Distributors • Three Main pure-play Hadoop distributors • Cloudera, Hortonworks, and MapR Technologies • Other Hadoop distributors • SyncFusion: Hadoop for Windows, • Pivotal Big Data Suite • Pachyderm • Hadoop Cloud Provider: • Altiscale, Amazon EMR, BigStep, Google Cloud DataProc, HortonWorks SequenceIQ, IBM BigInsights, Microsoft HDInsight, Oracle Big Data, Qubole, Rackspace • Hadoop Infrastructure as a Service • BlueData, Packet Copyright © William El Kaim 2016 100 Forrester Big Data Hadoop Cloud, Q1 2016
  101. 101. Big Data Landscape Data Preparation Copyright © William El Kaim 2016 101 Source: Bloor Research
  102. 102. Big Data Landscape Business Intelligence and Analytics Platforms Copyright © William El Kaim 2016 102
  103. 103. Big Data Landscape Hadoop Distributions To Start With • Apache Hadoop • Cloudera Live • Dataiku • Hortonworks Sandbox • IBM BigInsights • MapR Sandbox • Microsoft Azure HDInsight • Syncfusion Hadoop for Windows • W3C Big Data Europe Platform Copyright © William El Kaim 2016 103
  104. 104. 104Copyright © William El Kaim 2016 Twitter http://www.twitter.com/welkaim SlideShare http://www.slideshare.net/welkaim EA Digital Codex http://www.eacodex.com/ Linkedin http://fr.linkedin.com/in/williamelkaim Claudine O'Sullivan
