The Six Pillars for Building Big
Data Analytics Ecosystems
Big Data and Analytics
What?
◦ Any voluminous amount of structured, semi-structured, or unstructured data
Where?
◦ Large organizations
Why?
◦ Cost reduction
◦ Faster, better decision making
◦ New products and services
Big Data Analytics Ecosystems
◦ Data exploration
◦ Data preparation
◦ Modeling
Pillars of Big Data: Overview
Pillars of Big Data: Storage
RDBMS
◦ Ensures ACID (Atomicity, Consistency, Isolation, Durability); a minimal sketch follows
◦ Recent RDBMS developments promise enhanced performance and scalability
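A minimal sketch of the atomicity property, using Python's built-in sqlite3 module (the accounts table and values are hypothetical): either both updates commit or neither does.

    # Atomicity: a transfer either fully commits or fully rolls back.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    conn.commit()

    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    except sqlite3.Error:
        pass  # after a rollback, no partial transfer is ever visible

    print(conn.execute("SELECT id, balance FROM accounts").fetchall())
    # [(1, 70.0), (2, 80.0)]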
DFS
◦ Client-server architecture
◦ Hides details from the user, e.g., data location
◦ Concurrency transparency
◦ Failure transparency
◦ Replication and scalability transparency
◦ E.g., GFS, HDFS, CFS
NoSQL
◦ Sacrifices consistency for high availability and scalability
◦ Data stored as key/value pairs (see the sketch below)
◦ Main types: key/value stores using MVCC (e.g., Voldemort, Riak), column-oriented databases (COD), and document-oriented databases (DOD), plus graph databases
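A minimal sketch of the schema-less key/value data model shared by many NoSQL stores (a hypothetical in-memory store; real systems add partitioning and replication):

    # Schema-less key/value storage: any JSON-like document under a key.
    store = {}

    def put(key, value):
        store[key] = value

    def get(key, default=None):
        return store.get(key, default)

    put("user:42", {"name": "Ada", "tags": ["analytics", "ml"]})
    print(get("user:42"))  # {'name': 'Ada', 'tags': ['analytics', 'ml']}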
Pillars of Big Data: Processing
Batch Processing
◦ Executes a series of jobs without manual intervention
◦ E.g., Hadoop
◦ Real-life example: credit card processing
◦ MapReduce (a minimal sketch follows the list)
◦ Map
◦ Shuffle
◦ Reduce
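A minimal single-process word-count sketch of the three phases (Hadoop distributes each phase across the cluster; the names here are illustrative):

    from collections import defaultdict

    def mapper(line):                       # Map: emit <key, value> tuples
        for word in line.split():
            yield word, 1

    def shuffle(pairs):                     # Shuffle: group values by key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reducer(key, values):               # Reduce: aggregate per key
        return key, sum(values)

    lines = ["big data big analytics", "big data"]
    pairs = (pair for line in lines for pair in mapper(line))
    print([reducer(k, vs) for k, vs in shuffle(pairs)])
    # [('big', 3), ('data', 2), ('analytics', 1)]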
Interactive Processing
◦ Requires human interaction
◦ Real-life example: spreadsheets
Pillars of Big Data: Processing
Iterative Processing
◦ Machine learning operations
◦ Requires several passes for the algorithm to converge (see the sketch below)
◦ E.g., HaLoop, Main Memory MapReduce (M3R)
◦ Real-life example: iterative evaluation of a mathematical expression
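A minimal sketch of iterating until convergence, here Newton's method for a square root (systems like HaLoop additionally cache loop-invariant data across passes):

    def newton_sqrt(x, tol=1e-10):
        estimate = x
        passes = 0
        while abs(estimate * estimate - x) > tol:       # convergence test
            estimate = 0.5 * (estimate + x / estimate)  # one pass
            passes += 1
        return estimate, passes

    print(newton_sqrt(2.0))  # (1.4142135623746899, 4) -- a handful of passes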
Incremental Processing
◦ Analyzes data in motion
◦ Requires quick action
◦ The algorithm does not need the full dataset (see the sketch below)
◦ E.g., Apache Storm, Microsoft Trill
◦ Real-life example: security checks on an incoming data stream
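A minimal sketch of inspecting each event as it arrives, keeping only a small rolling state rather than the full dataset (the window size and threshold are illustrative; Storm or Trill would distribute this over a cluster):

    from collections import deque

    window = deque(maxlen=100)  # rolling state: the last 100 observations

    def on_event(value, threshold=3.0):
        """Flag values far from the rolling mean, then update the window."""
        if len(window) >= 10:
            mean = sum(window) / len(window)
            if abs(value - mean) > threshold:
                print(f"alert: suspicious value {value} (rolling mean {mean:.2f})")
        window.append(value)

    for event in [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 1.0, 0.9, 1.2, 9.5]:
        on_event(event)  # only the final 9.5 triggers an alert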
Pillars of Big Data: Processing
Approximate Processing
◦ Quick retrieval of approximate results from a small sample (see the sketch below)
◦ E.g., Early Accurate Result Library (EARL), BlinkDB
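A minimal sketch of the sampling idea behind such systems (EARL and BlinkDB also attach error bounds, which this toy version omits):

    import random

    population = [random.gauss(100, 15) for _ in range(100_000)]

    sample = random.sample(population, 1_000)   # scan only 1% of the data
    estimate = sum(sample) / len(sample)
    exact = sum(population) / len(population)
    print(f"estimate {estimate:.2f} vs exact {exact:.2f}")  # close, far cheaper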
In-Database Processing
◦ In-database machine learning
◦ E.g., Microsoft SQL Server Analysis Services (SSAS)
Pillars of Big Data: Analytics
Orchestration
Orchestrates complex analytic jobs and workflows to achieve the user's goals
Scheduling
◦ Resource Utilization
◦ Resources: memory, CPU, network, and disk
◦ Effective resource utilization mitigates idle resources
◦ Hadoop 1.0 shortcomings
◦ Apache Hadoop YARN
◦ Data Locality: keep data and processing on the same node to avoid network congestion
Pillars of Big Data: Analytics
Orchestration
Provisioning
◦ Resource Provisioning: allocating resources to jobs with minimal cost and execution time
◦ Resource Set (RS) Maximizer
◦ Conductor
◦ Data Provisioning
Pillars of Big Data: Analytics
Assistance
Narrowing the analytics talent gap by magnifying the internal skill set through in-tool assistance
Static Assistance
◦ Tooltips
◦ Help Pages
◦ Wizards
Intelligent Assistance
◦ Data Preparation: detecting irrelevant data/attributes and converting them into meaningful information (see the sketch below)
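A minimal sketch of a preparation step an intelligent assistant could suggest (hypothetical columns, assuming pandas is installed): drop constant attributes and impute missing numeric values.

    import pandas as pd

    df = pd.DataFrame({
        "age": [34, None, 28, 45],
        "country": ["DE", "DE", "DE", "DE"],  # constant: carries no information
        "spend": [120.0, 80.0, None, 200.0],
    })

    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)              # remove irrelevant attributes
    df = df.fillna(df.mean(numeric_only=True))  # impute missing numeric values
    print(df)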
Pillars of Big Data: Analytics
Assistance
◦ Selecting Operations
◦ Expert Systems (ES)
◦ Meta-Learning Systems (MLSs)
◦ Ontology Reasoners (OR)
◦ Automatic Workflow Generation: provides a workflow based on the input data and the problem at hand
◦ Fault Detection and Handling: with Big Data, a failure mid-computation is a catastrophe (a minimal checkpointing sketch follows)
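One common handling strategy is checkpointing; a minimal sketch (the file name and the doubling "work" are hypothetical) that persists progress so a long job resumes where it stopped instead of restarting from scratch:

    import json
    import os

    CHECKPOINT = "progress.json"

    def load_checkpoint():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next_index"]
        return 0

    def process(items):
        results = []
        for i in range(load_checkpoint(), len(items)):
            results.append(items[i] * 2)  # stand-in for expensive work
            with open(CHECKPOINT, "w") as f:
                json.dump({"next_index": i + 1}, f)  # record progress
        return results

    print(process([1, 2, 3, 4]))  # on rerun after a crash, skips finished items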
Pillars of Big Data: User Interfaces
The full power of an analytics solution is accessible only through interfaces suited to its users.
Five approaches for user interfaces:
◦ Scripts
◦ SQL-based interfaces
◦ Graph-based interfaces
◦ Sheets
◦ Visualizations
Pillars of Big Data: User Interfaces
Scripts:
Analytics at the programming level
Interfaces can be CLI or API
Low-level coding
Supports data mining
Mostly avoided by non-programmers
Such as: R for statisticians, MATLAB, and Weka
SQL-based Interfaces:
Unified SQL interface (extended SQL)
Use of UDFs (user-defined functions); see the sketch below
Further classification:
SQL-on-Hadoop
Machine learning SQL
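A minimal sketch of extending SQL with a UDF, assuming PySpark (a SQL-on-Hadoop-style engine) is installed; the function name is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    # Register a Python function so plain SQL can call it by name.
    spark.udf.register("normalize", lambda s: s.strip().lower(), "string")

    spark.sql("SELECT normalize('  BIG Data  ') AS cleaned").show()
    # +--------+
    # | cleaned|
    # +--------+
    # |big data|
    # +--------+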
Pillars of Big Data: User Interfaces
Graphs:
No need to code
Drag and drop
Operations panel and processing canvas
Such as: RapidMiner, IBM SPSS Modeler, WINGS, etc.
Sheets:
Most feasible for business organizations, since they already work with spreadsheets
Focused on data exploration in the easiest way
Compatible with moving data to other solutions
Such as: Power Query, Microsoft Tabular, OpenRefine (formerly Google Refine), etc.
Pillars of Big Data: User Interfaces
Visualization:
Controls the high risk of analyzing the wrong or incompatible set of attributes
Suitable for large business firms
Lacks machine learning techniques
Such as: IBM Watson Analytics, SAS Visual Analytics, etc.
Pillars of Big Data: Deployment
Many components need to be integrated together
Deployment challenges include:
◦ Complexity
◦ Scope beyond the in-house IT technicians
Pillars of Big Data: Deployment
Product:
Product deployment models are used to ensure privacy and security
Drawbacks:
◦ Cost
◦ IT staff
◦ Limited scalability
Most components are open-source platforms, but integration remains the major issue
Pillars of Big Data: Deployment
Service:
Services provided on demand; cost is pay-per-user/data
Security and privacy are concerns, as is the cost of moving data to the provider's cloud
Hybrid cloud:
Data storage and processing reside on the organization's infrastructure
Future Directions:
Each solution brings some features not available in the others, but also
adds some limitations and overheads.
While there has been a continuous improvement in analytics solutions to
address different analytics scenarios, there are still some gaps.
Conclusions:
It is difficult to select a suitable analytics solution because a weak component in
the ecosystem can cause the whole ecosystem to function inefficiently.
For each of these pillars, different approaches are discussed and popular
systems are presented.
The pillars form a taxonomy that aims to give an overview on the field, to
guide organizations and researchers to build their Big Data Analytics
Ecosystem, and help to identify challenges and opportunities in the field.


Editor's Notes

  • #3 What: Big Data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Increasingly, organizations' success depends on how quickly and efficiently they can turn the petabytes of data they collect into actionable information. Data can be structured, generated by applications like Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) systems and typically stored in rows and columns with well-defined schemas. It can be semi-structured, generated by sensors, web feeds, event monitors, stock market feeds, and network and security systems. Where: With almost everything now online, organizations look at the Big Data they collect to gain insights for improving their services. Why: Cost reduction: Big Data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can identify more efficient ways of doing business. Faster, better decision making: with the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses can analyze information immediately and make decisions based on what they have learned. New products and services: with the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want; Davenport points out that with Big Data analytics, more companies are creating new products to meet customers' needs. Data exploration: analysts go through the data, using ad-hoc queries and visualizations, to better understand it. Data preparation: analysts clean, prepare, and transform the data for modeling, using batch processing to run compute- and IO-intensive operations. Modeling: data models are trained on the prepared data using iterative processing, and trained models are used to score the unlabeled data.
  • #4 Storage handles the data's huge volume, fast arrival, and multiple formats; Processing meets the Big Data Analytics processing needs; Orchestration manages available resources to reduce processing time and cost; Assistance goes beyond the interface and provides suggestions to help users with decisions when selecting operations and building their analytics process; User Interface provides users with a familiar environment to build and run their analytics; Deployment Method provides scalability, security, and reliability.
  • #5 ACID: Atomicity, Consistency, Isolation, and Durability. Recent RDBMS developments promise enhanced performance and scalability. DFS examples: Hadoop Distributed File System (HDFS) and Cassandra File System (CFS). NoSQL: Voldemort and Riak use Multi-Version Concurrency Control (MVCC); other types are Column-Oriented Databases and Document-Oriented Databases.
  • #6 MapReduce, as presented in Figure 3, consists of Map, Shuffle, and Reduce phases, which are executed sequentially, utilizing all nodes in the cluster. In the Map phase, the programmer-provided Map function (Mapper) processes the input data and outputs intermediate data in the form of <key, value> tuples, which get stored on disk. The Shuffle phase then groups values with the same key together and sends them to the reduce nodes over the network. Finally, the programmer-provided Reduce function (Reducer) reads the intermediate data from disk, processes it, and generates the final output.