The Six Pillars for Building Big
Data Analytics Ecosystems
Big Data and Analytics
What?
◦ Any voluminous amount of structured, semi-structured, or unstructured data
Where?
◦ Large organizations
Why?
◦ Cost reduction
◦ Faster, better decision making
◦ New products and services
Big Data Analytics Ecosystems
◦ Data exploration
◦ Data preparation
◦ Modeling
Pillars of Big Data: Overview
Pillars of Big Data: Storage
RDBMS
◦ Ensures ACID (Atomicity, Consistency, Isolation, Durability); a minimal sketch follows
◦ Recent RDBMS developments promise enhanced performance and scalability
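A minimal sketch of the atomicity property, using Python's built-in sqlite3 module (the accounts table and values are hypothetical): either both updates commit or neither does.

    # Atomicity: a transfer either fully commits or fully rolls back.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    conn.commit()

    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    except sqlite3.Error:
        pass  # after a rollback, no partial transfer is ever visible

    print(conn.execute("SELECT id, balance FROM accounts").fetchall())
    # [(1, 70.0), (2, 80.0)]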
DFS
◦ Client-server architecture
◦ Hides details from the user, e.g., data location
◦ Concurrency transparency
◦ Failure transparency
◦ Replication and scalability transparency
◦ E.g., GFS, HDFS, CFS
NoSQL
◦ Sacrifices consistency for high availability and scalability
◦ Data stored as key/value pairs (see the sketch below)
◦ Main types: key/value stores using MVCC (e.g., Voldemort, Riak), column-oriented databases (COD), and document-oriented databases (DOD), plus graph databases
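A minimal sketch of the schema-less key/value data model shared by many NoSQL stores (a hypothetical in-memory store; real systems add partitioning and replication):

    # Schema-less key/value storage: any JSON-like document under a key.
    store = {}

    def put(key, value):
        store[key] = value

    def get(key, default=None):
        return store.get(key, default)

    put("user:42", {"name": "Ada", "tags": ["analytics", "ml"]})
    print(get("user:42"))  # {'name': 'Ada', 'tags': ['analytics', 'ml']}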
Pillars of Big Data: Processing
Batch Processing
◦ Executes a series of jobs without manual intervention
◦ E.g., Hadoop
◦ Real-life example: credit card processing
◦ MapReduce (a minimal sketch follows the list)
◦ Map
◦ Shuffle
◦ Reduce
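A minimal single-process word-count sketch of the three phases (Hadoop distributes each phase across the cluster; the names here are illustrative):

    from collections import defaultdict

    def mapper(line):                       # Map: emit <key, value> tuples
        for word in line.split():
            yield word, 1

    def shuffle(pairs):                     # Shuffle: group values by key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reducer(key, values):               # Reduce: aggregate per key
        return key, sum(values)

    lines = ["big data big analytics", "big data"]
    pairs = (pair for line in lines for pair in mapper(line))
    print([reducer(k, vs) for k, vs in shuffle(pairs)])
    # [('big', 3), ('data', 2), ('analytics', 1)]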
Interactive Processing
◦ Requires human interaction
◦ Real-life example: spreadsheets
Pillars of Big Data: Processing
Iterative Processing
◦ Machine learning operations
◦ Requires several passes for the algorithm to converge (see the sketch below)
◦ E.g., HaLoop, Main Memory MapReduce (M3R)
◦ Real-life example: iterative evaluation of a mathematical expression
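A minimal sketch of iterating until convergence, here Newton's method for a square root (systems like HaLoop additionally cache loop-invariant data across passes):

    def newton_sqrt(x, tol=1e-10):
        estimate = x
        passes = 0
        while abs(estimate * estimate - x) > tol:       # convergence test
            estimate = 0.5 * (estimate + x / estimate)  # one pass
            passes += 1
        return estimate, passes

    print(newton_sqrt(2.0))  # (1.4142135623746899, 4) -- a handful of passes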
Incremental Processing
◦ Analyzes data in motion
◦ Requires quick action
◦ The algorithm does not need the full dataset (see the sketch below)
◦ E.g., Apache Storm, Microsoft Trill
◦ Real-life example: security checks on an incoming data stream
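A minimal sketch of inspecting each event as it arrives, keeping only a small rolling state rather than the full dataset (the window size and threshold are illustrative; Storm or Trill would distribute this over a cluster):

    from collections import deque

    window = deque(maxlen=100)  # rolling state: the last 100 observations

    def on_event(value, threshold=3.0):
        """Flag values far from the rolling mean, then update the window."""
        if len(window) >= 10:
            mean = sum(window) / len(window)
            if abs(value - mean) > threshold:
                print(f"alert: suspicious value {value} (rolling mean {mean:.2f})")
        window.append(value)

    for event in [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 1.0, 0.9, 1.2, 9.5]:
        on_event(event)  # only the final 9.5 triggers an alert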
Pillars of Big Data: Processing
Approximate Processing
◦ Quick retrieval of approximate results from a small sample (see the sketch below)
◦ E.g., Early Accurate Result Library (EARL), BlinkDB
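A minimal sketch of the sampling idea behind such systems (EARL and BlinkDB also attach error bounds, which this toy version omits):

    import random

    population = [random.gauss(100, 15) for _ in range(100_000)]

    sample = random.sample(population, 1_000)   # scan only 1% of the data
    estimate = sum(sample) / len(sample)
    exact = sum(population) / len(population)
    print(f"estimate {estimate:.2f} vs exact {exact:.2f}")  # close, far cheaper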
In-Database Processing
◦ In-database machine learning
◦ E.g., Microsoft SQL Server Analysis Services (SSAS)
Pillars of Big Data: Analytics
Orchestration
Orchestrates complex analytic jobs and workflows to achieve the user's goals
Scheduling
◦ Resource Utilization
◦ Resources: memory, CPU, network, and disk
◦ Effective resource utilization mitigates idle resources
◦ Hadoop 1.0 shortcomings
◦ Apache Hadoop YARN
◦ Data Locality: keep data and processing on the same node to avoid network congestion
Pillars of Big Data: Analytics
Orchestration
Provisioning
◦ Resource Provisioning: allocating resources to jobs with minimal cost and execution time
◦ Resource Set (RS) Maximizer
◦ Conductor
◦ Data Provisioning
Pillars of Big Data: Analytics
Assistance
Narrowing the analytics talent gap by magnifying the internal skill set through in-tool assistance
Static Assistance
◦ Tooltips
◦ Help Pages
◦ Wizards
Intelligent Assistance
◦ Data Preparation: detecting irrelevant data/attributes and converting them into meaningful information (see the sketch below)
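A minimal sketch of a preparation step an intelligent assistant could suggest (hypothetical columns, assuming pandas is installed): drop constant attributes and impute missing numeric values.

    import pandas as pd

    df = pd.DataFrame({
        "age": [34, None, 28, 45],
        "country": ["DE", "DE", "DE", "DE"],  # constant: carries no information
        "spend": [120.0, 80.0, None, 200.0],
    })

    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)              # remove irrelevant attributes
    df = df.fillna(df.mean(numeric_only=True))  # impute missing numeric values
    print(df)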
Pillars of Big Data: Analytics
Assistance
◦ Selecting Operations
◦ Expert Systems (ES)
◦ Meta-Learning Systems (MLSs)
◦ Ontology Reasoners (OR)
◦ Automatic Workflow Generation: provides a workflow based on the input data and the problem at hand
◦ Fault Detection and Handling: with Big Data, a failure mid-computation is a catastrophe (a minimal checkpointing sketch follows)
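One common handling strategy is checkpointing; a minimal sketch (the file name and the doubling "work" are hypothetical) that persists progress so a long job resumes where it stopped instead of restarting from scratch:

    import json
    import os

    CHECKPOINT = "progress.json"

    def load_checkpoint():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next_index"]
        return 0

    def process(items):
        results = []
        for i in range(load_checkpoint(), len(items)):
            results.append(items[i] * 2)  # stand-in for expensive work
            with open(CHECKPOINT, "w") as f:
                json.dump({"next_index": i + 1}, f)  # record progress
        return results

    print(process([1, 2, 3, 4]))  # on rerun after a crash, skips finished items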
Pillars of Big Data: User Interfaces
The full power of an analytics solution is accessible only through interfaces suited to its users.
Five approaches for user interfaces:
◦ Scripts
◦ SQL-based interfaces
◦ Graph-based interfaces
◦ Sheets
◦ Visualizations
Pillars of Big Data: User Interfaces
Scripts:
Analytics at the programming level
Interfaces can be CLI or API
Low-level coding
Supports data mining
Mostly avoided by non-programmers
Such as: R for statisticians, MATLAB, and Weka
SQL-based Interfaces:
Unified SQL interface (extended SQL)
Use of UDFs (user-defined functions); see the sketch below
Further classification:
SQL-on-Hadoop
Machine learning SQL
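A minimal sketch of extending SQL with a UDF, assuming PySpark (a SQL-on-Hadoop-style engine) is installed; the function name is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()

    # Register a Python function so plain SQL can call it by name.
    spark.udf.register("normalize", lambda s: s.strip().lower(), "string")

    spark.sql("SELECT normalize('  BIG Data  ') AS cleaned").show()
    # +--------+
    # | cleaned|
    # +--------+
    # |big data|
    # +--------+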
Pillars of Big Data: User Interfaces
Graphs:
No need to code
Drag and drop
Operations panel and processing canvas
Such as: RapidMiner, IBM SPSS Modeler, WINGS, etc.
Sheets:
Most feasible for business organizations, since they already work with spreadsheets
Focused on data exploration in the easiest way
Compatible with moving data to other solutions
Such as: Power Query, Microsoft Tabular, OpenRefine (formerly Google Refine), etc.
Pillars of Big Data: User Interfaces
Visualization:
Controls the high risk of analyzing the wrong or incompatible set of attributes
Suitable for large business firms
Lacks machine learning techniques
Such as: IBM Watson Analytics, SAS Visual Analytics, etc.
Pillars of Big Data: Deployment
Many components need to be integrated together
Deployment challenges include:
◦ Complexity
◦ Scope beyond the in-house IT technicians
Pillars of Big Data: Deployment
Product:
Product deployment models are used to ensure privacy and security
Drawbacks:
◦ Cost
◦ IT staff
◦ Limited scalability
Most components are open-source platforms, but integration remains the major issue
Pillars of Big Data: Deployment
Service:
Services provided on demand; cost is pay-per-user/data
Security and privacy are concerns, as is the cost of moving data to the provider's cloud
Hybrid cloud:
Data storage and processing reside on the organization's infrastructure
Future Directions:
Each solution brings some features not available in the others, but also
adds some limitations and overheads.
While there has been a continuous improvement in analytics solutions to
address different analytics scenarios, there are still some gaps.
Conclusions:
It is difficult to select a suitable analytics solution because a weak component in
the ecosystem can cause the whole ecosystem to function inefficiently.
For each of these pillars, different approaches are discussed and popular
systems are presented.
The pillars form a taxonomy that aims to give an overview on the field, to
guide organizations and researchers to build their Big Data Analytics
Ecosystem, and help to identify challenges and opportunities in the field.


Editor's Notes

  • #3 What: Big Data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Increasingly, organizations' success depends on how quickly and efficiently they can turn the petabytes of data they collect into actionable information. Data can be structured, generated by applications like Customer Relationship Management (CRM) and Enterprise Resource Planning (ERP) systems and typically stored in rows and columns with well-defined schemas. It can be semi-structured, generated by sensors, web feeds, event monitors, stock market feeds, and network and security systems. Where: With almost everything now online, organizations look at the Big Data they collect to gain insights for improving their services. Why: Cost reduction: Big Data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can identify more efficient ways of doing business. Faster, better decision making: with the speed of Hadoop and in-memory analytics, combined with the ability to analyze new sources of data, businesses can analyze information immediately and make decisions based on what they have learned. New products and services: with the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want; Davenport points out that with Big Data analytics, more companies are creating new products to meet customers' needs. Data exploration: analysts go through the data, using ad-hoc queries and visualizations, to better understand it. Data preparation: analysts clean, prepare, and transform the data for modeling, using batch processing to run compute- and IO-intensive operations. Modeling: data models are trained on the prepared data using iterative processing, and trained models are used to score the unlabeled data.
  • #4 Storage handles the data's huge volume, fast arrival, and multiple formats; Processing meets the Big Data Analytics processing needs; Orchestration manages available resources to reduce processing time and cost; Assistance goes beyond the interface and provides suggestions to help users with decisions when selecting operations and building their analytics process; User Interface provides users with a familiar environment to build and run their analytics; Deployment Method provides scalability, security, and reliability.
  • #5 ACID: Atomicity, Consistency, Isolation, and Durability. Recent RDBMS developments promise enhanced performance and scalability. DFS examples: Hadoop Distributed File System (HDFS) and Cassandra File System (CFS). NoSQL: Voldemort and Riak use Multi-Version Concurrency Control (MVCC); other types are Column-Oriented Databases and Document-Oriented Databases.
  • #6 MapReduce, as presented in Figure 3, consists of Map, Shuffle, and Reduce phases, which are executed sequentially, utilizing all nodes in the cluster. In the Map phase, the programmer-provided Map function (Mapper) processes the input data and outputs intermediate data in the form of <key, value> tuples, which get stored on disk. The Shuffle phase then groups values with the same key together and sends them to the reduce nodes over the network. Finally, the programmer-provided Reduce function (Reducer) reads the intermediate data from disk, processes it, and generates the final output.