Big Data
Technologies for Enterprise Analytics
Big Data
Technologies
Classification of Big Data technologies
Apache Hadoop
Pentaho & Big Data
Enterprise Analytics
About StrateBI
Big Data
We understand Big Data as the result of the following changes
taking place in the data that organizations manage:
An increase in the Volume of data available in companies
From terabytes (10^3 GB) to petabytes (10^6 GB)
A significant increase in the Variety, or heterogeneity, of the
available data sources
Structured, semi-structured and unstructured data must all be processed
An increase in the Velocity at which data is generated and distributed
These three criteria are the main questions for determining whether
we face a Big Data scenario
Big Data technologies
Traditional Business Intelligence (BI) tools and processes have
been overtaken by the nature of Big Data
This situation has led to the rise and development of a wide
range of technologies for Big Data management
Most current Big Data technologies are Open Source
Know-how: a major problem
Which technologies should be used in each Big Data scenario?
How should they be combined to succeed with, and monetize, Big Data
management?
Classification of Big Data technologies
Big Data technologies fall into 3 groups: Apache Hadoop, NoSQL databases and extended RDBMS
Classification of Big Data technologies
Apache Hadoop:
A framework that allows for the distributed processing of Big
Data
Commodity cluster computing: it is designed to scale up
from single servers to thousands of machines
A more general approach than the other Big Data
technologies:
Simple programming models that support a wide range of
applications: MapReduce, Tez, Hive, Pig, Spark...
Applications: ingestion, processing (batch and real time), ETL,
SQL, machine learning, NoSQL, reporting, OLAP…
Classification of Big Data technologies
Apache Hadoop in its most basic form consists of:
HDFS: A distributed file system
YARN: A framework for job scheduling and cluster resource
management
MapReduce: A YARN-based system for parallel processing of
large data sets
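The MapReduce programming model named above can be illustrated without a cluster. What follows is a minimal single-process sketch in plain Python, not Hadoop's own API: the function names `map_phase`, `shuffle` and `reduce_phase`, and the sample lines, are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical input; on a real cluster these lines would be HDFS blocks.
sample_lines = ["big data on hadoop", "hadoop stores big files"]
counts = word_count(sample_lines)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows the data flow between the phases.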
Classification of Big Data technologies
NoSQL databases
Store and query data, especially semi-structured data
They usually implement distributed storage and processing
Aimed at replacing operational databases in Big Data scenarios:
A less general approach than Hadoop
Some form of support for transaction management
Optimized for random reads and writes
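To make "semi-structured data" and "random reads and writes" concrete, here is a toy in-memory sketch in Python. It is not the API of any real NoSQL product: the `DocumentStore` class, its methods and the sample documents are hypothetical, and it ignores distribution and transactions entirely.

```python
from typing import Any, Dict, List, Optional


class DocumentStore:
    """Toy in-memory key-document store (illustrative only)."""

    def __init__(self) -> None:
        self._documents: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, document: Dict[str, Any]) -> None:
        """Random write: store or overwrite one document by key."""
        self._documents[key] = dict(document)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        """Random read: fetch one document by key, or None if absent."""
        return self._documents.get(key)

    def find(self, field: str, value: Any) -> List[str]:
        """Return the sorted keys of documents whose field equals value."""
        return sorted(
            key
            for key, doc in self._documents.items()
            if doc.get(field) == value
        )


store = DocumentStore()
# Documents need not share the same fields (semi-structured data).
store.put("user:1", {"name": "Ana", "city": "Madrid"})
store.put("user:2", {"name": "Luis", "city": "Madrid", "tags": ["vip"]})
store.put("user:3", {"name": "Eva"})
madrid_users = store.find("city", "Madrid")
```

Each record is addressed directly by its key, which is the access pattern operational applications need, in contrast to the full scans typical of batch processing.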
Classification of Big Data technologies
Extended RDBMS
Add features to traditional databases for storing and processing
huge volumes of relational information (mainly structured data)
They include libraries of advanced analytical functions and support User
Defined Functions (UDF)
They usually allow for distributed storage or processing
Some of them implement columnar storage, which is optimized for analytical
workloads (sums, counts, averages, maximums…)
One important subtype is MPP (Massively Parallel Processing)
databases
HP Vertica, Pivotal Greenplum
Well suited to OLAP applications
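The reason columnar storage favors sums, counts, averages and maximums can be sketched in a few lines of Python. This is only an illustration of the layout idea, not how any of the named databases is implemented; the `rows` data and the `to_columns` helper are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List

# Hypothetical sales records in a row-oriented layout: one dict per row.
rows: List[Dict[str, Any]] = [
    {"region": "north", "units": 10, "price": 2.5},
    {"region": "south", "units": 4, "price": 3.0},
    {"region": "north", "units": 6, "price": 2.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one contiguous list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An analytical query reads only the columns it needs and never
# touches the others (here, "region" is not read at all).
total_units = sum(columns["units"])
max_price = max(columns["price"])
avg_units = mean(columns["units"])
```

In a row layout the same aggregates would have to walk every field of every record; storing each column contiguously lets the engine skip unneeded columns and compress similar values together.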
Classification of Big Data technologies
An alternative classification, based on their role in a Big Data
architecture:
Ingestion, Storage, Processing, Orchestration, Analysis, Visualization
We provide the best technology for each application
1. Enterprise Data Warehouse Extension:
Big Data scenarios in which we want to implement low-latency
analytics such as OLAP, dashboards, reporting…
We provide the best technology for each application
2. Website clickstream analysis:
We provide the best technology for each application
2. Website clickstream analysis – Visualization Technologies
Apache Zeppelin
http://zeppelin-project.org/demo.html
We provide the best technology for each application
3. Real Time analytics
Processes data streams rather than the static data sets used in batch
processing
[Diagram: Apache HTTP Server 1, 2 … N; Syslog Source; Kafka Channel; Avro Sink, HDFS Sink, HBase Sink and other sinks; real-time processing; persistence; visualizations for analysis]
We provide the best technology for each application
3. Real Time analytics – Processing Technologies
| Criterion | [Storm] | Interceptor | Trident API | [name not preserved in source] |
|---|---|---|---|---|
| Processing latency | 0.05 to 0.5 s | 0.05 to 0.5 s | 0.5 to 30 s | 0.5 to 30 s |
| Aggregations and windowed averages | Yes, but not fault-tolerant | Not supported | Yes, fault-tolerant | Yes, fault-tolerant |
| Record-level enrichment and alerts | Yes | Yes | Yes | Yes |
| Persistence of transient data | Yes, but poor performance | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase… |
| High-level functions | No. It requires a lot of code | Yes. Very simple, configuration-based tool | Yes. Joins, aggregations... Easier programming than Storm | Yes, many function libraries. Easier programming than Storm and Trident |
| Reliability | Duplicates and data loss | More reliable than Storm and Trident | More reliable than Storm | More reliable than Storm and Trident |
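The aggregation and windowed-average capability compared above can be illustrated with a tumbling (fixed, non-overlapping) window. The sketch below is plain Python and does not use the API of any of the compared tools; the function name, the event format and the sample values are assumptions, and it offers no fault tolerance.

```python
from typing import Dict, Iterable, Tuple


def tumbling_window_averages(
    events: Iterable[Tuple[float, float]], window_seconds: float
) -> Dict[int, float]:
    """Average event values per fixed, non-overlapping time window.

    Each event is a (timestamp_in_seconds, value) pair. The result maps a
    window index (timestamp // window_seconds) to the mean of its values.
    """
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for timestamp, value in events:
        window = int(timestamp // window_seconds)
        # Running totals are updated per event, so the full stream is
        # never held in memory.
        sums[window] = sums.get(window, 0.0) + value
        counts[window] = counts.get(window, 0) + 1
    return {window: sums[window] / counts[window] for window in sums}


# Hypothetical response times (ms) arriving as a stream of events.
events = [(0.2, 100.0), (3.9, 300.0), (5.1, 50.0), (9.8, 150.0), (12.0, 80.0)]
averages = tumbling_window_averages(events, window_seconds=5.0)
```

A streaming engine would additionally emit each window as soon as it closes and checkpoint the running totals so that a node failure does not lose them, which is what the "fault-tolerant" entries in the comparison refer to.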
We provide the best technology for each application
3. Real Time analytics – Visualization Technologies
JavaScript chart libraries (D3, Highcharts…) fed over socket connections
We provide the best technology for each application
3. Real Time analytics – A StrateBI case study
Wikipedia updates – Demo StrateBI
http://bigdata.stratebi.com/
We provide the best technology for each application
3. Real Time analytics – More Technologies
Apache Hue + Solr
[Diagram: Apache HTTP Server 1, 2 … N; Syslog Source; Kafka Channel; Solr Sink; Solr real-time indexing; Hue visualizations for analysis]
We provide the best technology for each application
4. Fraud detection system:
Hadoop Distributions
Installing and maintaining Hadoop tools separately may
become a serious issue
Hadoop distributions: software packages that include the basic
Hadoop components along with other common and useful tools
of the current Hadoop stack
In some cases distributions add improvements or even tools that are not Open
Source (e.g. Cloudera Manager)
Main benefits
Packages or installers: easy to install Hadoop on different operating
systems such as Ubuntu, CentOS, Debian, Windows Server...
Easy patch management
Hadoop distributions recommended by StrateBI
Hortonworks HDP: http://hortonworks.com/
The only 100% Open Source Hadoop Distribution
Includes only the latest stable versions of the Hadoop stack tools
Hadoop distributions recommended by StrateBI
Cloudera: http://www.cloudera.com
Express (free) and Enterprise (commercial) versions
They include tool improvements that have not yet been
incorporated into the Apache open source projects
Cloudera Manager: a proprietary tool for Hadoop cluster
management and monitoring
A good and very reliable tool
Its free version does not support some features that Apache
Ambari does support for cluster management in Hortonworks:
User and role definition, LDAP integration, management of
some Hadoop services (Impala, Spark, etc.), hot updates of
cluster tools...
Pentaho & Big Data
The Pentaho Business Intelligence suite has added improved
support for Big Data management, processing and visualization
Pentaho Data Integration
Visual and powerful ETL design and execution tool
Pentaho Reporting Designer
For creating static and parametrized reports
Pentaho Metadata Editor
To define metadata for Ad-Hoc reporting applications (e.g. STReport)
Pentaho BI Server
For developing and sharing reports, dashboards (e.g. STDashboard) and
OLAP Analysis (e.g. STPivot)
Pentaho & Big Data
Pentaho Data Integration 6.X
Full integration with the most common Hadoop distributions
Cloudera 5.X, Hortonworks 2.X, MapR
Functionalities
In-cluster ETL execution: Pentaho automatically generates and launches
MapReduce code in the cluster
Reading, processing and writing data and files from and to HDFS
Process orchestration: MapReduce, Pig, Sqoop, Spark, Oozie
JDBC connection with Apache Hive and Apache Impala
PDI also supports NoSQL databases
HBase, MongoDB, Cassandra (up to version 2.1)
Hadoop cluster connection management
Transformation steps for data movement and transformation
Job entries for orchestration
Some Big Data success stories:
Democratic Party presidential campaigns (Barack Obama)
Data integration from surveys, social networks, member databases...
High accuracy in forecasting results per geographic area (> 99%)
Better management of campaign events, advertising placement...
They won the presidential elections in 2008 and 2012
Amazon recommendation system
Some Big Data success stories:
Banks and insurance companies such as Morgan Stanley and ING
Direct have adopted Big Data:
Fraud detection, risk analysis in loans and insurance, customer churn
prevention...
The package delivery company UPS invests $1 million a year in
Big Data
It uses the data generated by the sensors installed in its vehicles to optimize
routes, fuel consumption, maintenance, CO2 emissions...
UPS saves 50 million dollars a year in gasoline through its management of
Big Data
Some Big Data success stories:
T-Mobile USA uses Big Data to reduce churn rate
By integrating data from billing, calls and social networks
All raw data is stored in a Hadoop data lake
It generates a 360-degree view of each customer, used to address
customer dissatisfaction
“Tribal” customer model
Identifies people who have a high influence on others because of their large
social network: if such a customer switches telecom provider, it could
cause a domino effect
Customer Lifetime Value is calculated for each of these customers
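One simple way to approximate "people with a large social network" is to count each customer's distinct contacts in the call records. The Python sketch below is an illustration under that assumption only; it is not T-Mobile's actual model, and the call records and the `contact_counts` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical call records as (caller, callee) pairs.
calls = [
    ("ana", "luis"),
    ("ana", "eva"),
    ("ana", "juan"),
    ("luis", "eva"),
    ("juan", "ana"),
]


def contact_counts(call_records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct people each customer has been in contact with."""
    contacts: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in call_records:
        # A call links both parties, whoever placed it.
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {customer: len(people) for customer, people in contacts.items()}


influence = contact_counts(calls)
most_connected = max(influence, key=influence.get)
```

Customers with the highest counts would be the first candidates for the retention effort described above, since their departure is the most likely to trigger the domino effect.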
Some Big Data success stories:
T-Mobile USA uses Big Data to reduce churn rate
A customer's churn expectancy is based on several analyses
Billing analysis: where, for how long and with whom a user calls or texts.
Calls going to a different provider could indicate that the customer's social
network is switching
Dropped-call analysis: for example, proactively detecting whether the user has
limited coverage in the geographical area where they usually move, in order to offer
solutions such as a new phone or a femtocell to extend coverage
indoors
Sentiment analysis: social network data combined with other data
collected from the customer, such as surveys or previous complaints
As a result, T-Mobile reduced churn rates by 50% in just one
quarter
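The billing signal described above, calls going to a different provider, can be expressed as a simple ratio. This Python sketch is a hypothetical illustration, not the operator's actual scoring: the function name, the provider labels and the sample data are assumptions.

```python
from typing import Sequence


def off_network_share(callee_providers: Sequence[str], home_provider: str) -> float:
    """Fraction of a customer's calls that go to another provider."""
    if not callee_providers:
        return 0.0
    off_network = sum(
        1 for provider in callee_providers if provider != home_provider
    )
    return off_network / len(callee_providers)


# Hypothetical provider of the person called, one entry per call.
callee_providers = ["home", "home", "other", "other", "other"]
share = off_network_share(callee_providers, home_provider="home")
```

A share that rises from one billing period to the next would suggest that the customer's contacts are moving to another provider, which is the warning sign the billing analysis looks for.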
StrateBI & Big Data success stories:
StrateBI has successfully applied the previously discussed Big Data
technologies:
Big Data analysis for decision making in agriculture
Real-time data generated by sensors installed on farms is ingested and
integrated with weather data sources in order to generate alerts and
obtain predictions
Social network analysis
Technological surveillance for a security company
Detection and prevention of attacks or dangerous scenarios by
analyzing data from social networks combined with customer data
Detecting trends in social networks for business digital content
management
Intelligent content publishing
Real time analysis of Big Data for decision making in agriculture
Analysis of data generated by a field of solar panels
Detecting trends in social networking
Why StrateBI for Big Data projects?
Recognized Big Data specialists in Spain (Hadoop, Spark, Hive,
Flume, Hortonworks, Cloudera, Cassandra, HP Vertica…)
Backed by projects and training carried out with organizations
such as Boeing, Telefónica Educación Digital (TED), Gobierno de
España, Schibsted Group, Prosegur, INCIBE (National Institute of
Cybersecurity)…
Spanish leaders in Open Source BI (Pentaho, Talend,
Mondrian, CTools, Saiku…)
StrateBI has taken hundreds of Business
Intelligence systems built with Pentaho into production for large organizations such as
BBVA, Telefónica, Globalia, Prosegur, ALD, Gobiernos de La
Rioja, Extremadura, Baleares, Eroski, Equifax, Unilever, Amnistía
Internacional, Caixa De Enginyers, Schibsted, etc.
About Us
Private Sector
Public Sector
www.TodoBI.com
info@stratebi.com
www.stratebi.com
More Info
Tel: 91.788.34.10
Madrid: Avenida de Brasil, 17, Planta 16
Barcelona: C/ Valencia, 63
Brasil: Av. Paulista, 37 4 andar

Editor's Notes

  • #9  NoSQL: databases for storing and querying data, mainly semi-structured. Support for transactions and optimized for random reads and writes: operational applications
  • #25 http://5.196.203.197:8080
  • #43 References and contact details