Stratebi Big Data

Big Data, Hadoop, Hortonworks, Pentaho, Cloudera, Cassandra, Spark, MongoDB

Published in: Data & Analytics

  • NoSQL:
    Databases for storing and querying data, mainly semi-structured data
    Support for transactions and optimized for random reads and writes: operational applications
  • http://5.196.203.197:8080
  • References and Contact Details
  • Stratebi Big Data

    1. Big Data Technologies for Enterprise Analytics
    2. Big Data Technologies: Classification of Big Data technologies; Apache Hadoop; Pentaho & Big Data; Enterprise Analytics; About StrateBI
    3. Big Data: We understand Big Data as the result of the following changes taking place in the data managed by organizations. Volume: the amount of data available in companies has grown from terabytes (10^3 GB) to petabytes (10^6 GB). Variety: the heterogeneity of the available data sources has increased significantly, so structured, semi-structured and unstructured data must all be processed. Velocity: data sources are generated and distributed faster. These three changes are the main questions for determining whether we have a Big Data scenario.
    4. Big Data technologies: Traditional business intelligence (BI) tools and processes have been overtaken by the nature of Big Data. This situation has led to the rise and development of a wide range of technologies for Big Data management. Most current Big Data technologies are Open Source. Know-how is a major problem: Which technologies should be used in each Big Data scenario? How should they be combined to succeed and to monetize Big Data management?
    5. Big Data
    6. Classification of Big Data technologies: Big Data technologies fall into 3 groups.
    7. Classification of Big Data technologies. Apache Hadoop: a framework that allows for the distributed processing of Big Data. Commodity cluster computing: it is designed to scale up from single servers to thousands of machines. A more general approach than the other Big Data technologies: simple programming models that support a wide range of applications (MapReduce, Tez, Hive, Pig, Spark...). Applications: ingestion, processing (batch and real time), ETL, SQL, machine learning, NoSQL, reporting, OLAP…
    8. Classification of Big Data technologies. Apache Hadoop in its most basic form consists of: HDFS, a distributed file system; YARN, a framework for job scheduling and cluster resource management; MapReduce, a YARN-based system for parallel processing of large data sets.
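The MapReduce programming model named on this slide can be illustrated with a minimal, single-process sketch in plain Python. It does not use the Hadoop API and does not distribute any work; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the three conceptual steps of a word count job.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework runs between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in-process over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on hadoop", "hadoop stores big files"]))
```

In a real cluster the map and reduce functions would run on many machines over blocks stored in HDFS, with YARN allocating the resources; here everything runs in one process to keep the idea visible.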
    9. Classification of Big Data technologies. NoSQL databases: storing and querying data, especially semi-structured data. They usually implement distributed storage and processing. Aimed at replacing operational databases in Big Data scenarios: a less general approach than Hadoop; some form of support for transaction management; optimized for random reads and writes.
    10. Classification of Big Data technologies. Extended RDBMS: they add features to traditional databases for storing and processing huge volumes of relational information (mainly structured data), including libraries of advanced analytical functions and support for User Defined Functions (UDF). They usually allow for distributed storage or processing. Some of them implement columnar storage, which is optimized for analytical workloads (sums, counts, averages, maximums,…). One important subtype is MPP (Massively Parallel Processing) databases: HP Vertica, Pivotal Greenplum. Well suited for OLAP applications.
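Why columnar storage suits analytical workloads can be sketched in a few lines of plain Python: once a table is pivoted into one list per column, a sum, count, average or maximum only has to read the column it needs. The data (`ROWS`) and the helper names (`to_columns`, `column_stats`) are hypothetical and exist only for this illustration; this is not how any particular database implements its storage.

```python
from typing import Dict, List

# Hypothetical sales rows, invented for illustration only.
ROWS: List[dict] = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 50.0},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot a row-oriented table into one list per column."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_stats(columns: Dict[str, list], name: str) -> Dict[str, float]:
    """Aggregate a single column without reading any other column."""
    values = columns[name]
    return {
        "sum": sum(values),
        "count": len(values),
        "avg": sum(values) / len(values),
        "max": max(values),
    }


if __name__ == "__main__":
    print(column_stats(to_columns(ROWS), "amount"))
```

With a row-oriented layout the same aggregation would have to scan every field of every row, which is the cost that column stores avoid.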
    11. Classification of Big Data technologies. An alternative classification, based on their role in a Big Data architecture: Ingestion, Storage, Processing, Orchestration, Analysis, Visualization.
    12. We provide the best technology for each application. 1. Enterprise Data Warehouse Extension: Big Data scenarios in which we would like to implement low-latency analytics such as OLAP, dashboards, reporting,…
    13. We provide the best technology for each application. 2. Website clickstream analysis:
    14. We provide the best technology for each application. 2. Website clickstream analysis – Visualization Technologies: Apache Zeppelin http://zeppelin-project.org/demo.html
    15. We provide the best technology for each application. 3. Real Time analytics: processing of data streams instead of static data sets, as in batch processing. Diagram labels: Apache HTTP Server 1, Apache HTTP Server 2, Apache HTTP Server N; Syslog Source; Kafka Channel; Avro Sink; HDFS Sink; HBase Sink; Other Sinks; Real Time Processing; Persistence; Visualizations for analysis.
    16. We provide the best technology for each application. 3. Real Time analytics – Processing Technologies. Four technologies are compared, one per column (column labels on the slide include "Interceptor" and "Trident API"):
        Processing latency: 0.05 to 0.5 s | 0.05 to 0.5 s | 0.5 to 30 s | 0.5 to 30 s
        Aggregations and windowing averages: Yes, but not fault-tolerant | Not supported | Yes, fault-tolerant | Yes, fault-tolerant
        Record-level enrichment and alerts: Yes | Yes | Yes | Yes
        Persistence of transient data: Yes, but poor performance | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase… | Yes, high performance with HDFS, HBase…
        High-level functions: No, it requires a lot of code | Yes, a very simple, configuration-based tool | Yes (joins, aggregations, ...), easier programming than Storm | Yes, a lot of libraries of functions, easier programming than Storm and Trident
        Reliability: Duplicates and data loss | More reliable than Storm and Trident | More reliable than Storm | More reliable than Storm and Trident
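The "aggregations and windowing averages" row of this comparison refers to computing statistics over fixed time slices of a stream. A minimal sketch in plain Python, assuming tumbling (non-overlapping) windows and an in-memory list of events, could look as follows. It does not use the API of any of the streaming engines compared on the slide and provides no fault tolerance; the function name and the event format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_window_averages(
    events: Iterable[Tuple[float, float]], window_seconds: float
) -> Dict[int, float]:
    """Average event values per fixed, non-overlapping time window.

    Each event is (timestamp_in_seconds, value). The result maps the window
    index (timestamp // window_seconds) to the mean value in that window.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // window_seconds)].append(value)
    return {window: sum(vals) / len(vals) for window, vals in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical response times (seconds since start, milliseconds).
    sample = [(0.5, 10.0), (4.0, 20.0), (5.0, 30.0), (12.0, 40.0)]
    print(tumbling_window_averages(sample, window_seconds=5))
```

A streaming engine would emit each window's result as soon as the window closes and would recover the partial state after a failure, which is what the fault-tolerance distinction in the table is about.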
    17. We provide the best technology for each application. 3. Real Time analytics – Visualization Technologies: JavaScript chart libraries (D3, Highcharts…) using socket connections
    18. We provide the best technology for each application. 3. Real Time analytics – Visualization Technologies: JavaScript chart libraries (D3, Highcharts…) using socket connections
    19. We provide the best technology for each application. 3. Real Time analytics – A StrateBI case study: Wikipedia updates – StrateBI demo http://bigdata.stratebi.com/
    20. We provide the best technology for each application. 3. Real Time analytics – More Technologies: Hue + Apache Solr. Diagram labels: Apache HTTP Server 1, Apache HTTP Server 2, Apache HTTP Server N; Syslog Source; Kafka Channel; Solr Sink; Solr Real Time Indexing; Hue Visualizations for analysis.
    21. We provide the best technology for each application. 3. Real Time analytics – More Technologies: Hue + Apache Solr
    22. We provide the best technology for each application. 4. Fraud detection system:
    23. Hadoop Distributions: Installing and maintaining Hadoop tools separately may become a serious issue. Hadoop distributions: software packages that include the basic Hadoop components along with other common and useful tools of the current Hadoop stack. In some cases distributions add improvements or even tools that are not Open Source (e.g. Cloudera Manager). Main benefits: packages or installers make it easy to install Hadoop on different operating systems such as Ubuntu, CentOS, Debian, Windows Server ...; easy patch management.
    24. Hadoop distributions recommended by StrateBI. Hortonworks HDP: http://hortonworks.com/ The only 100% Open Source Hadoop distribution. It only includes the latest stable versions of the Hadoop stack tools.
    25. Hadoop distributions recommended by StrateBI. Cloudera: http://www.cloudera.com Express (free) and Enterprise (commercial) versions. They include tool improvements that have not yet been incorporated into the Apache open source projects. Cloudera Manager: a proprietary tool for Hadoop cluster management and monitoring; a quite good and very reliable tool. In its free version it does not support some features that Apache Ambari does support for cluster management in Hortonworks: user and role definition, LDAP integration, management of some Hadoop services (Impala, Spark, etc ...), hot updates of cluster tools...
    26. Pentaho & Big Data: The Pentaho Business Intelligence suite has added improved support for Big Data management, processing and visualization. Pentaho Data Integration: a visual and powerful ETL design and execution tool. Pentaho Report Designer: for creating static and parameterized reports. Pentaho Metadata Editor: to define metadata for ad-hoc reporting applications (e.g. STReport). Pentaho BI Server: for developing and sharing reports, dashboards (e.g. STDashboard) and OLAP analysis (e.g. STPivot).
    27. Big Data
    28. Pentaho & Big Data. Pentaho Data Integration 6.X: full integration with the most common Hadoop distributions (Cloudera 5.X, Hortonworks 2.X, MapR). Functionalities: in-cluster ETL execution (Pentaho automatically generates and launches MapReduce code in the cluster); reading, processing and writing data and files from and to HDFS; process orchestration (MapReduce, Pig, Sqoop, Spark, Oozie); JDBC connections with Apache Hive and Impala. PDI also supports NoSQL databases: HBase, MongoDB, Cassandra (up to version 2.1).
    29. Screenshot labels: Hadoop cluster connections management; Transformations: steps for data movement and transformations; Jobs: entries for orchestration.
    30. Big Data
    31. Some Big Data success stories. Democratic Party presidential campaigns (Barack Obama): data integration from surveys, social networks, member databases...; high accuracy in forecasting results per geographic area (> 99%); better management of campaign events, advertising placement ... They won the presidential elections in 2008 and 2012. Amazon recommendation system.
    32. Some Big Data success stories. Banks and insurance companies such as Morgan Stanley and ING Direct have adopted Big Data for fraud detection, risk analysis in loans and insurance, customer churn prevention, ... The package delivery company UPS invests $1 million a year in Big Data. It uses the data generated by the sensors installed in its vehicles to optimize routes, fuel consumption, maintenance, CO2 emissions ... UPS saves 50 million dollars a year in gasoline through its management of Big Data.
    33. Some Big Data success stories. T-Mobile USA uses Big Data to reduce its churn rate by integrating data from billing, calls and social networks. All raw data is stored in a Hadoop Data Lake. It generates a 360-degree view of each customer, which is used to address customer dissatisfaction. “Tribal” customer model: identifying people who have a high influence on others due to their large social network. If such a customer switches telecom provider, it could cause a domino effect. Customer Lifetime Value is calculated for each of these customers.
    34. Some Big Data success stories. T-Mobile USA uses Big Data to reduce its churn rate. The churn expectancy of a customer is based on different analyses. Billing analysis: where, for how long and with whom a user calls or texts. Calls going to a different provider could indicate that the customer's social network is switching. Dropped call analysis: for example, proactively detecting whether the user has limited coverage in their usual geographical area of movement, in order to offer solutions such as a new phone or a femtocell to extend indoor coverage. Sentiment analysis: social network data combined with other data collected from the customer, such as surveys or previous customer complaints. As a result, T-Mobile cut churn rates by 50% in just one quarter.
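The billing-analysis idea on this slide, that calls going to a different provider may signal a customer's social network is switching, can be turned into a simple per-customer indicator. The sketch below is only an illustration under stated assumptions: the call records (`CALLS`), the provider labels and the function `off_network_share` are hypothetical, and nothing here reflects the model T-Mobile actually used.

```python
from typing import Dict, List, Tuple

# Hypothetical call records: (caller_id, callee_provider). Invented for illustration.
CALLS: List[Tuple[str, str]] = [
    ("alice", "own"), ("alice", "other"), ("alice", "other"),
    ("bob", "own"), ("bob", "own"),
]


def off_network_share(
    calls: List[Tuple[str, str]], own_provider: str = "own"
) -> Dict[str, float]:
    """Return, per caller, the fraction of calls that went to another provider."""
    totals: Dict[str, int] = {}
    off_net: Dict[str, int] = {}
    for caller, provider in calls:
        totals[caller] = totals.get(caller, 0) + 1
        if provider != own_provider:
            off_net[caller] = off_net.get(caller, 0) + 1
    return {caller: off_net.get(caller, 0) / total for caller, total in totals.items()}


if __name__ == "__main__":
    for caller, share in off_network_share(CALLS).items():
        print(caller, round(share, 2))
```

A rising share over time would be one input among several; the slide combines it with dropped call analysis and sentiment analysis before estimating churn.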
    35. StrateBI & Big Data success stories. StrateBI has successfully applied the previously discussed Big Data technologies. Big Data analysis for decision making in agriculture: real-time data generated by sensors installed on farms is ingested and integrated with weather data sources in order to generate alerts and obtain predictions. Social network analysis. Technological surveillance for a security company: detection and prevention of attacks or dangerous scenarios by analyzing data from social networks combined with customer data. Detecting trends in social networks for business digital content management: intelligent content publishing.
    36. Real time analysis of Big Data for decision making in agriculture
    37. Analysis of data generated by a field of solar panels
    38. Detecting trends in social networking
    39. Why StrateBI for Big Data projects? Recognized Big Data specialists in Spain (Hadoop, Spark, Hive, Flume, Hortonworks, Cloudera, Cassandra, HP Vertica…), backed by the projects and training we have delivered for organizations such as Boeing, Telefónica Educación Digital (TED), Gobierno de España, Schibsted Group, Prosegur, INCIBE (National Institute of Cybersecurity)… Spanish leaders in Open Source BI (Pentaho, Talend, Mondrian, Ctools, Saiku…): StrateBI has put into production hundreds of Business Intelligence systems with Pentaho for large organizations such as BBVA, Telefónica, Globalia, Prosegur, ALD, Gobiernos de La Rioja, Extremadura, Baleares, Eroski, Equifax, Unilever, Amnistía Internacional, Caixa De Enginyers, Schibsted, etc.
    40. Private Sector
    41. Public Sector
    42. More Info: www.TodoBI.com, info@stratebi.com, www.stratebi.com. Tel: 91.788.34.10. Madrid: Avenida de Brasil, 17, Planta 16. Barcelona: C/ Valencia, 63. Brazil: Av. Paulista, 37 4 andar.
