The document discusses Hadoop and big data technologies. It begins with an introduction to big data concepts and the various Hadoop components like HDFS, MapReduce, YARN, Hive, Pig and Mahout. It then explains how big data is different from traditional data warehousing through the concept of schema-on-read. Finally, it provides recommendations on tools for working with big data technologies locally and in the cloud, as well as sources of inspiration like sandbox environments, Apache projects and GitHub.
This document discusses strategies for handling mutable data in Hive's immutable data model. It presents several approaches including full refresh, full merge and replace, and partition-level merge and replace. The partition-level strategies allow merging incremental data updates into existing partitions in Hive tables. The document provides examples using Pig to filter, join, and load data to demonstrate performing incremental updates at the partition level. It evaluates the tradeoffs of different strategies based on data volumes and change rates.
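The partition-level merge-and-replace strategy described above can be sketched in plain Python (a toy model for illustration, not Hive itself): only the partitions touched by an incremental batch are rewritten, and within a touched partition an updated row replaces the old one.

```python
from collections import defaultdict

def partition_level_merge(existing, updates, key, partition_col):
    """Rewrite only the partitions touched by an incremental batch;
    within a touched partition, an updated row replaces the old row."""
    touched = {row[partition_col] for row in updates}
    merged = defaultdict(dict)
    for row in existing:
        merged[row[partition_col]][row[key]] = row    # current state
    for row in updates:
        merged[row[partition_col]][row[key]] = row    # last write wins
    return {p: list(rows.values()) for p, rows in merged.items()}, touched

existing = [{"id": 1, "dt": "2015-01-01", "v": 1},
            {"id": 2, "dt": "2015-01-02", "v": 2}]
updates  = [{"id": 2, "dt": "2015-01-02", "v": 99},   # changed row
            {"id": 3, "dt": "2015-01-02", "v": 3}]    # new row
table, touched = partition_level_merge(existing, updates, "id", "dt")
```

In Hive terms, `touched` is the set of partitions you would regenerate (for example with `INSERT OVERWRITE ... PARTITION`), while all other partitions stay untouched on disk.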
This document summarizes key differences between SQL and NoSQL databases. It discusses the history and advantages of SQL, including its ubiquitous use, standardization, and ability to optimize queries based on schema. However, SQL has limitations for sparse and complex data. The document then introduces NoSQL databases and how they are designed for scale-out. It uses HBase as an example to illustrate how NoSQL databases can solve SQL's distributed join challenges by co-locating related data without schemas. While this improves performance, it also has disadvantages for modeling relationships and exploring data.
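The co-location idea can be illustrated with a toy sorted key-value store (the row-key scheme and function names here are illustrative, not the HBase API): rows that would require a distributed join in SQL share a row-key prefix, so one contiguous scan returns a parent and all its children together.

```python
import bisect

# A toy sorted key-value store in the spirit of an HBase region: related
# rows share a row-key prefix, so one contiguous scan replaces a join.
store = {}

def put(row_key, value):
    store[row_key] = value

def scan_prefix(prefix):
    keys = sorted(store)                      # HBase keeps rows sorted by key
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append((keys[i], store[keys[i]]))
        i += 1
    return out

put("user42", {"name": "Ada"})
put("user42#order#001", {"total": 30})
put("user42#order#002", {"total": 12})
put("user43", {"name": "Bob"})
```

A scan for the prefix `user42` returns the user row and both order rows in one pass; no join and no schema describing the relationship is needed.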
This document discusses the challenges of implementing SQL on Hadoop. It begins by explaining why SQL is useful for Hadoop, as it provides a familiar syntax and separates querying logic from implementation. However, Hadoop's architecture presents challenges for matching the functionality of a traditional data warehouse. Key challenges discussed include random data placement in HDFS, limitations on indexing due to this random placement, difficulties performing joins without data colocation, and limitations of existing "indexing" approaches in systems like Hive. The document explores approaches some systems are taking to address these issues.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from, so how do you select the right tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos - Lester Martin
A walk-through of core Hadoop, the ecosystem tools, and the Hortonworks Data Platform (HDP), followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has opened up better career, salary and job opportunities for many professionals.
This document provides an overview of a SQL-on-Hadoop tutorial. It introduces the presenters and discusses why SQL is important for Hadoop, as MapReduce is not optimal for all use cases. It also notes that while the database community knows how to efficiently process data, SQL-on-Hadoop systems face challenges due to the limitations of running on top of HDFS and Hadoop ecosystems. The tutorial outline covers SQL-on-Hadoop technologies like storage formats, runtime engines, and query optimization.
Hadoop or Spark: is it an either-or proposition? - Slim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Hadoop and object stores: Can we do it better? - gvernik
Strata Data Conference, London, May 2017
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used.
This Hadoop Hive tutorial covers a complete introduction to Hive, Hive architecture, Hive commands, Hive fundamentals and HiveQL. Fundamental concepts of Big Data & Hadoop are also covered extensively.
At the end, you'll have a solid grounding in Hadoop Hive basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop and targeted at SQL programmers. Hive lets SQL programmers work directly in the Hadoop ecosystem without any prerequisite knowledge of Java or other programming languages. HiveQL, a language similar to SQL, is used to manage and query data by translating statements into Hadoop MapReduce operations.
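Conceptually, Hive compiles a statement such as `SELECT word, COUNT(*) FROM docs GROUP BY word` into MapReduce phases. A rough Python sketch of that translation (a simplified model, not Hive's actual planner):

```python
from itertools import groupby
from operator import itemgetter

# Roughly what a GROUP BY compiles to: a map phase emitting (key, 1) pairs,
# a shuffle that sorts and groups by key, and a reduce phase summing counts.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    shuffled = sorted(pairs, key=itemgetter(0))        # the shuffle/sort step
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(shuffled, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big", "data lake"]))
```

The SQL programmer writes only the declarative query; the map/shuffle/reduce plumbing below is generated for them.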
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in Big Data & Hadoop featuring real-time projects, 24/7 lifetime support and 100% placement assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases - OReillyStrata
The document summarizes Carl Steinbach's presentation on SQL on Hadoop. It discusses how earlier systems like Hive had limitations for analytics workloads due to using MapReduce. A new architecture runs PostgreSQL on worker nodes co-located with HDFS data to enable push-down query processing for better performance. Citus Data's CitusDB product was presented as an example of this architecture, allowing SQL queries to efficiently analyze petabytes of data stored in HDFS.
This talk was held at the 11th meeting on April 7 2014 by Marcel Kornacker.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
HBase and Drill: How loosely typed SQL is ideal for NoSQL - DataWorks Summit
The document discusses how complex data structures can be modeled in a database using an extended relational model. It begins with an agenda that includes discussing loose typing, examples of what can be done, and looking at a real database with 10-20x fewer tables. It then contrasts the traditional relational model with HBase and discusses how structuring allows complex objects in fields and references between objects. Examples are given of modeling time-series data and music metadata in fewer tables using these techniques. Apache Drill is presented as a way to perform SQL queries over these complex data structures.
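The "fewer tables" point can be made concrete with a small sketch (the record and field names are invented for illustration): a nested record embeds its children where a strict relational model would need a separate table and a join, and a FLATTEN-style operation recovers flat rows on demand, much as Apache Drill does.

```python
# One "album" record embeds its tracks; a strict relational model would
# need an albums table, a tracks table, and a join between them.
album = {
    "album": "Kind of Blue",
    "artist": "Miles Davis",
    "tracks": [
        {"title": "So What", "seconds": 545},
        {"title": "Blue in Green", "seconds": 337},
    ],
}

def flatten(record, list_field):
    """Expand one nested record into flat rows, in the spirit of Drill's FLATTEN."""
    parent = {k: v for k, v in record.items() if k != list_field}
    return [{**parent, **child} for child in record[list_field]]

rows = flatten(album, "tracks")
```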
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd - IBM Analytics
Originally Published on Oct 27, 2014
An overview of IBM's audited Hadoop-DS comparing IBM Big SQL, Cloudera Impala and Hortonworks Hive for performance and SQL compatibility. For more information, visit: http://www-01.ibm.com/software/data/infosphere/hadoop/
Hadoop Architecture Options for Existing Enterprise Data Warehouse - Asis Mohanty
The document discusses various options for integrating Hadoop with an existing enterprise data warehouse (EDW). It describes 7 options: 1) Teradata Unified Data Architecture, 2) using an existing EDW with a new Apache Hadoop cluster, 3) using an existing EDW with a new Cloudera Hadoop cluster, 4) using an existing EDW with a new Hortonworks Hadoop cluster, 5) IBM PureData, 6) Oracle Big Data Appliance, and 7) SAP HANA for Hadoop integration. Each option involves using the existing EDW for structured data and Hadoop for unstructured/semi-structured data, with analytics capabilities available across both platforms.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
This document discusses Robert Vandehey's experience moving from C#/.NET to Hadoop/MongoDB at Rovi Corporation. The three main points are:
1) Rovi was experiencing slow ETL and data loading times from their SQL databases into their applications. They implemented a solution using Hadoop to extract, transform, and load the data, and MongoDB to serve the data to applications.
2) Some challenges in the transition included moving the .NET development team to Linux/Java, backwards compatibility for web services, and writes to MongoDB overwhelming disks.
3) Lessons learned include using Hadoop and Pig for as much processing as possible, properly sizing and configuring MongoDB for writes from H
Tez is the next-generation Hadoop query-processing framework, built on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
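The graph dataflow idea can be sketched with a toy DAG executor (a few dozen lines of illustrative Python, nothing like the real Tez runtime): stages declare which upstream stages they depend on, and a multi-stage query runs as one job whose stages hand data to each other directly, instead of separate MapReduce jobs that each persist intermediate results.

```python
def run_dag(stages, deps, initial):
    """stages: name -> fn(inputs dict) -> value; deps: name -> upstream names.
    Runs each stage once all of its upstream stages have produced output."""
    done = {"input": initial}
    pending = dict(stages)
    while pending:
        for name, fn in list(pending.items()):
            if all(d in done for d in deps[name]):
                done[name] = fn({d: done[d] for d in deps[name]})
                del pending[name]
    return done

# filter -> aggregate -> report expressed as a single three-stage dataflow
result = run_dag(
    stages={
        "filter":    lambda i: [x for x in i["input"] if x % 2 == 0],
        "aggregate": lambda i: sum(i["filter"]),
        "report":    lambda i: {"evens": i["filter"], "total": i["aggregate"]},
    },
    deps={"filter": ["input"], "aggregate": ["filter"],
          "report": ["filter", "aggregate"]},
    initial=[1, 2, 3, 4],
)
```

Note that "report" consumes the output of two upstream stages, something a linear chain of MapReduce jobs cannot express without extra plumbing.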
High Performance Python on Apache Spark - Wes McKinney
This document contains the slides from a presentation given by Wes McKinney on high performance Python on Apache Spark. The presentation discusses why Python is an important and productive language, defines what is meant by "high performance Python", and explores techniques for building fast Python software such as embracing limitations of the Python interpreter and using native data structures and compiled extensions where needed. Specific examples are provided around control flow, reading CSV files, and the importance of efficient in-memory data structures.
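One theme of the talk, preferring efficient native data structures over boxed Python objects, can be shown with a small standard-library experiment (a generic illustration, not an example from the slides): the `array` module stores 100,000 integers as contiguous machine words, while a list holds 100,000 pointers to full Python int objects.

```python
import sys
from array import array

# When crunching numeric data, prefer compact native containers over
# lists of boxed Python objects.
numbers = list(range(100_000))
boxed = numbers                      # 100k pointers to full int objects
packed = array("q", numbers)         # 100k contiguous 64-bit C integers

# Total memory: the list's pointer block plus every int object it points
# to, versus the array's single contiguous buffer.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
packed_bytes = sys.getsizeof(packed)
```

Both containers hold the same values, but the packed form is several times smaller, which is the same motivation behind NumPy arrays and the compiled extensions the talk discusses.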
This document discusses Talend, an open source data integration tool that allows users to easily extract, load, and transform big data. It can be used to create workflows that move data between various file formats and databases. Talend provides drag-and-drop components to build ETL jobs without coding, and it has native support for popular big data technologies like HDFS, Hive, Pig and Sqoop. The document explains how Talend can be used to load data into HDFS, transform it with Pig, load it into a Hive data warehouse, and integrate relational databases with Hadoop using Sqoop.
1) Columnar formats like Parquet, Kudu and Arrow provide more efficient data storage and querying by organizing data by column rather than row.
2) Parquet provides an immutable columnar format well-suited for storage, while Kudu allows for mutable updates but is optimized for scans. Arrow provides an in-memory columnar format focused on CPU efficiency.
3) By establishing common in-memory and on-disk columnar standards, Arrow and Parquet enable more efficient data sharing and querying across systems without serialization overhead.
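The row-versus-column distinction in points 1-3 can be sketched in a few lines of Python (a toy model of the layout, not the Parquet or Arrow formats themselves): row layout stores whole records together, while columnar layout stores each field contiguously, so a query touching one column reads only that column.

```python
# Row layout: whole records together. Columnar layout: each field stored
# contiguously, so a scan of one column never touches the others.
rows = [("alice", 30, "NY"), ("bob", 25, "SF"), ("carol", 35, "NY")]

def to_columnar(rows, names):
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}

table = to_columnar(rows, ["name", "age", "city"])

# "SELECT avg(age)" scans one contiguous array instead of every record.
avg_age = sum(table["age"]) / len(table["age"])
```

Contiguous same-typed values are also what makes column-wise compression and vectorized (CPU-cache-friendly) processing effective, which is the efficiency Arrow and Parquet are built around.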
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
Big Data Warehousing: Pig vs. Hive Comparison - Caserta
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. The presentation included a Hive vs. Pig comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Overview of Big Data, Hadoop and Microsoft BI - version 1 - Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of big data & hadoop version 1 - Tony Nguyen, Thanh Nguyen
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and an existing data warehouse can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how it can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook - Amr Awadallah
Hadoop was developed to solve problems with data warehousing systems at Yahoo and Facebook that were limited in processing large amounts of raw data in real-time. Hadoop uses HDFS for scalable storage and MapReduce for distributed processing. It allows for agile access to raw data at scale for ad-hoc queries, data mining and analytics without being constrained by traditional database schemas. Hadoop has been widely adopted for large-scale data processing and analytics across many companies.
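The schema-on-read access to raw data described above can be sketched in Python (the log fields here are invented for illustration): events land as untyped text, and a schema is applied only at query time, so new fields can appear without migrating any tables.

```python
import json

# Schema-on-read: raw events are stored as-is; a schema is imposed only
# when a query runs, so new fields need no table migration.
raw_log = [
    '{"user": "a", "action": "click"}',
    '{"user": "b", "action": "view", "ms": 120}',  # new field, no migration
]

def read_with_schema(lines, fields):
    for line in lines:
        rec = json.loads(line)
        yield tuple(rec.get(f) for f in fields)    # missing fields become None

events = list(read_with_schema(raw_log, ["user", "ms"]))
```

Contrast this with schema-on-write in a traditional warehouse, where the second event would be rejected (or require an ALTER TABLE) before it could be loaded at all.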
Eric Baldeschwieler Keynote from Storage Developers Conference - Hortonworks
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for massive data storage, enormous processing power, and the ability to handle large numbers of concurrent tasks across clusters of commodity hardware. The framework includes Hadoop Distributed File System (HDFS) for reliable data storage and MapReduce for parallel processing of large datasets. An ecosystem of related projects like Pig, Hive, HBase, Sqoop and Flume extend the functionality of Hadoop.
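HDFS's reliable-storage idea can be shown in miniature (a toy model with made-up node names, nothing like the real HDFS protocol): a file is split into fixed-size blocks and each block is replicated to several nodes, so losing one node loses no data.

```python
# HDFS in miniature: split a file into fixed-size blocks and replicate
# each block onto several nodes.
BLOCK_SIZE = 4
REPLICATION = 2

def place_blocks(data, nodes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for i, block in enumerate(blocks):
        replicas = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
        placement[i] = (block, replicas)
    return placement

layout = place_blocks("hello world!", ["n1", "n2", "n3"])

def read_file(layout, dead_node):
    """Reassemble the file even though one node is down."""
    out = []
    for i in sorted(layout):
        block, replicas = layout[i]
        assert any(n != dead_node for n in replicas)  # a live replica remains
        out.append(block)
    return "".join(out)
```

Real HDFS defaults to 128 MB blocks and a replication factor of 3, and places replicas rack-aware, but the recovery property is the same: any single node can fail and every block still has a live copy.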
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations, save the result, and show it via a BI tool.
This presentation provides an overview of big data concepts and Hadoop technologies. It discusses what big data is and why it is important for businesses to gain insights from massive data. The key Hadoop technologies explained include HDFS for distributed storage, MapReduce for distributed processing, and various tools that run on top of Hadoop like Hive, Pig, HBase, HCatalog, ZooKeeper and Sqoop. Popular Hadoop SQL databases like Impala, Presto and Stinger are also compared in terms of their performance and capabilities. The document discusses options for deploying Hadoop on-premise or in the cloud and how to integrate Microsoft BI tools with Hadoop for big data analytics.
Content presented at a talk on Aug. 29th. Purpose is to inform a fairly technical audience on the primary tenets of Big Data and the hadoop stack. Also, did a walk-thru' of hadoop and some of the hadoop stack i.e. Pig, Hive, Hbase.
Architecting the Future of Big Data and SearchHortonworks
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Oracle Unified Information Architeture + Analytics by ExampleHarald Erb
Der Vortrag gibt zunächst einen Architektur-Überblick zu den UIA-Komponenten und deren Zusammenspiel. Anhand eines Use Cases wird vorgestellt, wie im "UIA Data Reservoir" einerseits kostengünstig aktuelle Daten "as is" in einem Hadoop File System (HDFS) und andererseits veredelte Daten in einem Oracle 12c Data Warehouse miteinander kombiniert oder auch per Direktzugriff in Oracle Business Intelligence ausgewertet bzw. mit Endeca Information Discovery auf neue Zusammenhänge untersucht werden.
The other Apache Technologies your Big Data solution needsgagravarr
The document discusses many Apache projects relevant to big data solutions, including projects for loading and querying data like Pig and Gora, building MapReduce jobs like Avro and Thrift, cloud computing with LibCloud and DeltaCloud, and extracting information from unstructured data with Tika, UIMA, OpenNLP, and cTakes. It also mentions utility projects like Chemistry, JMeter, Commons, and ManifoldCF.
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment. It allows for the ability to store and use all types of data at scale using commodity hardware.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
This document outlines the modules and topics covered in an Edureka course on Hadoop. The 10 modules cover understanding Big Data and Hadoop architecture, Hadoop cluster configuration, MapReduce framework, Pig, Hive, HBase, Hadoop 2.0 features, and Apache Oozie. Interactive questions are also included to test understanding of concepts like Hadoop core components, HDFS architecture, and MapReduce job execution.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
2. Introduction
Eduard Erwee
Data Soil Ltd (www.datasoil.uk)
Background
Working with Microsoft data products over 20 years
MCSD VB6, SQL Server 7
5 years as Microsoft Certified Trainer
4 years as SQL Server PFE, Reading – UK
Today: cleaning data toilets for the highest bidder
No Linux / No Big Data (until 9 months ago)
3. Agenda
A) What is Big data?
i) Origins
ii) Technologies & Terminologies
iii) The Players
B) How is Big Data Different?
i) Philosophies
C) How to ride the Elephant?
i) All about the tools
ii) Sources of Inspiration
D) BIG to the Future!
i) Current Common Use-cases
ii) Future Opportunities
E) Summary
F) Conclusion
G) Q&A
4. What is Big data?
i) Origins
Nutch-to-Google-to-Yahoo and beyond
Apache Who??
ii) Technologies & Terminologies
Core Hadoop
Hive
HCatalog
Pig
Sqoop
Oozie
HUE (flavours-of)
Mahout
Loads of others
Ha-dump!
iii) The Players
The Big 3
One to Watch : Cascading & Lingual
5. i) Origins
Nutch-to-Google-to-Yahoo and beyond
Apache Who??
6. Nutch-to-Google-to-Yahoo and beyond
Timeline, 2002–2009:
Doug Cutting & Mike Cafarella start working on Nutch (an open-source web search engine based on Lucene and Java)
Google publishes the GFS and MapReduce papers
Cutting adds DFS & MapReduce support to Nutch
Yahoo! hires Cutting; Hadoop spins off from Nutch (named after Cutting's son's toy elephant)
NY Times converts 4 TB of archives on 100 EC2 instances
Web-scale deployments at Yahoo!, Facebook, Last.fm
April: Y! does the fastest TB sort, 3.5 min over 910 nodes
May: Y! fastest TB sort, 62 seconds over 1460 nodes
May: Y! sorts a PB, 16.25 hours over 3658 nodes
Today Hadoop is an Apache top-level project
History ->Appendix **1
7. Apache Who??
The Apache Software Foundation (http://www.apache.org/)
The ASF is made up of nearly 150 Top Level Projects (Big Data and more)
Most of the Hadoop components we will discuss
All trademarks mentioned herein belong to their respective owners
9. Core Hadoop
Hadoop Common:
The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™):
A distributed file system that provides high-throughput access to application data.
Images ->Appendix **2
11. Core Hadoop
Hadoop MapReduce (continued):
MapReduce v2
A YARN-based system for parallel processing of large data sets.
(Apache Tez is a related YARN-based execution framework that can speed up these workloads)
Hadoop YARN (Yet Another Resource Negotiator):
A framework for job scheduling and cluster resource management.
Image ->Appendix *5
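The map / shuffle / reduce phases described above can be sketched in a few lines of plain Python (an illustration of the programming model only, not Hadoop itself; the function names are ours):

```python
# A minimal sketch of the MapReduce programming model in plain Python.
# It only illustrates the map -> shuffle -> reduce phases; real Hadoop
# distributes each phase across a cluster.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop Mapper.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, like Hadoop's shuffle/sort between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one key, like a Hadoop Reducer.
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"], counts["data"])  # 2 2
```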
12. HUE (flavours-of)
Hue aggregates the most common Apache Hadoop components into a single UI.
"Just use" Hadoop through a web-based interface, without worrying about the command line.
13. Hive
Manages large datasets residing in HDFS.
Provides a mechanism to query the data using a SQL-like language called HiveQL.
Runs in HUE
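To get a feel for HiveQL, here is the flavour of query Hive runs, sketched against Python's built-in sqlite3 as a stand-in engine (the weblogs table and its columns are invented for the example; real Hive would execute this over files in HDFS via the Hive CLI, Beeline, or HUE):

```python
# Illustration only: the SELECT below is the flavour of query HiveQL
# uses. sqlite3 stands in for Hive here -- Hive itself runs such
# queries as MapReduce jobs over HDFS data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (url TEXT, status INTEGER)")
conn.executemany("INSERT INTO weblogs VALUES (?, ?)",
                 [("/home", 200), ("/home", 404), ("/about", 200)])

# A typical HiveQL-style aggregation: hits per URL.
rows = conn.execute(
    "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/about', 1), ('/home', 2)]
```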
14. HCatalog
Built on top of the Hive metastore and incorporates Hive's DDL
HCatalog's table abstraction presents a relational view of data in HDFS
Removes the need to worry about what format the data is stored in
For me, very similar to a set of views in SQL Server over staging feeds
Exposed to Pig / MapReduce / Hive
Runs in HUE
Image ->Appendix *5
16. Pig
Pig is a high-level platform for creating MapReduce programs.
The programming language is called Pig Latin
An optimizer compiles Pig Latin into optimized Java MapReduce jobs.
Similar to M in Power Query
It's the VB.NET vs C++ debate all over again.
Structure
Hive requires data to be more structured
Pig allows you to work with unstructured data.
Compatible with HCatalog
Runs in HUE
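A Pig Latin script is a dataflow of named steps (LOAD, FILTER, GROUP, FOREACH). A rough feel for that shape, mirrored step-for-step in plain Python (toy data and invented field names; this is not output generated by Pig):

```python
# A Pig Latin pipeline like:
#   logs   = LOAD 'logs' AS (user, status);
#   errors = FILTER logs BY status >= 500;
#   by_usr = GROUP errors BY user;
#   counts = FOREACH by_usr GENERATE group, COUNT(errors);
# sketched in plain Python for illustration.
from collections import Counter

logs = [("alice", 200), ("bob", 500), ("alice", 503), ("bob", 200)]

errors = [row for row in logs if row[1] >= 500]   # FILTER
counts = Counter(user for user, _ in errors)      # GROUP + COUNT
print(dict(counts))  # {'bob': 1, 'alice': 1}
```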
17. Sqoop
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Runs in HUE
18. Oozie
Workflow scheduler system to manage Apache Hadoop jobs.
Oozie Coordinator jobs: recurrent Oozie workflows triggered by time (frequency) or data availability.
Integrated with the rest of the Hadoop stack
Scalable, reliable and extensible system.
Available in HUE
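The trigger idea can be shown with a toy model: a coordinator only fires its workflow once the input dataset is available (real Oozie coordinators are defined in XML; the function and paths below are invented for illustration):

```python
# Toy model of an Oozie-style coordinator: a workflow is only triggered
# once its input dataset has landed. Real Oozie expresses this in XML
# coordinator definitions; this just shows the idea.
def coordinator(available_datasets, required_dataset):
    if required_dataset in available_datasets:
        return f"run workflow for {required_dataset}"
    return "wait"

hdfs = {"/data/2014-06-01", "/data/2014-06-02"}
print(coordinator(hdfs, "/data/2014-06-02"))  # run workflow for /data/2014-06-02
print(coordinator(hdfs, "/data/2014-06-03"))  # wait
```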
19. Mahout
Goal: a scalable machine learning library.
Examples of Mahout use cases:
Recommendation mining
Mahout ->Appendix **4
Takes users' behaviour and from that tries to find items users might like (e.g. Netflix).
Clustering
Groups documents, web pages and articles based on the topics they contain and their related documents.
The most common use is search engines, which cluster pages based on keywords, page links, etc.
Classification
Based on prior categorization of documents, evaluates new documents and determines the best categories:
Filter new mail into the INBOX
Auto-organize new content
Flag potential spam comments.
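The co-occurrence idea behind recommendation mining can be shown in miniature (toy data and invented names; Mahout's actual algorithms are far more sophisticated):

```python
# Toy item-based recommendation by co-occurrence: suggest items that
# other users who share your items also have. Only the basic idea of
# recommendation mining, not Mahout's implementation.
from collections import Counter

baskets = {
    "u1": {"film_a", "film_b"},
    "u2": {"film_a", "film_b", "film_c"},
    "u3": {"film_b", "film_c"},
}

def recommend(user):
    mine = baskets[user]
    seen_with = Counter()
    for other, items in baskets.items():
        if other != user and mine & items:   # shares something with us
            seen_with.update(items - mine)   # count items we lack
    return [item for item, _ in seen_with.most_common()]

print(recommend("u1"))  # ['film_c']
```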
20. Loads of others
22. iii) The Players
The Big 3
One to Watch : Cascading & Lingual
23. The Big 3
Hortonworks claims to be the only fully open-source distribution.
Cloudera is close on their heels, with everything based on open source, but it has some additional maintenance and installation functionality that is proprietary.
MapR, on the other hand, rewrote the storage engine from scratch to improve performance, at the cost of being vendor-specific.
My opinion?
Benchmarking -- Altoros
Altoros did some significant benchmarking between the 3, and it can be found here:
24. One To Watch: Cascading & Lingual
Developed by Chris Wensel & team at Concurrent:
http://www.concurrentinc.com/
Cascading is a development platform for building data applications on Hadoop
Developed on top of Cascading:
Lingual: simplifies systems integration -- ANSI SQL compatibility -- JDBC driver
Pattern: machine learning scoring algorithms through PMML compatibility
Scalding: enables development with Scala, a powerful language for solving functional problems
Cascalog: enables development with Clojure, a Lisp dialect
Driven: understand data usage + accelerate Cascading application development and management
25. Driven -- Visualize Development of Flows
Like SSMS Execution Plans
Breaks up Query
Shows Data flow
Drill down ….
26. Driven -- Application Insights
Drill down into steps
Execution Time
Bottle-necks
Resource usage
27. Why Watch: Cascading & Lingual?
All 3 Big Data platform vendors mentioned before support Cascading integration and are investing in ensuring continued support for Cascading on their own platforms.
Used by
A single platform to develop code on that evolves with the changing big data landscape.
Single JAR deployment.
ANSI-92 interface via JDBC for moving data between systems / platforms
All open source (no vendor lock-in)
Data Soil is contributing to developing the SQL Server plug-in for Cascading & Lingual.
(see our blogs for getting into Cascading using Microsoft technologies)
28. B) How is Big Data Different?
Philosophies
Current Architecture vs Schema-On-Read
S-O-R : Advantages & Disadvantages
Integration with SQL Server & Windows
29. Current Architecture vs Schema-On-Read
Current BI Architecture:
1. Get business requirements and prioritize
2. Find / collect all relevant data sources
3. Normalize / copy to staging / create structures / schemas / ETL
4. Create warehouse / cube
5. Start answering questions
Big Data BI Architecture:
1. Get business requirements and prioritize
2. All data is already in the Ha-dump
3. Create schema for question 1 / ETL
4. Send processing instructions to the data
5. Answer question 1 {& repeat}
30. S-O-R: Advantages & Disadvantages
Advantages
Store first, ask questions later
Storage is cheap compared to a high-availability SAN
Format-agnostic, as no pre-normalization / conversion is required
All data is available in a central place
A high degree of parallel processing speeds up large batch jobs
Possible to start answering business questions quicker
Disadvantages
New skillsets & training required
Company may not support the new software stack
Creating new schemas for proprietary data can be difficult
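Schema-on-read in miniature: the raw lines are stored untouched and a structure is imposed only when a question is asked (the field names and delimiter below are invented for this sketch):

```python
# Schema-on-read in miniature: raw lines are stored as-is, and a
# schema is applied only at query time -- a different question could
# parse the same bytes a different way.
raw_lines = [
    "2014-06-01|/home|200",
    "2014-06-01|/about|404",
]

def read_with_schema(line):
    # The "schema" lives in the reader, not in the storage layer.
    date, url, status = line.split("|")
    return {"date": date, "url": url, "status": int(status)}

records = [read_with_schema(line) for line in raw_lines]
errors = [r["url"] for r in records if r["status"] >= 400]
print(errors)  # ['/about']
```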
31. Integration with SQL Server & Windows
ODBC
Hortonworks / Cloudera / MAPR all have supported ODBC drivers
Create Linked Servers directly from SQL Server
SSIS integration
Pull Data directly into Excel (see Hortonworks Sandbox)
JDBC & Other
Tableau / SQuirreL SQL / Revolution R / Business Objects etc.
Other ETL Tools
Talend (to be discussed later)
Local Install
Hortonworks Data Platform (HDP)
HDInsight Emulator
32. C) How to ride the Elephant?
i) All about the tools
Local VM platform providers
Online platform providers
Vagrant
Talend
Reuse of old machines
ii) Sources of Inspiration
Sandboxes
The Apache Software Foundation
Github
33. i) All about the tools
Local VM platform providers
Online platform providers
Vagrant
Talend
Pet Project : Reuse of old machines
34. Local VM platform providers
Hyper-V (Microsoft)
Windows Server
Windows 8.1
VMWARE
VMWARE Server Products
Workstation - On Windows
Personally, I absolutely LOVE Workstation 10.0
Fusion - On Mac
Virtual Box (Oracle)
Runs on EVERYTHING
Close second favourite
Integrates extremely well with Vagrant (to be discussed)
35. Online platform providers
Azure & Big Data
HD-Insight (Based on Hortonworks HDP platform)
Real World Big Data (SQL-Bits Session)
Adam Jorgensen / John Welch
Restored my confidence in MS Big Data Cloud Solutions
Amazon Cloud (AWS)
EC2
Host of supporting services
36. Vagrant
Vagrant provides easy-to-configure, reproducible, and portable work environments built on industry standards.
Spins up / hibernates / destroys complex development environments with one line of code
Supports VirtualBox / VMware / Docker / Hyper-V / custom providers
Ability to spin up environments locally or directly on Amazon EC2
37. Talend
Enterprise-grade development environment for creating data integration across just about anything.
Talend Open Studio for Big Data
BASIC - Free
Eclipse-Based Tooling
Hadoop 2.0 and YARN Support
Big Data ETL and ELT
HDFS, HBase, HCatalog, Hive, Pig, Sqoop Components
Job Designer
Apache License 2.0
Broadest NoSQL Support
Fully Open Source
http://www.talend.com/download
40. Talend
Supported database & data source connectivity:
Amazon RDS, Amazon Redshift, Amazon S3, AS400, DB2, Derby DB, Exasol, eXist-db, Firebird, Google Storage, Greenplum, H2, Hive, HSQLDB, Informix, Ingres, InterBase, JavaDB, JDBC, MaxDB, Microsoft OLE-DB, Microsoft SQL Server, MySQL, Netezza, Oracle, ParAccel, PostgreSQL, Postgres Plus, SAS, SQLite, Sybase, Teradata, VectorWise, Vertica, Windows Azure Blob Storage
41. Pet project : Reuse of old machines
Challenge your manager
If you can build a cluster from your old desktops that will outperform his current development server, he has to give you a raise!
You’d be surprised what you can do with a pile of these!
42. ii) Sources of Inspiration
Sandboxes
The Apache Software Foundation
Github
43. Sandboxes
All three of the Big Data players have pre-built sandboxes you can download and experiment with:
Hortonworks
Current version: 2.1
Supports: VirtualBox / VMware / Hyper-V
Cloudera
Current version: CDH 5.0.x
Cloudera Live online (beta)
Supports: VirtualBox / VMware / Linux KVM (Kernel-based Virtual Machine)
MapR
Supports: VirtualBox / VMware
Cascading & Lingual
A Vagrant image that spins up a 4-node cluster, via GitHub
Supports: VirtualBox
44. The Apache Software Foundation
Want to know about BIG future technologies?
Apache Incubator – (http://incubator.apache.org/)
Tez: speeds up MapReduce-style processing
Storm: high-performance realtime computation system
Optiq: SQL interface & advanced query optimization for non-RDBMS systems
Falcon: quickly onboard data and its associated processing & management tasks on Hadoop clusters
45. Github
GitHub is a web-based hosting service based on Git.
Git is a distributed revision control and source code management (SCM) system, initially designed and developed by Linus Torvalds for Linux kernel development.
Great source of Vagrant-based VMs
Cascading & Lingual cluster (get Vagrant & VirtualBox):
https://github.com/Cascading/vagrant-cascading-hadoop-cluster
46. D) BIG to the Future!
i) Current Common Use-cases
ii) Future Opportunities
47. i) Current Common Use-cases
Sentiment (Twitter feeds / WordPress scrapes / Facebook likes)
Natural Language Processing: Stanford (http://nlp.stanford.edu:8080/sentiment/rntnDemo.html)
Recommendation engines using Mahout / other (Netflix)
Anti-Money Laundering ??
Live transaction monitoring – not that big for some reason; graph databases seem to be doing better here.
49. Sensors
These days, sensors can be installed everywhere to monitor all aspects of life / business:
Temperature Sensors
Pressure Sensors
Gas Sensors
Smoke Sensors
A better understanding of day to day happenings can save money and lives.
50. Self-Contained Clusters
Met these guys at the Hadoop Summit in Amsterdam 2014
(http://bigboards.io/)
5 data processing nodes
20 CPU cores and 5TB of raw storage
1 Gb Ethernet to interlink everything
1 management console with technology and data library
54. E) Summary
Big Data does not replace the random-read and reporting capabilities of SQL Server.
Big Data is not close to replacing the trusted, high-volume, transaction-safe OLTP frameworks we have built.
Big Data opens up opportunities for storing and processing data at a larger scale than we could ever have dreamed of before.
55. F) Conclusion
THE FUTURE is not going to be won by one OR the other …
…but by a combination of BOTH!
58. Appendix : References
**1) Hadoop: Distributed Data Processing [Amr Awadallah]
http://www.slideshare.net/cloudera/hadoop-distributed-data-processing
**2) Hadoop [K Subrahmanyam]
http://www.authorstream.com/Presentation/aSGuest129127-1356869-techseminar-on-hadoop-ppt/
**3) An Introduction to Apache Hadoop MapReduce [Mike Frampton]
http://www.powershow.com/view/3fdd1b-MGRkZ/An_Introduction_to_Apache_Hadoop_MapReduce_powerpoint_ppt_presentation
**4) Mahout Explained in 5 Minutes or Less [Josh Gertzen]
http://blog.credera.com/technology-insights/java/mahout-explained-5-minutes-less/
**5) What is Apache Tez? [Roopesh Shenoy]
http://www.infoq.com/articles/apache-tez-saha-murthy
59. Thank you – COPY OF SLIDES ON WEB!
Eduard Erwee
Data Soil Ltd
E-mail : eduard.erwee@datasoil.uk
Web Site : www.datasoil.uk
Blog : blog.datasoil.uk
Twitter : @datasoil
Facebook : www.facebook.com/datasoil
Please remember to complete the feedback form online
http://www.sqlbits.com/SQLBitsXIISaturday