The document discusses Syoncloud, a consulting company specializing in big data analytics and integration. It provides examples of common data sources that can be processed using big data solutions, such as documents, databases, e-mails, sensor data, and social media. It then discusses when a NoSQL or big data solution is needed over a relational database, and outlines some of the key components of the Apache Hadoop batch processing infrastructure, including Apache Avro for data serialization, Apache Pig for writing Map/Reduce scripts, and Apache Hive for SQL-like queries.
Webinar: Selecting the Right SQL-on-Hadoop Solution (MapR Technologies)
In the crowded SQL-on-Hadoop market, choosing the right solution for your business can be difficult. In this webinar, learn firsthand from Rick van der Lans, independent analyst and managing director of R20/Consultancy, how to sort through this market complexity and what tough questions to ask when evaluating prospective SQL-on-Hadoop solutions.
Introduction to Apache Hadoop. Covers Hadoop from v1.0 with HDFS and MapReduce through v2.0, including Impala, YARN, Tez and the entire arsenal of projects for Apache Hadoop.
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop (Hortonworks)
Real-time monitoring requires a highly scalable infrastructure comprising a message bus, a database, distributed event processing and a scalable analytics engine. By bringing together the leading open source projects Apache Kafka, Apache HBase, Apache Storm and Apache Hive, the Hortonworks Data Platform offers a comprehensive real-time analysis platform. In this session, we will provide an in-depth overview of all the key technology components and demonstrate a working solution for monitoring a fleet of trucks.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=0278dc8aa49a9991e1ce436c71f53d30
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Ingesting Data at Blazing Speed Using Apache ORC (DataWorks Summit)
Big SQL is a SQL engine for Hadoop that excels at performance and scalability at high concurrency. Big SQL complements and integrates with Apache Hive for both data and metadata. An architecture that separates compute from storage allows Big SQL to support multiple open data formats natively. Until recently, Parquet provided a significant performance advantage over other data formats for SQL on Hadoop. The landscape changed when ORC became a top-level Apache project independent from Hive. Gone were the days of reading ORC files using slow, single-row-at-a-time Hive SerDes. The new vectorized APIs in the Apache ORC libraries make it possible to ingest ORC data at blazing speed. This talk is about the journey leading to ORC taking the crown of best-performing data format for Big SQL away from Parquet. We'll have a look under the hood at the architecture of Big SQL ORC readers, and how to tune them. We'll share lessons learned in walking the fine line between maximizing performance at scale and avoiding dreaded Java OOMs. You'll learn the techniques that SQL engines use for fast data ingestion, so that you can leverage the full potential of Apache ORC in any application.
Speaker:
Gustavo Arocena, Big Data Architect, IBM
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real-time with change data capture (CDC) technology
• Stream SAP data into Kafka with Attunity Replicate for SAP
Speakers:
John Hol, Regional Director, Attunity
Mike Hollobon, Director Business Development, IBT
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
This is the presentation from the "Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS" webinar on May 28, 2014. Rohit Bahkshi, a senior product manager at Hortonworks, and Vinod Vavilapalli, PMC for Apache Hadoop, discuss an overview of YARN in HDFS and new features in HDP 2.1. Those new features include: HDFS extended ACLs, HTTPS wire encryption, HDFS DataNode caching, resource manager high availability, application timeline server, and capacity scheduler pre-emption.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
This is an in-depth look at the future of data warehouses and how SQL-on-Hadoop technologies play a pivotal role in those settings.
Matt Aslett, Research Director for 451 Research, is joined by Apache Drill architect Jacques Nadeau to share what lies ahead for enterprise data warehouse architects and BI users in 2015 and beyond.
SQL on Hadoop
Looking for the right tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how do you select the right one?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop (Hortonworks)
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop (Hortonworks)
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to find out.
Red Hat in Financial Services - Presentation at Hortonworks Booth - Strata 2014 (Hortonworks)
This presentation gives an overview of how Hortonworks and Red Hat have collaborated to provide the financial services industry with Big Data solutions.
Introduction to the Hortonworks YARN Ready Program (Hortonworks)
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
--the program and its benefits
--why it is important to customers
--tools and guides to help you get started
--technical resources to support you
--marketing recognition you can leverage
Stinger.Next by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1,600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the speed, scale, and SQL compliance of Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever-growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver Speed, Scale and better SQL.
Hadoop and Internet of Things presentation from the Sinergija 2014 conference, held in Belgrade in October 2014. How rising data resources change business, and how Big Data technologies combined with Internet of Things devices can help to improve business and everyday life. Hadoop is already the most significant technology for working with Big Data. Microsoft is playing a very important role in this field, with the Stinger initiative. The main goal is to bring enterprise SQL to Hadoop scale.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 (StampedeCon)
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:
1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.
2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.
3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful & timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and balance the pros, cons and gaps of each.
4) Security & Access Controls. Solving these challenges is key for adoption in regulatory-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.
5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating analytics workflows and tools on converged infrastructure, with shared data, and building "As A Service" architectures oriented towards self-service data exploration and analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.
This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
Data Analytics Meetup: Introduction to Azure Data Lake Storage (CCG)
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen 1 with Microsoft Data Platform Specialist, Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
How to Radically Simplify Your Business Data Management (Clusterpoint)
Relational databases were designed around a tabular data storage model. This requires complex software: schemas, encoded data, inflexible relations, sophisticated indexes. The complexity of your IT systems increases many-fold over your database's lifetime, and so do your costs. Yet, we have a solution for this.
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache HBase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases
Azure Cafe Marketplace with Hortonworks, March 31 2016 (Joan Novino)
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
The Transformation of your Data in modern IT (Presented by DellEMC) (Cloudera, Inc.)
Organizations have a wealth of data contained within their existing infrastructures. At DellEMC we’re helping customers remove the barriers of legacy datastores and transforming the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse, and leverage new techniques, applications and technology to enhance the financial impact of all your data sources.
2. Ladislav Urban, CEO of Syoncloud.
Syoncloud is a consulting company specializing in Big Data analytics and integration of existing systems.
WWW.SYONCLOUD.COM E-MAIL : INFO@SYONCLOUD.COM MOBILE : 077 9664 6474
3. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Documents
Existing relational databases (CRM, ERP, Accounting, Billing)
E-mails and attachments
Imaging data (graphs, technical plans)
Sensor or device data
Internet search indexing
Log files
Social media
4. CURRENT SOURCES OF DATA TO BE PROCESSED AND UTILIZED
Telephone conversations
Videos
Pictures
Clickstreams (clicks from users on web pages)
5. SCALE OF THE DATA
6. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
If relational databases do not scale to your traffic needs
If the normalized schema of your relational database has become too complex
If your business applications generate lots of supporting and temporary data
If the database schema is already denormalized in order to improve response times
If joins in relational databases slow the system down to a crawl
7. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
When we try to map complex hierarchical documents to database tables
When documents from different sources require a flexible schema
When more data beats clever algorithms
When flexibility is required for analytics
When we need queries for values at a specific time in history
When we need to utilize outputs from many existing systems
8. WHEN DO WE NEED A NOSQL / BIG DATA SOLUTION?
To analyze unstructured data such as documents and log files, or semi-structured data such as CSV files and forms
9. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
The SQL language: well known, standardized and based on strong mathematical theories
Database schemas that do not need to be modified during production
A good fit when massive scalability is not required
Mature security features: role-based security, encrypted communications, row and field access control
Full support for ACID transactions (atomicity, consistency, isolation, durability)
10. WHAT ARE THE STRONG POINTS OF RELATIONAL DATABASES?
Support for backup and rollback in case of data loss or corruption
Relational databases have development, tuning and monitoring tools with good GUIs
11. Batch vs Real-time Processing
Batch processing is used when real-time processing is not required, not possible, or too expensive. Typical uses:
Conversion of unstructured data such as text files and log files into more structured records
Transformation during ETL
Ad-hoc analysis of data
Data analytics applications and reporting
13. BATCH PROCESSING INFRASTRUCTURE
Batch processing systems utilize the Map/Reduce and HDFS implementations in Apache Hadoop.
It is possible to develop batch processing applications in Java using only Hadoop, but we should mention other important systems and how they fit into the Hadoop infrastructure.
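To make concrete what "using only Hadoop" means, here is a minimal sketch of a plain Java Map/Reduce job, modeled on the classic word-count example from the Hadoop documentation; the class name and the input/output paths are illustrative only.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this trivial job takes dozens of lines of Java, which is why the higher-level tools below exist.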
14. APACHE AVRO
In order to process data we need information about data types and data schemas.
This information is used for serialization and deserialization in RPC communications, as well as for reading from and writing to files.
15. APACHE AVRO
An RPC and serialization system that supports rich data structures
It uses JSON to define data types and protocols
It serializes data in a compact binary format
Avro supports schema evolution
Avro will handle missing/extra/modified fields
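As an illustration, here is a minimal sketch of an Avro schema defined in JSON; the record, namespace and field names are hypothetical. The optional "host" field with a default value shows how schema evolution lets a reader cope with records written before the field existed.

{
  "type": "record",
  "name": "LogEvent",
  "namespace": "com.example.logs",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "host", "type": ["null", "string"], "default": null}
  ]
}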
16. SCRIPT LANGUAGE FOR MAP/REDUCE
We need a quick and simple way to create Map/Reduce transformations, analyses and applications.
We need a scripting language that can be used in scripts as well as interactively on the command line.
18. APACHE PIG
A high-level procedural language for querying large semi-structured data sets using Hadoop and the Map/Reduce platform
Pig simplifies the use of Hadoop by allowing SQL-like queries to run on a distributed dataset.
19. APACHE PIG
An example of filtering a log file for warning messages only, which will run in parallel on a large cluster.
The script below is automatically transformed into a Map/Reduce program and distributed across the Hadoop cluster.
messages = LOAD '/var/log/messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
DUMP warns;
20. APACHE PIG
Relational operators that can be used in Pig:
FILTER - Select a set of tuples from a relation based on a condition.
FOREACH - Iterate over the tuples of a relation, generating a data transformation.
GROUP - Group the data in one or more relations.
JOIN - Join two or more relations (inner or outer join).
LOAD - Load data from the file system.
ORDER - Sort a relation based on one or more fields.
SPLIT - Partition a relation into two or more relations.
STORE - Store data in the file system.
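A short sketch combining several of these operators; the input path and the tab-separated field layout are hypothetical:

-- count WARN messages per day, assuming tab-separated (day, level, msg) records
logs = LOAD '/var/log/messages' USING PigStorage('\t') AS (day:chararray, level:chararray, msg:chararray);
warns = FILTER logs BY level == 'WARN';
by_day = GROUP warns BY day;
counts = FOREACH by_day GENERATE group AS day, COUNT(warns) AS warn_count;
sorted = ORDER counts BY warn_count DESC;
STORE sorted INTO '/output/warn_counts';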
21. What if we want to use SQL to create Map/Reduce jobs?
Apache Hive is a data warehousing infrastructure based on Hadoop.
It provides a query language called HiveQL, which is based on SQL.
22. APACHE HIVE
Hive functions: data summarization, query and analysis.
It uses a system catalog called the Hive Metastore.
Hive is not designed for OLTP or real-time queries.
It is best used for batch jobs over large sets of append-only data.
24. The HiveQL language supports the ability to:
Filter rows from a table using a where clause.
Select certain columns from the table using a select clause.
Do equi-joins between two tables.
Evaluate aggregations on multiple "group by" columns for the data stored in a table.
Store the results of a query into another table.
Download the contents of a table to a local (NFS) directory.
25. The HiveQL language supports the ability to:
Store the results of a query in an HDFS directory.
Manage tables and partitions (create, drop and alter).
Plug in custom scripts in the language of choice for custom map/reduce jobs.
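A brief sketch of these capabilities in HiveQL; the table and column names are hypothetical:

-- create a partitioned table
CREATE TABLE orders (order_id INT, customer STRING, total DOUBLE)
PARTITIONED BY (order_date STRING);

-- filter, aggregate with GROUP BY, and store the result into another table
-- (customer_totals is assumed to already exist)
INSERT OVERWRITE TABLE customer_totals
SELECT customer, SUM(total)
FROM orders
WHERE order_date = '2013-01-01'
GROUP BY customer;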
26. APACHE OOZIE
Map/Reduce jobs, Pig scripts and Hive queries should be simple and single-purposed.
How can we create complex ETL or data analysis in Hadoop?
We chain scripts so that the output of one script is the input for another.
Complex workflows that represent real-world scenarios need a workflow engine such as Apache Oozie.
27. APACHE OOZIE
Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce jobs, Pig jobs and others.
An Oozie workflow is a collection of actions arranged in a DAG (Directed Acyclic Graph).
This means that the second action cannot run until the first one is completed.
Oozie workflow definitions are written in hPDL (an XML process definition language similar to JBoss jBPM jPDL).
28. APACHE OOZIE
Workflow actions start jobs in the Hadoop cluster. Upon action completion, Hadoop calls back to Oozie to notify it that the action has completed; at this point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes (start, end, fail, decision, fork and join) and action nodes (the actual jobs).
Workflows can be parameterized (using variables like ${inputDir} within the workflow definition)
29. Example of OOZIE workflow definition
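The XML listing on the original slide did not survive the transcript. As a substitute, here is a minimal sketch of what an hPDL workflow definition looks like, with a single Pig action and hypothetical workflow, action and script names:

<workflow-app name="log-etl" xmlns="uri:oozie:workflow:0.1">
  <start to="filter-warns"/>
  <action name="filter-warns">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>filter_warns.pig</script>
      <param>inputDir=${inputDir}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>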
33. APACHE SQOOP
Apache Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases or data warehouses.
It can be used to populate tables in Hive and HBase.
Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector-based architecture which supports plugins that provide connectivity to external systems.
34. APACHE SQOOP
Sqoop includes connectors for databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2, as well as a generic JDBC connector.
The transferred dataset is sliced into partitions and a map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
Sqoop uses the database metadata to infer data types
35. Apache Sqoop – Import to HDFS
36. APACHE SQOOP
A Sqoop example that imports data from the MySQL ORDERS table into a Hive table running on Hadoop:
sqoop import --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --hive-import
Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition.
37. Apache Sqoop – Export to Database
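The original slide shows a diagram of the export flow. For completeness, a sketch of the corresponding command, mirroring the import example above; the export directory is hypothetical:

sqoop export --connect jdbc:mysql://localhost/acmedb --table ORDERS --username test --password **** --export-dir /user/hive/warehouse/orders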
38. APACHE FLUME
A distributed system to reliably collect, aggregate and move large amounts of log data from many different sources to a centralized data store.
40. APACHE FLUME
A Flume Source consumes events delivered to it by an external source like a web server.
When a Flume Source receives an event, it stores it into one or more Channels.
The Channel is a passive store that keeps the event until it is consumed by a Flume Sink.
The Sink removes the event from the Channel and puts it into an external repository like HDFS.
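Flume agents are wired together in a properties file. A minimal sketch of a single agent with one source, channel and sink; the agent and component names are hypothetical:

# name the components of this agent
agent.sources = weblog
agent.channels = mem
agent.sinks = hdfs-out

# source: receive Avro events from an upstream tier
agent.sources.weblog.type = avro
agent.sources.weblog.bind = 0.0.0.0
agent.sources.weblog.port = 41414
agent.sources.weblog.channels = mem

# channel: passive store between source and sink
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# sink: deliver events to HDFS
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent.sinks.hdfs-out.channel = mem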
41. APACHE FLUME FEATURES
It allows you to build multi-hop flows where events travel through multiple agents before reaching the final destination.
It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
Flume uses a transactional approach to guarantee reliable delivery of events.
Events are staged in the channel, which manages recovery from failure.
Flume supports log stream types such as Avro, Syslog and Netcat.
42. DISTCP - DISTRIBUTED COPY
DistCp (distributed copy) is a tool used for large inter- and intra-cluster copying.
It uses Map/Reduce for its distribution, error handling, recovery and reporting.
It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
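Typical usage copies a directory tree from one cluster to another, for example (the NameNode addresses and paths are hypothetical):

hadoop distcp hdfs://nn1:8020/source/dir hdfs://nn2:8020/dest/dir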