TURKCELL INTERNAL

Hadoop
Big Data

• Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
  (TB, records, transactions, tables, files)
• Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
  (Batch, near-time, real-time, streams)
• Variety: Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more.
  (Structured, unstructured, semi-structured)
• Verification: With all the big data there will be bad data, and with diverse data there will be more diverse quality and security levels of users.
  (Good, undefined, bad, inconsistency, incompleteness, ambiguity)
• Value

Big Data – Data Sources
Big Data – Data Growth
Hadoop Characteristics
• Open Source
• Distributed data replication
• Commodity hardware
• Data and analysis co-location
• Scalability
• Reliable error handling
Hadoop Storyline

• 2003: Google published GFS & MapReduce paper
• 2006: Apache Hadoop project started for Yahoo requirements
• 2008: Cloudera founded
• 2009: First commercial Hadoop distribution released; enterprise support is available
• 2011: Hortonworks founded
• 2012: Ecosystem reaches 300 companies

Hadoop for Enterprise
RDBMS vs. Hadoop

             RDBMS                           Hadoop
Data Size    Terabytes                       Petabytes
Schema       Required on write               Required on read
Speed        Reads are fast                  Writes are fast
Access       Interactive and batch           Batch
Updates      Write and read many times       Write once, read many times
Scaling      Scale up                        Scale out
Data Types   Structured                      Multi-structured and unstructured
Integrity    High                            Low
Best Use     Interactive OLAP analytics      Data discovery
             Complex ACID transactions       Processing unstructured data
             Operational data store          Massive storage/processing

Benefits of Analysing with Hadoop
• Analysis that was previously impossible or impractical
• Analysis conducted at lower cost
• Greater flexibility

Big Data & Hadoop in Turkcell
• Processing "Big Data" since 2009 with Cirrus
• Hadoop has been in production since December 2012
• ~4.5B records / ~3.5 TB of data are processed with Cirrus
• Data is not stored for future analysis
• Cloudera Distribution for Hadoop (unsupported)
• 5 x 24-core machines with SAN storage (not the reference architecture)

Common Hadoop-able Problems
• Modeling True Risk
• Customer Churn Analysis
• Recommendation Engine
• Ad Targeting
• POS Transaction Analysis
• Analyze Network Data to Predict Failure
• Threat Analysis
• Search Quality
• Data ‘Sandbox’
Modeling True Risk
• Source, parse and aggregate disparate data sources to build a comprehensive data picture
  • E.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
  • Sentiment analysis, graph creation, pattern recognition
• Typical Industry
  • Financial Services (Banks, Insurance)

Customer Churn Analysis
• Rapidly test and build a behavioral model of the customer from disparate sources
• Structure and analyze with Hadoop
  • Traversing
  • Graph creation
  • Pattern recognition
• Typical Industry
  • Telecommunications, Financial Services

Recommendation Engine
• Batch processing framework
  • Allows execution in parallel over large datasets
• Collaborative filtering
  • Collecting ‘taste’ information from many users
  • Utilizing that information to predict what similar users like
• Typical Industry
  • Ecommerce, Manufacturing, Retail
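The collaborative filtering idea above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the batch implementation the slide refers to; the `likes` data, the choice of Jaccard similarity, and the function names are assumptions made for the example.

```python
def jaccard(a, b):
    """Similarity of two users' item sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, likes):
    """Rank items the target has not seen by the similarity of the users who like them."""
    seen = likes[target]
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        sim = jaccard(seen, items)
        if sim == 0:
            continue  # users with no overlap in taste contribute nothing
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical 'taste' data collected from three users.
likes = {
    "ann": {"phone", "case", "charger"},
    "bob": {"phone", "case", "headset"},
    "cem": {"tv", "soundbar"},
}
suggestions = recommend("ann", likes)
```

Here "ann" and "bob" share two items, so the one item only "bob" likes is suggested to "ann"; "cem" has nothing in common with her and is ignored.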
Ad Targeting
• Data analysis can be conducted in parallel, reducing processing times from days to hours
• With Hadoop, as data volumes grow the only expansion cost is hardware
• Add more nodes without degradation in performance
• Typical Industry
  • Advertising

Point of Sale Transaction Analysis
• Batch processing framework
  • Allows execution in parallel over large datasets
• Pattern recognition
  • Optimizing over multiple data sources
  • Utilizing information to predict demand
• Typical Industry
  • Retail
Analyzing Network Data to Predict Failure
• Take the computation to the data
  • Extending the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
  • How anomalies previously thought to be discrete may, in fact, be interconnected
• Identify leading indicators of component failure
• Typical Industry
  • Utilities, Telecommunications, Datacenters
Threat Analysis
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e. threats
• Typical Industry
  • Security, Financial Services, click fraud
Search Quality
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
  • Browsing patterns of users performing searches in different categories
• Typical Industry
  • Web
  • Ecommerce
Data ‘Sandbox’
• With Hadoop, an organization can dump all this data into an HDFS cluster
• Then use Hadoop to start trying out different analyses on the data
• See patterns or relationships that allow the organization to derive additional value from the data
• Typical Industry
  • Common across all industries

Hadoop Core
Apache Hadoop Core

• Hadoop is a distributed storage and processing technology for large-scale applications.
• HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster.
• MapReduce: Framework for running large data processing jobs in parallel across many nodes and combining the results.
Master/Slave Model


Hadoop Distributed File System

• The Hadoop Distributed File System (HDFS) stores files across all of the nodes in a Hadoop cluster.
• It breaks files into large blocks and distributes them across different machines.
• It also makes multiple copies of each block, so that if any one machine fails, no data is lost or becomes unavailable.
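As a rough illustration of the block-and-replica idea above, the following Python sketch cuts a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions for the example; real HDFS placement is configurable and considers rack topology.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    n_blocks = -(-file_size // block_size)  # ceiling division on integers
    sizes = [block_size] * n_blocks
    if n_blocks:
        sizes[-1] = file_size - block_size * (n_blocks - 1)  # last block may be short
    return sizes

def place_replicas(n_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(n_blocks)
    }

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block.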
HDFS Features
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

Hadoop Distributed File System
• The brain of HDFS is the NameNode.
  • Maintains the master list of files in HDFS
  • Handles mapping of filenames to blocks
  • Knows where each block is stored
  • Ensures each block is replicated the appropriate number of times
• DataNodes are machines that store HDFS data.
• Each DataNode is colocated with a TaskTracker to allow moving the computation to the data.
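The NameNode responsibilities listed above can be pictured as two lookup tables plus a replication check. The sketch below is a toy model under stated assumptions: the class name, method names, block ids, and DataNode names are hypothetical and do not correspond to Hadoop's actual classes.

```python
class NameNodeSketch:
    """Toy model of the metadata the slide attributes to the NameNode."""

    def __init__(self, replication=3):
        self.replication = replication
        self.file_blocks = {}      # filename -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, filename, placements):
        """placements: list of (block_id, [datanode, ...]) in file order."""
        self.file_blocks[filename] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_locations[block_id] = set(nodes)

    def locate(self, filename):
        """Map a filename to its blocks and the DataNodes holding each block."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[filename]]

    def datanode_failed(self, node):
        """Forget a dead DataNode and report blocks that now need re-replication."""
        under_replicated = []
        for block_id, nodes in self.block_locations.items():
            nodes.discard(node)
            if len(nodes) < self.replication:
                under_replicated.append(block_id)
        return sorted(under_replicated)

nn = NameNodeSketch(replication=2)
nn.add_file("/logs/a.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
located = nn.locate("/logs/a.txt")
needs_copy = nn.datanode_failed("dn1")
```

After "dn1" is marked as failed, only "blk_1" falls below the target of two replicas, so it is the only block reported for re-replication.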
HDFS Design
• Very large files
• Streaming data access
  • Time to read the whole file is more important than the time to read the first record
• Commodity hardware
• Optimized for high throughput
• Not fit for
  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications
HDFS architecture

MapReduce
• MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
• The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc.
• On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
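To make the paradigm concrete, here is a minimal in-process word count in Python. It only mimics the map, group, and reduce steps on one machine; there is no JobTracker or TaskTracker involved, and the function names `map_fn`, `reduce_fn`, and `run_job` are assumptions for this sketch rather than Hadoop APIs.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values seen for one key."""
    return word, sum(counts)

def run_job(lines):
    """Group map output by key, then reduce each group (single process, no cluster)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

result = run_job(["big data", "Big Hadoop"])
```

Each call to `map_fn` depends on a single line only, which is what lets a framework run many map tasks in parallel on different parts of the input.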
Hadoop Ecosystem
YARN
• The YARN resource manager coordinates the allocation of compute resources on the cluster.
• The YARN node managers launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
Pig
• Pig provides an engine for executing data flows in parallel on Hadoop.
• Pig Latin is a simple-to-understand data flow language used in the analysis of large data sets.
• Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter.
• Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance and combines MapReduce jobs together.
Hive
• A data warehouse system layer built on Hadoop
• Allows you to define a structure for your unstructured Big Data
• Simplifies analysis and queries with an SQL-like scripting language called HiveQL
• Produces MapReduce jobs in the background
• Extensible (UDFs, UDAFs, UDTFs)
• Supports uses such as:
  • Ad hoc queries
  • Summarization
  • Data analysis
Hive is not
• … a relational database
• … designed for online transaction processing
• … suited for real-time queries and row-level updates
Stinger for Hive

Ambari
• Ambari for Hadoop clusters
  • Provision
  • Manage
  • Monitor
• Provides a step-by-step wizard for installing Hadoop services across any number of hosts
• Handles configuration of Hadoop services for the cluster
Sqoop and Flume
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large quantities of streaming data (e.g. logs) into HDFS. It has a simple and flexible architecture based on streaming data flows.
Schemas - HCatalog
• A table and storage management service for data created using Apache Hadoop
  • Provides a shared schema and data type mechanism.
  • Provides a table abstraction so that users need not be concerned with where or how their data is stored.
  • Provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
• Example
  • stocks_daily = load 'nyse_daily' using HCatLoader();
  • cleansed = filter stocks_daily by symbol is not null;
Mahout
• The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries.
• Core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
• The core libraries are highly optimized, so they also perform well for non-distributed algorithms.
Hadoop Core in Detail
Map Phase
• In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or apply filters.
• Certain types of joins and grouping can also be done in the map (e.g., joins where the data is already sorted, or hash-based aggregation).
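The three typical map-side operations named above (filter, project, transform) fit in one small function. The sketch below assumes a hypothetical web-log record layout with `user_id`, `status`, `bytes`, and `agent` fields; none of these names come from the slides.

```python
def map_record(record):
    """Operate on one record at a time, independently of all other records."""
    if record["status"] != 200:
        return  # filter: drop failed requests, emit nothing
    # project: keep only two fields; transform: bytes to whole kilobytes
    yield record["user_id"], record["bytes"] // 1024

records = [
    {"user_id": "u1", "status": 200, "bytes": 2048, "agent": "x"},
    {"user_id": "u2", "status": 500, "bytes": 4096, "agent": "y"},
    {"user_id": "u1", "status": 200, "bytes": 1024, "agent": "z"},
]
mapped = [pair for record in records for pair in map_record(record)]
```

The `agent` field never leaves the mapper, so it is never sent across the network to the reducers.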
Data Locality

Combiner Phase
• The combiner minimizes the data transferred between map and reduce tasks.
• It gives applications a chance to apply their reducer logic early on.
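A small sketch of what a combiner buys, assuming the reducer logic is a sum (which is safe to apply early because addition is associative and commutative). The `combine` function and the sample map output are illustrative only.

```python
from collections import Counter

def combine(map_output):
    """Apply the reducer logic (sum) locally to the output of one map task."""
    combined = Counter()
    for key, value in map_output:
        combined[key] += value
    return sorted(combined.items())

# Output of a single hypothetical map task before it is sent to the reducers.
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
combined = combine(map_output)
```

Four records shrink to two before they cross the network; the reducer then sums the partial sums and reaches the same totals.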
Shuffle Phase
• Data arriving at the reducer has been partitioned and sorted by the map, combine, and shuffle phases.
• By default, the data is sorted by the partition key. For example, if a user has a data set partitioned on user ID, in the reducer it will be sorted by user ID as well. Thus, MapReduce uses sorting to group like keys together.
• It is possible to specify additional sort keys beyond the partition key.
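Partitioning and sorting can be sketched as follows. The use of `zlib.crc32` as the hash is an assumption made so the example is deterministic (Hadoop's default partitioner hashes the key object instead), and sorting on the whole `(key, value)` tuple stands in for an additional sort key.

```python
import zlib

def partition(key, n_reducers):
    """Pick the reducer for a key with a deterministic hash (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

def shuffle(pairs, n_reducers):
    """Route each pair to one reducer, then sort every reducer's input."""
    buckets = [[] for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[partition(key, n_reducers)].append((key, value))
    # Tuples sort by key first, then by value: the value acts as a secondary sort key.
    return [sorted(bucket) for bucket in buckets]

pairs = [("u2", 5), ("u1", 9), ("u2", 1), ("u3", 4), ("u1", 2)]
shuffled = shuffle(pairs, 2)
```

All records for one user ID end up next to each other in exactly one reducer's input, which is what makes grouping by key possible in the reduce phase.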
Shuffle

Reduce Phase
• The input to the reduce phase is each key from the shuffle plus all of the records associated with that key.
• Because all records with the same value for the key are now collected together, it is possible to do joins and aggregation operations such as counting.
• The MapReduce user explicitly controls parallelism in the reduce.
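The following sketch shows a reduce-side join combined with a count, assuming records were tagged in the map phase as either "user" or "order". The tags, the sample data, and the `reduce_join` name are hypothetical; `itertools.groupby` plays the role of the framework handing each key and its records to the reducer.

```python
from itertools import groupby
from operator import itemgetter

def reduce_join(key, records):
    """Join 'user' and 'order' records that share a key; count and total the orders."""
    names = [value for tag, value in records if tag == "user"]
    orders = [value for tag, value in records if tag == "order"]
    return [(key, name, len(orders), sum(orders)) for name in names]

# Shuffle output: sorted by key, so equal keys are adjacent.
shuffled = sorted([
    ("u1", ("order", 30)), ("u2", ("user", "Bora")),
    ("u1", ("user", "Ayse")), ("u1", ("order", 12)),
], key=itemgetter(0))

output = []
for key, group in groupby(shuffled, key=itemgetter(0)):
    output.extend(reduce_join(key, [value for _, value in group]))
```

Each reducer call sees every record for its key and nothing else, so the join needs no lookups outside the group.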
Output Phase
• The reducer (or map in a map-only job) writes its output via an OutputFormat.
• OutputFormat is responsible for providing a RecordWriter, which takes the key-value pairs produced by the task and stores them.
• This includes serializing, possibly compressing, and writing them to HDFS, HBase, etc.
Map Reduce Logical Flow
MapReduce Processing Model
Speculative Execution
• If a Mapper runs slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data.
• The result of the first Mapper to finish will be used.
• Hadoop will kill the Mapper that is still running.
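The "first attempt to finish wins" rule can be imitated with two threads. This is a simplified sketch under several assumptions: both attempts start at the same time (the slides describe the backup starting only after slowness is noticed), delays are simulated with timers, and a shared `threading.Event` stands in for killing the losing attempt. The node names are made up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def attempt(node, delay, split, cancelled):
    """Simulate one task attempt; returns None if cancelled before it finishes."""
    if cancelled.wait(timeout=delay):  # True means the cancel signal arrived first
        return None
    return node, sum(split)

def run_with_speculation(split):
    """Run the same task on two simulated nodes and keep the first result."""
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        original = pool.submit(attempt, "slow-node", 2.0, split, cancelled)
        backup = pool.submit(attempt, "fast-node", 0.05, split, cancelled)
        done, _ = wait([original, backup], return_when=FIRST_COMPLETED)
        winner = next(iter(done)).result()
        cancelled.set()  # stand-in for killing the attempt that is still running
    return winner

winner = run_with_speculation([1, 2, 3])
```

Both attempts compute the same answer from the same input split, which is why it is safe to discard whichever one loses the race.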
Distributed Cache
• Sometimes all or many of the tasks in a MapReduce job will need to access a single file or a set of files.
• When thousands of map or reduce tasks attempt to open the same HDFS file simultaneously, this puts a large strain on the NameNode and the DataNodes storing that file.
• To avoid this situation, MapReduce provides the distributed cache.
• The distributed cache allows users to specify, as part of their MapReduce job, any HDFS files they want every task to have access to.
• These files are then copied onto the local disk of the task nodes as part of the task initiation. Map or reduce tasks can then read these as local files.
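A common use of such a locally available file is a map-side lookup. The sketch below imitates that pattern in plain Python: a temporary directory stands in for the task node's local disk, and `localize`, `LookupMapper`, and the `countries.csv` file are hypothetical names, not part of any Hadoop API.

```python
import csv
import os
import tempfile

def localize(cache_rows, task_dir):
    """Stand-in for the framework copying a cached file to a task node's local disk."""
    path = os.path.join(task_dir, "countries.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(cache_rows)
    return path

class LookupMapper:
    def __init__(self, local_path):
        # Load the side file once per task, not once per record.
        with open(local_path, newline="") as f:
            self.names = dict(csv.reader(f))

    def map(self, record):
        """Enrich one (user, country_code) record from the local lookup table."""
        user, code = record
        return user, self.names.get(code, "unknown")

with tempfile.TemporaryDirectory() as task_dir:
    mapper = LookupMapper(localize([("TR", "Turkey"), ("DE", "Germany")], task_dir))
    joined = [mapper.map(record) for record in [("u1", "TR"), ("u2", "FR")]]
```

Every record is enriched from memory after a single local read, so no task has to contact the NameNode or DataNodes for the lookup data while processing records.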
Setting up Environment
• Hortonworks Sandbox: http://hortonworks.com/products/sandboxinstructions/
• VMware: http://www.vmware.com/products/player/overview.html
• Setup Guide: http://hortonworks.com/wpcontent/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
Hortonworks Sandbox
MapReduce Demo
• Eclipse Plugin:
  • HDFS operations
  • Running WordCount, TopK
  • Generating jars for HDP Sandbox
• Sandbox:
  • HDFS operations
  • Loading and running jar files
  • Oozie and Ambari
Hive Demo
• Create Table with HCatalog
• Load Data into Hive
• Query Data
• OUTPUT to Table/HDFS/Local
• JOIN
Pig Demo
• Load Data
• Transform
• Grouping
• JOIN

Thank you

Histor y of HAM Radio presentation slide
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Hadoop Technology

  • 2. Big Data
      • Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, records, transactions, tables, files)
      • Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. (Batch, near-time, real-time, streams)
      • Variety: Big data extends beyond structured data to include semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more. (Structured, unstructured, semi-structured)
      • Verification: With all the big data there will be bad data, and with diverse data there will be more diverse quality and security levels of users. (Good, undefined, bad, inconsistency, incompleteness, ambiguity)
      • Value
  • 3. Big Data – Data Sources
  • 4. Big Data – Data Growth
  • 5. Hadoop Characteristics
      • Open source
      • Distributed data replication
      • Commodity hardware
      • Data and analysis co-location
      • Scalability
      • Reliable error handling
  • 6. Hadoop Storyline
      • 2003: Google published GFS & MapReduce paper
      • 2006: Apache Hadoop project started for Yahoo requirements
      • 2008: Cloudera founded
      • 2009: First commercial Hadoop distribution released; enterprise support is available
      • 2011: Hortonworks founded
      • 2012: Ecosystem reaches 300 companies
  • 9. RDBMS vs. Hadoop
      • Data size: RDBMS: terabytes; Hadoop: petabytes
      • Schema: RDBMS: required on write; Hadoop: required on read
      • Speed: RDBMS: reads are fast; Hadoop: writes are fast
      • Access: RDBMS: interactive and batch; Hadoop: batch
      • Updates: RDBMS: write and read many times; Hadoop: write once, read many times
      • Scaling: RDBMS: scale up; Hadoop: scale out
      • Data types: RDBMS: structured; Hadoop: multi-structured and unstructured
      • Integrity: RDBMS: high; Hadoop: low
      • Best use: RDBMS: interactive OLAP analytics, complex ACID transactions, operational data store; Hadoop: data discovery, processing unstructured data, massive storage/processing
  • 10. Benefits of Analysing with Hadoop
      • Enables analysis that was previously impossible or impractical
      • Analysis conducted at lower cost
      • Greater flexibility
  • 11. Big Data & Hadoop in Turkcell
      • Processing «Big Data» since 2009 with Cirrus
      • Hadoop has been in production since December 2012
      • ~4.5B records / ~3.5 TB of data are processed with Cirrus
      • Data is not stored for future analysis
      • Cloudera Distribution for Hadoop (unsupported)
      • 5 x 24-core machines with SAN storage (not the reference architecture)
  • 12. Common Hadoop-able Problems
      • Modeling true risk
      • Customer churn analysis
      • Recommendation engine
      • Ad targeting
      • POS transaction analysis
      • Analyzing network data to predict failure
      • Threat analysis
      • Search quality
      • Data 'sandbox'
  • 14. Modeling True Risk
      • Source, parse and aggregate disparate data sources to build a comprehensive data picture, e.g. credit card records, call recordings, chat sessions, emails, banking activity
      • Structure and analyze: sentiment analysis, graph creation, pattern recognition
      • Typical industry: financial services (banks, insurance)
  • 16. Customer Churn Analysis
      • Rapidly test and build a behavioral model of the customer from disparate sources
      • Structure and analyse with Hadoop: traversing, graph creation, pattern recognition
      • Typical industry: telecommunications, financial services
  • 18. Recommendation Engine
      • Batch processing framework: allows execution in parallel over large datasets
      • Collaborative filtering: collecting 'taste' information from many users, then using it to predict what similar users like
      • Typical industry: e-commerce, manufacturing, retail
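The collaborative filtering idea on this slide can be illustrated with a minimal single-machine sketch. It assumes user-based filtering with Jaccard similarity over sets of liked items; the slide does not name a specific algorithm, and the user names, items and helper functions below are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity of two users = shared liked items / all liked items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user: str, likes: dict) -> list:
    """Rank items the user has not seen by the similarity of the users who like them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        if sim == 0:
            continue  # users with no shared taste contribute nothing
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda i: (-scores[i], i))

# Hypothetical 'taste' data collected from three users
likes = {
    "ana": {"phone", "case", "charger"},
    "ben": {"phone", "case", "headset"},
    "cem": {"laptop", "mouse"},
}
recs = recommend("ana", likes)
print(recs)
```

On a cluster the pairwise similarity computation is the part that would be spread over many nodes; the scoring logic stays the same.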
  • 20. Ad Targeting
      • Data analysis can be conducted in parallel, reducing processing times from days to hours
      • With Hadoop, as data volumes grow the only expansion cost is hardware
      • Add more nodes without degradation in performance
      • Typical industry: advertising
  • 21. Point of Sale Transaction Analysis
  • 22. Point of Sale Transaction Analysis
      • Batch processing framework: allows execution in parallel over large datasets
      • Pattern recognition: optimizing over multiple data sources, using the information to predict demand
      • Typical industry: retail
  • 23. Analyzing Network Data to Predict Failure
  • 24. Analyzing Network Data to Predict Failure
      • Take the computation to the data
      • Extend the range of indexing techniques from simple scans to more complex data mining
      • Better understand how the network reacts to fluctuations
      • See how anomalies previously thought to be discrete may, in fact, be interconnected
      • Identify leading indicators of components
      • Typical industry: utilities, telecommunications, datacenters
  • 26. Threat Analysis
      • Parallel processing over huge datasets
      • Pattern recognition to identify anomalies, i.e. threats
      • Typical industry: security, financial services, click fraud
  • 28. Search Quality
      • Analysing search attempts in conjunction with structured data
      • Pattern recognition: browsing patterns of users performing searches in different categories
      • Typical industry: web, e-commerce
  • 30. Data 'Sandbox'
      • With Hadoop an organization can dump all of its data into an HDFS cluster
      • Then use Hadoop to start trying out different analyses on the data
      • See patterns or relationships that allow the organization to derive additional value from the data
      • Typical industry: common across all industries
  • 32. Apache Hadoop Core
      • Hadoop is a distributed storage and processing technology for large-scale applications
      • HDFS: self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
      • MapReduce: framework for running large data processing jobs in parallel across many nodes and combining the results
  • 34. Hadoop Distributed File System
      • The Hadoop Distributed File System (HDFS) stores files across all of the nodes in a Hadoop cluster.
      • It handles breaking the files into large blocks and distributing them across different machines.
      • It also makes multiple copies of each block so that if any one machine fails, no data is lost or unavailable.
  • 35. HDFS Features
      • Highly fault-tolerant
      • High throughput
      • Suitable for applications with large data sets
      • Streaming access to file system data
      • Can be built out of commodity hardware
  • 36. Hadoop Distributed File System
      • The brain of HDFS is the NameNode. It:
          • Maintains the master list of files in HDFS
          • Handles mapping of filenames to blocks
          • Knows where each block is stored
          • Ensures each block is replicated the appropriate number of times
      • DataNodes are machines that store HDFS data.
      • Each DataNode is colocated with a TaskTracker to allow moving the computation to the data.
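The block splitting, replication and NameNode bookkeeping described on the HDFS slides can be sketched as a toy model. This is not how HDFS is implemented: the 4-byte block size, the round-robin placement and every name below are assumptions made for illustration (real HDFS blocks are tens of megabytes or more, and real placement is rack-aware).

```python
BLOCK_SIZE = 4    # bytes; deliberately tiny so the example is easy to follow
REPLICATION = 3   # assumed number of copies per block for this sketch

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Break a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, datanodes: list[str],
                 replication: int = REPLICATION) -> dict[int, list[str]]:
    """Toy NameNode table: block index -> DataNodes holding a copy (round-robin)."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

def available_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Blocks that still have at least one copy after one DataNode fails."""
    return {b for b, nodes in placement.items()
            if any(n != failed_node for n in nodes)}

blocks = split_into_blocks(b"hello hadoop!")   # 13 bytes -> 4 blocks
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement)
print(available_blocks(placement, "dn1"))      # every block survives the failure
```

Because each block lives on three different nodes, losing any single node leaves every block readable, which is the property the slide on replication describes.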
  • 37. HDFS Design
      • Very large files
      • Streaming data access: the time to read the whole file is more important than the latency of reading the first record
      • Commodity hardware
      • Optimized for high throughput
      • Not a fit for:
          • Low-latency data access
          • Lots of small files
          • Multiple writers, arbitrary file modifications
  • 39. MapReduce
      • MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
      • The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc.
      • On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
  • 43. YARN
      • The YARN resource manager coordinates the allocation of compute resources on the cluster.
      • The YARN node managers launch and monitor the compute containers on machines in the cluster.
      • The MapReduce application master coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
  • 44. Pig
      • Pig provides an engine for executing data flows in parallel on Hadoop.
      • Pig Latin is a simple-to-understand data flow language used in the analysis of large data sets.
      • Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter.
      • Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance and combines MapReduce jobs together.
  • 45. Hive
      • Is a data warehouse system layer built on Hadoop
      • Allows you to define a structure for your unstructured Big Data
      • Simplifies analysis and queries with an SQL-like scripting language called HiveQL
      • Produces MapReduce jobs in the background
      • Extensible (UDFs, UDAFs, UDTFs)
      • Supports uses such as ad hoc queries, summarization and data analysis
  • 46. Hive is not
      • … a relational database
      • … designed for online transaction processing
      • … suited for real-time queries and row-level updates
  • 48. Ambari
      • Ambari for Hadoop clusters: provision, manage, monitor
  • 49. Ambari
      • Provides a step-by-step wizard for installing Hadoop services across any number of hosts
      • Handles configuration of Hadoop services for the cluster
  • 50. Sqoop and Flume
      • Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
      • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large quantities of streaming data (e.g. logs) into HDFS. It has a simple and flexible architecture based on streaming data flows.
  • 51. Schemas - HCatalog
      • A table and storage management service for data created using Apache Hadoop
      • Provides a shared schema and data type mechanism
      • Provides a table abstraction so that users need not be concerned with where or how their data is stored
      • Provides interoperability across data processing tools such as Pig, MapReduce, and Hive
      • Example (Pig Latin):
          stocks_daily = load 'nyse_daily' using HCatLoader();
          cleansed = filter stocks_daily by symbol is not null;
  • 52. Mahout
      • The goal of the Apache Mahout™ project is to build scalable machine learning libraries.
      • Core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
      • The core libraries are highly optimized to also give good performance for non-distributed algorithms.
  • 54. Map Phase
      • In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or apply filters.
      • Certain types of joins and grouping can also be done in the map (e.g., joins where the data is already sorted or hash-based aggregation).
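A minimal sketch of what a map function does with a single record, written in plain Python rather than against the Hadoop API. The record layout (user ID, page, duration) and the function name are assumptions made for illustration.

```python
def map_record(line: str):
    """Process one input record in isolation and emit zero or more (key, value) pairs."""
    fields = line.split(",")
    if len(fields) != 3:
        return                      # filter: drop malformed records
    user_id, _page, duration = fields
    if not duration.isdigit():
        return                      # filter: drop records with a bad duration
    # project: keep only user_id and duration; transform: parse duration to int
    yield user_id.strip(), int(duration)

# Hypothetical input: one click record per line
records = ["u1,home,30", "u2,search,12", "bad line", "u1,cart,5"]
mapped = [pair for line in records for pair in map_record(line)]
print(mapped)
```

Each call sees only its own record, which is what lets the framework run many map tasks in parallel over different parts of the input.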
  • 56. Combiner Phase
      • Minimizes the data transferred between map and reduce tasks.
      • The combiner gives applications a chance to apply their reducer logic early on.
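The effect of a combiner can be sketched for a word-count style job, assuming the reducer logic is a sum. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def combine(map_output):
    """Apply the reducer's logic (here: a sum) to the output of a single map task."""
    partial = defaultdict(int)
    for key, value in map_output:
        partial[key] += value
    return sorted(partial.items())

# Hypothetical output of one map task: four pairs to send over the network
map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
print(combined)   # two pairs instead of four
```

Running the reducer logic early is only safe when the operation gives the same answer on partial results, as a sum or a maximum does; an average, for example, would have to carry a sum and a count instead.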
  • 57. Shuffle Phase
      • Data arriving at the reducer has been partitioned and sorted by the map, combine, and shuffle phases.
      • By default, the data is sorted by the partition key. For example, if a user has a data set partitioned on user ID, in the reducer it will be sorted by user ID as well. Thus, MapReduce uses sorting to group like keys together.
      • It is possible to specify additional sort keys beyond the partition key.
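Partitioning and sorting by key can be sketched as follows. The byte-sum partitioner is a deterministic stand-in chosen for this example, not the partitioner Hadoop uses, and the data is hypothetical.

```python
def partition(key: str, num_reducers: int) -> int:
    """Toy partitioner: every occurrence of a key maps to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers: int):
    """Route each (key, value) pair to its reducer, then sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])   # sorting places like keys next to each other
    return partitions

# Hypothetical outputs of two map tasks, keyed by user ID
map_outputs = [[("u2", 12), ("u1", 30)], [("u1", 5), ("u3", 7)]]
parts = shuffle(map_outputs, num_reducers=2)
print(parts)
```

After the shuffle, both "u1" records sit side by side in the same partition even though they came from different map tasks.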
  • 59. Reduce Phase
      • The input to the reduce phase is each key from the shuffle plus all of the records associated with that key.
      • Because all records with the same value for the key are now collected together, it is possible to do joins and aggregation operations such as counting.
      • The MapReduce user explicitly controls parallelism in the reduce.
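A minimal sketch of the reduce step: the reducer is called once per key with all of that key's values. It assumes the input is already sorted by key, as the shuffle guarantees; the function names and data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def reduce_partition(sorted_pairs, reducer):
    """Call reducer(key, values) once per key; input must be sorted by key."""
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]

def count_and_sum(key, values):
    """Example aggregation: number of records and total value per key."""
    return key, len(values), sum(values)

# Hypothetical reducer input, already grouped by the shuffle
sorted_pairs = [("u1", 30), ("u1", 5), ("u3", 7)]
result = reduce_partition(sorted_pairs, count_and_sum)
print(result)
```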
  • 61. Output Phase
      • The reducer (or map in a map-only job) writes its output via an OutputFormat.
      • OutputFormat is responsible for providing a RecordWriter, which takes the key-value pairs produced by the task and stores them.
      • This includes serializing, possibly compressing, and writing them to HDFS, HBase, etc.
  • 65. Speculative Execution
      • If a Mapper runs slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data.
      • The result of the first Mapper to finish will be used.
      • Hadoop will kill the Mapper that is still running.
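The policy on this slide can be illustrated with a toy simulation. The rule used to decide that a task is slow (1.5 times the median duration) is an assumption made for this sketch and is not Hadoop's actual heuristic; the task names and durations are hypothetical.

```python
from statistics import median

def speculate(durations: dict, backup_duration: float, slow_factor: float = 1.5) -> dict:
    """Return {task: (finish_time, winner)} under a toy speculative-execution policy."""
    threshold = slow_factor * median(durations.values())
    outcome = {}
    for task, original in durations.items():
        if original <= threshold:
            outcome[task] = (original, "original")   # not slow: no backup is started
            continue
        # A backup attempt is launched once the task is seen to be slow
        backup_finish = threshold + backup_duration
        if backup_finish < original:
            outcome[task] = (backup_finish, "backup")    # original attempt is killed
        else:
            outcome[task] = (original, "original")       # backup attempt is killed
    return outcome

# Hypothetical run times in seconds; map-2 is a straggler
durations = {"map-0": 10.0, "map-1": 11.0, "map-2": 40.0}
outcome = speculate(durations, backup_duration=10.0)
print(outcome)
```

In this run the straggler's backup finishes first, so the phase completes well before the 40 seconds the original attempt would have needed.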
  • 66. Distributed Cache
      • Sometimes all or many of the tasks in a MapReduce job will need to access a single file or a set of files.
      • When thousands of map or reduce tasks attempt to open the same HDFS file simultaneously, this puts a large strain on the NameNode and the DataNodes storing that file.
      • To avoid this situation, MapReduce provides the distributed cache.
      • The distributed cache allows users to specify, as part of their MapReduce job, any HDFS files they want every task to have access to.
      • These files are then copied onto the local disk of the task nodes as part of the task initiation. Map or reduce tasks can then read these as local files.
  • 67. Setting up Environment
      • Hortonworks Sandbox: http://hortonworks.com/products/sandboxinstructions/
      • VMware: http://www.vmware.com/products/player/overview.html
      • Setup Guide: http://hortonworks.com/wpcontent/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
  • 70. MapReduce Demo
      • Eclipse Plugin:
          • HDFS operations
          • Running WordCount, TopK
          • Generating jars for HDP Sandbox
      • Sandbox:
          • HDFS operations
          • Loading and running jar files
          • Oozie and Ambari
  • 71. Hive Demo
      • Create table with HCatalog
      • Load data into Hive
      • Query data
      • OUTPUT to table/HDFS/local
      • JOIN