Hadoop Technology

1. Hadoop
2. Big Data
   • Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, records, transactions, tables, files)
   • Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business. (Batch, near-time, real-time, streams)
   • Variety: Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files, and more. (Structured, unstructured, semi-structured)
   • Verification: With all the big data there will be bad data, and with diverse data there will be more diverse quality and security levels of users. (Good, undefined, bad; inconsistency, incompleteness, ambiguity)
   • Value
3. Big Data – Data Sources
4. Big Data – Data Growth
5. Hadoop Characteristics
   • Open source
   • Distributed data replication
   • Commodity hardware
   • Data and analysis co-location
   • Scalability
   • Reliable error handling
6. Hadoop Storyline
   • 2003: Google published the GFS & MapReduce papers
   • 2006: Apache Hadoop project started for Yahoo requirements
   • 2008: Cloudera founded
   • 2009: First commercial Hadoop distribution released; enterprise support is available
   • 2011: Hortonworks founded
   • 2012: Ecosystem reaches 300 companies
7. Hadoop for Enterprise
8. RDBMS vs. Hadoop
9. RDBMS vs. Hadoop

   | Aspect     | RDBMS                                                                          | Hadoop                                                                    |
   |------------|--------------------------------------------------------------------------------|---------------------------------------------------------------------------|
   | Data size  | Terabytes                                                                      | Petabytes                                                                 |
   | Schema     | Required on write                                                              | Required on read                                                          |
   | Speed      | Reads are fast                                                                 | Writes are fast                                                           |
   | Access     | Interactive and batch                                                          | Batch                                                                     |
   | Updates    | Write and read many times                                                      | Write once, read many times                                               |
   | Scaling    | Scale up                                                                       | Scale out                                                                 |
   | Data types | Structured                                                                     | Multi- and unstructured                                                   |
   | Integrity  | High                                                                           | Low                                                                       |
   | Best use   | Interactive OLAP analytics, complex ACID transactions, operational data store  | Data discovery, processing unstructured data, massive storage/processing  |
10. Benefits of Analysing with Hadoop
   • Previously impossible or impractical analyses become feasible
   • Analysis conducted at lower cost
   • Greater flexibility
11. Big Data & Hadoop in Turkcell
   • Processing «Big Data» since 2009 with Cirrus
   • Hadoop has been in production since December '12
   • ~4.5B records / ~3.5TB of data are processed with Cirrus
   • Data is not stored for future analysis
   • Cloudera Distribution for Hadoop (non-supported)
   • 5 x 24-core machines with SAN storage (not the reference architecture)
12. Common Hadoop-able Problems
   • Modeling True Risk
   • Customer Churn Analysis
   • Recommendation Engine
   • Ad Targeting
   • Point-of-Sale Transaction Analysis
   • Analyzing Network Data to Predict Failure
   • Threat Analysis
   • Search Quality
   • Data ‘Sandbox’
13. Modeling True Risk
14. Modeling True Risk
   • Source, parse, and aggregate disparate data sources to build a comprehensive data picture
     • E.g. credit card records, call recordings, chat sessions, emails, banking activity
   • Structure and analyze
     • Sentiment analysis, graph creation, pattern recognition
   • Typical industry
     • Financial services (banks, insurance)
15. Customer Churn Analysis
16. Customer Churn Analysis
   • Rapidly test and build behavioral models of customers from disparate sources
   • Structure and analyse with Hadoop
     • Traversing
     • Graph creation
     • Pattern recognition
   • Typical industry
     • Telecommunications, financial services
17. Recommendation Engine
18. Recommendation Engine
   • Batch processing framework
     • Allows execution in parallel over large datasets
   • Collaborative filtering
     • Collecting ‘taste’ information from many users
     • Utilizing that information to predict what similar users like
   • Typical industry
     • E-commerce, manufacturing, retail
19. Ad Targeting
20. Ad Targeting
   • Data analysis can be conducted in parallel, reducing processing times from days to hours
   • With Hadoop, as data volumes grow the only expansion cost is hardware
     • Add more nodes without degradation in performance
   • Typical industry
     • Advertising
21. Point of Sale Transaction Analysis
22. Point of Sale Transaction Analysis
   • Batch processing framework
     • Allows execution in parallel over large datasets
   • Pattern recognition
     • Optimizing over multiple data sources
     • Utilizing that information to predict demand
   • Typical industry
     • Retail
23. Analyzing Network Data to Predict Failure
24. Analyzing Network Data to Predict Failure
   • Take the computation to the data
     • Extending the range of indexing techniques from simple scans to more complex data mining
   • Better understand how the network reacts to fluctuations
     • How previously thought discrete anomalies may, in fact, be interconnected
   • Identify leading indicators of component failure
   • Typical industry
     • Utilities, telecommunications, datacenters
25. Threat Analysis
26. Threat Analysis
   • Parallel processing over huge datasets
   • Pattern recognition to identify anomalies, i.e. threats
   • Typical industry
     • Security, financial services, click fraud…
27. Search Quality
28. Search Quality
   • Analysing search attempts in conjunction with structured data
   • Pattern recognition
     • Browsing patterns of users performing searches in different categories
   • Typical industry
     • Web
     • E-commerce
29. Data ‘Sandbox’
30. Data ‘Sandbox’
   • With Hadoop, an organization can dump all of this data into an HDFS cluster
   • Then use Hadoop to start trying out different analyses on the data
   • See patterns or relationships that allow the organization to derive additional value from the data
   • Typical industry
     • Common across all industries
31. Hadoop Core
32. Apache Hadoop Core
   • Hadoop is a distributed storage and processing technology for large-scale applications
   • HDFS: self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
   • MapReduce: framework for running large data processing jobs in parallel across many nodes and combining the results
33. Master/Slave Model
34. Hadoop Distributed File System
   • The Hadoop Distributed File System (HDFS) stores files across all of the nodes in a Hadoop cluster.
   • It handles breaking the files into large blocks and distributing them across different machines.
   • It also makes multiple copies of each block so that if any one machine fails, no data is lost or unavailable.
35. HDFS – Features
   • Highly fault-tolerant
   • High throughput
   • Suitable for applications with large data sets
   • Streaming access to file system data
   • Can be built out of commodity hardware
36. Hadoop Distributed File System
   • The brain of HDFS is the NameNode.
     • Maintains the master list of files in HDFS
     • Handles mapping of filenames to blocks
     • Knows where each block is stored
     • Ensures each block is replicated the appropriate number of times
   • DataNodes are machines that store HDFS data.
     • Each DataNode is colocated with a TaskTracker to allow moving the computation to the data.
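To make the NameNode/DataNode division concrete, here is a minimal sketch (not from the slides) of a client writing and reading a file through the HDFS Java API; the path and contents are hypothetical, and the Configuration is assumed to pick up the cluster settings from core-site.xml/hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // filename/block metadata comes from the NameNode

        Path path = new Path("/tmp/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");           // blocks end up replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());     // reads are served by the DataNodes holding the blocks
        }
    }
}
```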
37. HDFS – Design
   • Very large files
   • Streaming data access
     • Time to read the whole file is more important than the time to read the first record
   • Commodity hardware
   • Optimized for high throughput
   • Not a fit for:
     • Low-latency data access
     • Lots of small files
     • Multiple writers, arbitrary file modifications
38. HDFS Architecture
39. MapReduce
   • MapReduce is the framework for running jobs in Hadoop. It provides a simple and powerful paradigm for parallelizing data processing.
   • The JobTracker is the central coordinator of jobs in MapReduce. It controls which jobs are being run, which resources they are assigned, etc.
   • On each node in the cluster there is a TaskTracker that is responsible for running the map or reduce tasks assigned to it by the JobTracker.
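As a concrete illustration of how a job reaches those daemons, here is a minimal WordCount-style driver sketch using the org.apache.hadoop.mapreduce API; the mapper and reducer classes it names are assumptions here, sketched under the map- and reduce-phase slides below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // sketched under the Map Phase slide
        job.setReducerClass(WordCountReducer.class); // sketched under the Reduce Phase slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submits and waits
    }
}
```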
40. Hadoop Ecosystem
41. Hadoop Ecosystem
43. YARN
   • The YARN resource manager coordinates the allocation of compute resources on the cluster.
   • The YARN node managers launch and monitor the compute containers on machines in the cluster.
   • The MapReduce application master coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
44. Pig
   • Pig provides an engine for executing data flows in parallel on Hadoop.
   • Pig Latin is a simple-to-understand data flow language used in the analysis of large data sets.
   • Pig scripts are automatically converted into MapReduce jobs by the Pig interpreter.
   • Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance and combines MapReduce jobs together.
45. Hive
   • Is a data warehouse system layer built on Hadoop
   • Allows you to define a structure for your unstructured big data
   • Simplifies analysis and queries with an SQL-like scripting language called HiveQL
   • Produces MapReduce jobs in the background
   • Extensible (UDFs, UDAFs, UDTFs)
   • Supports uses such as:
     • Ad hoc queries
     • Summarization
     • Data analysis
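Since Hive is queried much like a database, a minimal sketch of submitting HiveQL over JDBC may help; the HiveServer2 address, credentials, and stocks table are assumptions, though the driver class is Hive's standard one.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", ""); // hypothetical host/user
             Statement stmt = con.createStatement();
             // A summarization query; Hive compiles it to MapReduce jobs in the background.
             ResultSet rs = stmt.executeQuery(
                 "SELECT symbol, AVG(price_close) FROM stocks GROUP BY symbol")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```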
46. Hive is not…
   • … a relational database
   • … designed for online transaction processing
   • … suited for real-time queries and row-level updates
47. Stinger for Hive
48. Ambari
   • Ambari for Hadoop clusters:
     • Provision
     • Manage
     • Monitor
49. Ambari
   • Provides a step-by-step wizard for installing Hadoop services across any number of hosts
   • Handles configuration of Hadoop services for the cluster
50. Sqoop and Flume
   • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
   • Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large quantities of streaming data (e.g. logs) into HDFS. It has a simple and flexible architecture based on streaming data flows.
51. Schemas – HCatalog
   • A table and storage management service for data created using Apache Hadoop
     • Provides a shared schema and data type mechanism
     • Provides a table abstraction so that users need not be concerned with where or how their data is stored
     • Provides interoperability across data processing tools such as Pig, MapReduce, and Hive
   • Example (Pig Latin):
     stocks_daily = load 'nyse_daily' using HCatLoader();
     cleansed = filter stocks_daily by symbol is not null;
52. Mahout
   • The goal of the Apache Mahout™ machine learning library is to build scalable machine learning libraries.
   • Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
   • The core libraries are highly optimized to give good performance for non-distributed algorithms as well.
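As a taste of the non-distributed side mentioned in the last bullet, here is a minimal sketch of user-based collaborative filtering with Mahout's Taste API; the ratings.csv input (userID,itemID,preference rows) and the neighborhood size are assumptions.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv")); // hypothetical input
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3); // top 3 items for user 1
        items.forEach(System.out::println);
    }
}
```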
53. Hadoop Core in Detail
54. Map Phase
   • In the map phase, MapReduce gives the user an opportunity to operate on every record in the data set individually. This phase is commonly used to project out unwanted fields, transform fields, or apply filters.
   • Certain types of joins and grouping can also be done in the map (e.g., joins where the data is already sorted, or hash-based aggregation).
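Continuing the WordCount sketch from the MapReduce slide, a minimal map function: each call sees one input record (here, a line of text), and this is the point where fields would be projected out, transformed, or filtered.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one (word, 1) pair per token
        }
    }
}
```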
55. Data Locality
56. Combiner Phase
   • Minimizes the data transferred between map and reduce tasks.
   • The combiner gives applications a chance to apply their reducer logic early on.
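In the WordCount sketch, counting is associative and commutative, so the reducer class itself can double as the combiner; a one-line addition to the driver sketched earlier:

```java
job.setCombinerClass(WordCountReducer.class); // pre-aggregates map output before the shuffle
```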
57. Shuffle Phase
   • Data arriving at the reducer has been partitioned and sorted by the map, combine, and shuffle phases.
   • By default, the data is sorted by the partition key. For example, if a data set is partitioned on user ID, in the reducer it will be sorted by user ID as well. Thus, MapReduce uses sorting to group like keys together.
   • It is possible to specify additional sort keys beyond the partition key.
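A hedged excerpt of the Job hooks involved: the three classes named here are hypothetical, but these are the standard setters used to partition by one key and add a secondary sort key.

```java
job.setPartitionerClass(UserIdPartitioner.class);               // which reducer receives each key
job.setSortComparatorClass(UserIdThenTimeComparator.class);     // sort order within each partition
job.setGroupingComparatorClass(UserIdGroupingComparator.class); // which keys share one reduce() call
```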
58. Shuffle
59. Reduce Phase
   • The input to the reduce phase is each key from the shuffle plus all of the records associated with that key.
   • Because all records with the same value for the key are now collected together, it is possible to do joins and aggregation operations such as counting.
   • The MapReduce user explicitly controls parallelism in the reduce.
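The matching reduce side of the WordCount sketch: all values shuffled to a key arrive together, so counting is a simple sum. Reduce parallelism would be set explicitly in the driver via job.setNumReduceTasks(int).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // aggregate every record that shares this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```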
60. Reduce Phase
61. Output Phase
   • The reducer (or the map in a map-only job) writes its output via an OutputFormat.
   • The OutputFormat is responsible for providing a RecordWriter, which takes the key-value pairs produced by the task and stores them.
   • This includes serializing them, possibly compressing them, and writing them to HDFS, HBase, etc.
62. Map Reduce Logical Flow
63. Map Reduce Logical Flow
64. MapReduce Processing Model
65. Speculative Execution
   • If a Mapper runs slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data.
   • The result of whichever Mapper finishes first will be used.
   • Hadoop will then kill off the Mapper that is still running.
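Speculative execution can be toggled per job; a small sketch assuming the Hadoop 2.x (MRv2) property names:

```java
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", true);     // allow backup map attempts
conf.setBoolean("mapreduce.reduce.speculative", false); // but no backup reduce attempts
Job job = Job.getInstance(conf, "word count");
```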
66. Distributed Cache
   • Sometimes all or many of the tasks in a MapReduce job will need to access a single file or a set of files.
   • When thousands of map or reduce tasks attempt to open the same HDFS file simultaneously, this puts a large strain on the NameNode and the DataNodes storing that file.
   • To avoid this situation, MapReduce provides the distributed cache.
   • The distributed cache allows users to specify, as part of their MapReduce job, any HDFS files they want every task to have access to.
   • These files are then copied onto the local disk of the task nodes as part of task initiation. Map or reduce tasks can then read them as local files.
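A minimal sketch of the cache in use, assuming a hypothetical HDFS stopword file: the driver registers the file, and each task then reads its local copy during setup().

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the driver: job.addCacheFile(new java.net.URI("/lookup/stopwords.txt#stopwords"));
// The "#stopwords" fragment makes the cached copy appear under that local name.
public class StopwordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final java.util.Set<String> stopwords = new java.util.HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("stopwords"))) {
            String line;
            while ((line = r.readLine()) != null) {
                stopwords.add(line.trim()); // loaded once per task, not per record
            }
        }
    }
    // map() would then drop tokens found in the cached stopword set.
}
```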
67. Setting up the Environment
   • Hortonworks Sandbox: http://hortonworks.com/products/sandbox-instructions/
   • VMware Player: http://www.vmware.com/products/player/overview.html
   • Setup guide: http://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf
68. Hortonworks Sandbox
69. Hortonworks Sandbox
70. MapReduce Demo
   • Eclipse plugin:
     • HDFS operations
     • Running WordCount, TopK
     • Generating jars for the HDP Sandbox
   • Sandbox:
     • HDFS operations
     • Loading and running jar files
     • Oozie and Ambari
71. Hive Demo
   • Create table with HCatalog
   • Load data into Hive
   • Query data
   • Output to table/HDFS/local
   • JOIN
72. Pig Demo
   • Load data
   • Transform
   • Grouping
   • JOIN
73. Thank You