Big Data
Zekeriya Beşiroğlu
http://zekeriyabesiroglu.com
http://bilginc.com
http://twitter.com/zbesiroglu
Zekeriya Besiroglu
• Bilginc IT Academy - Expert Consultant
• +16 IT
• +14 ORACLE DB/DWH
• +7 WEBLOGIC
• +3 BIG DATA
• TROUG
• Speaker
Bilginc IT Academy
DATA TRENDS
- Facebook has a data warehouse of around 60 PB, and it is constantly growing.
- Twitter messages are 140 bytes each, generating 8 TB of data per day.
- Data is more than doubling every year.
- Almost 80% of data will be unstructured.
- Amazon: 35% of product sales come from product recommendations.
New Types of Data?
• Sentiment: Understand how your customers feel about your products / company
• Sensor/Machine: Discover patterns in data streaming automatically from sensors and machines
• Unstructured: Text, video, pictures
• Server Logs: Search logs to find patterns
• Geographic: Analyze location-based data
• Clickstream: Capture and analyze website visitor data
Big Data
https://www.youtube.com/watch?v=1GU4Imbo6R8
Capacity vs Cost
Year   Capacity (GB)   Cost per GB (USD)
1990   0.10            $4000
1997   2               $150
2002   80              $3.75
2007   750             $0.35
2012   3,000           $0.05
2015   10,000          $0.02
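As a rough illustration of the table above, the short Python sketch below multiplies each year's capacity by its cost per GB to estimate what one such drive would have cost. The figures are copied from the table, reading the 2012 and 2015 capacities as 3,000 GB and 10,000 GB; the names `DRIVES` and `drive_cost` are only illustrative.

```python
# (year, capacity in GB, cost per GB in USD), copied from the table above
DRIVES = [
    (1990, 0.10, 4000.00),
    (1997, 2, 150.00),
    (2002, 80, 3.75),
    (2007, 750, 0.35),
    (2012, 3000, 0.05),
    (2015, 10000, 0.02),
]


def drive_cost(capacity_gb, cost_per_gb):
    """Approximate price of one drive: capacity times cost per GB."""
    return capacity_gb * cost_per_gb


for year, capacity_gb, cost_per_gb in DRIVES:
    cost = drive_cost(capacity_gb, cost_per_gb)
    print(f"{year}: {capacity_gb:>8} GB  ~${cost:,.2f}")

# How many times cheaper one GB became between the first and the last row
price_drop = DRIVES[0][2] / DRIVES[-1][2]
print(f"Cost per GB fell by a factor of about {price_drop:,.0f}")
```

The price of a single drive stays within the same order of magnitude across the rows, while the capacity it buys grows by several orders of magnitude.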
What is Big Data?
• Big Data is when the volume, velocity, and variety of data reach the point where it is too difficult or too expensive for traditional systems to work with.
3Vs of Big Data
Traditional Large-Scale Computing System Problems
• Computation has been processor-bound
• Relatively small amounts of data
• Complex processing
• Need for bigger computers
• More memory, more/faster processors
Better Solution
• Distributed systems: multiple machines work on a single job
Problem of Distributed Systems
• Data is stored in a central location
• Data is copied to the processors at runtime
Today
• Total data size: petabytes
• Daily data: terabytes
We Need a New Solution
HADOOP
HADOOP
• Data is distributed when it is stored
SPARK
• Data is distributed in memory
RDBMS vs HADOOP
Hadoop
• Hadoop consists of two components
  • HDFS
  • MapReduce
• Hadoop ecosystem
  • Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
Traditional ETL
Source Layer (Structured Data) -> ETL/ELT -> DWH -> ETL/ELT -> Data Mart
Hadoop ETL
Source Layer: Structured Data, Unstructured Data
HADOOP, DWH, Data Mart
HDFS
• Hadoop Distributed File System: stores the data
• Data is split into blocks. 64 MB…
• Each block is replicated, e.g. 3 times. Replicas are stored on different nodes.
• Based on the Google File System
• Sits on top of a native file system such as ext3, ext4, or xfs
• No random writes allowed. Prefers large streaming reads.
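To make the block and replication figures above concrete, here is a minimal Python sketch that estimates how many blocks and how much raw storage a file would need. It assumes the 64 MB block size and replication factor of 3 from the slide; the function name `hdfs_footprint` and the 1000 MB file are hypothetical and chosen only for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size taken from the slide
REPLICATION = 3      # replication factor taken from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much space as the data it holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(blocks, replicas, raw_mb)  # 16 48 3000
```

A 1000 MB file is split into 16 blocks (15 full blocks plus one partial block), stored as 48 block replicas spread over the cluster.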
HDFS
HDFS
• hadoop fs -ls (user home directory)
• hadoop fs -ls / (root directory)
• hadoop fs -cat /user/zekeriya/deneme.txt
• hadoop fs -mkdir
• hadoop fs -rm -r veri
MapReduce
• Processes data in the Hadoop cluster
• Two stages: Map and Reduce
MAPREDUCE
map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Input:
(1000, 'Galatasaray sampiyon olur')
(2000, 'beşiktas sampiyon olur')
(2200, 'Galatasaray Türkiyedir')
MAPREDUCE
Mapper output
('Galatasaray', 1), ('sampiyon', 1), ('olur', 1), ('beşiktas', 1),
('sampiyon', 1), ('olur', 1), ('Galatasaray', 1), ('Türkiyedir', 1)
Intermediate data sent to the Reducer
('Galatasaray', [1, 1])
('sampiyon', [1, 1])
('olur', [1, 1])
('beşiktas', [1])
('Türkiyedir', [1])
Final output of the Reducer
('Galatasaray', 2)
('sampiyon', 2)
('olur', 2)
('beşiktas', 1)
('Türkiyedir', 1)
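The word count walk-through above can also be run locally. The following Python sketch simulates the three steps (map, group by key, reduce) in a single process on the same three input records. It is not Hadoop code: the `shuffle` function only stands in for the grouping the framework performs between the two stages, and all function names are illustrative. Note that 'olur' appears in two input lines, so it is counted twice.

```python
from collections import defaultdict

# Input records from the example: (key, line of text)
records = [
    (1000, "Galatasaray sampiyon olur"),
    (2000, "beşiktas sampiyon olur"),
    (2200, "Galatasaray Türkiyedir"),
]


def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in the line."""
    for word in input_value.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(output_key, intermediate_vals):
    """Sum the counts collected for one word."""
    return output_key, sum(intermediate_vals)


mapped = [pair for key, value in records for pair in map_fn(key, value)]
grouped = shuffle(mapped)
counts = dict(reduce_fn(word, ones) for word, ones in grouped.items())
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer receives only the keys assigned to it.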
Hadoop Ecosystem
• Hive
  • SQL-like query language
  • Users query data in the Hadoop cluster without knowing Java or MapReduce
• Pig
  • Uses a dataflow scripting language
• Impala
  • Open source project created by Cloudera
  • Query language very similar to HiveQL, but returns results much faster
Hadoop Ecosystem
• Flume
  • Imports data into HDFS as it is generated
  • Example: log files from a web server
• Sqoop
  • Imports data from tables in an OLTP database into HDFS
  • Populates database tables from files in HDFS
• Oozie
  • Developers create a workflow of MapReduce jobs
Hadoop Ecosystem
• HBase
  • "Hadoop database"
  • NoSQL datastore
  • Huge data store: GB, TB, PB
  • Query language: get/put/scan
  • Read/write throughput: millions of queries per second, versus thousands of queries per second for an RDBMS
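To give a feel for the get/put/scan access pattern mentioned above, here is a small in-memory Python sketch. It is a toy model, not the HBase client API: the class `ToyTable`, the row keys, and the column names are all hypothetical. It only mirrors two properties of HBase, namely that rows are kept sorted by row key and that a scan returns the rows in a key range, including the start key and excluding the stop key.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory table keyed by row key; not the HBase client API."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one cell."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return all cells of one row, or None if the row does not exist."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= row key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("user#002", "info:name", "Ayse")
table.put("user#001", "info:name", "Zekeriya")
table.put("video#001", "info:title", "Big Data")

print(table.get("user#001"))
print([key for key, _ in table.scan("user", "video")])
```

Because there is no query planner or join, every access is a direct lookup or a range read on the row key, which is what makes this access pattern easy to scale out.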
Big Data
• Finance: fraud detection, customer risk analysis
• Retail: product recommendation, purchases and discounts
• Advertising: more effective web ads
• Defense
• Telco
• Healthcare
Analyzing Twitter Data
• https://github.com/cloudera/cdh-twitter-example
Career Path
• Develop with Hadoop
• Hadoop Administration
• Hadoop for Data Scientists & Analysts
Zekeriya Beşiroğlu
http://zekeriyabesiroglu.com
http://twitter.com/zbesiroglu
http://bilginc.com
http://troug.org
mail to:zekeriyab@bilginc.com
zekeriyabesiroglu@gmail.com
