What Is Big Data? What Is Hadoop? What Is MapReduce? Big Data.

What is big data? What is Hadoop? The Hadoop ecosystem. In which industries can big data be used? Twitter analysis.

Published in: Software

  1. 1. Big Data Zekeriya Beşiroğlu http://zekeriyabesiroglu.com http://bilginc.com http://twitter.com/zbesiroglu
  2. 2. Zekeriya Beşiroğlu • Bilginc IT Academy, Expert Consultant • 16+ years in IT • 14+ years Oracle DB/DWH • 7+ years WebLogic • 3+ years big data • TROUG • Speaker
  3. 3. Bilginc IT Academy
  4. 4. Data Trends • Facebook has a data warehouse of around 60 PB, and it is constantly growing. • Twitter messages are 140 bytes each, generating 8 TB of data per day. • Data is more than doubling every year. • Almost 80% of data will be unstructured. • Amazon: 35% of product sales come from product recommendations.
  5. 5. New Types of Data • Sentiment: understand how your customers feel about your products or company • Sensor/Machine: discover patterns in data streaming automatically from sensors and machines • Unstructured: text, video, pictures • Server Logs: search logs to find patterns • Geographic: analyze location-based data • Clickstream: capture and analyze website visitors' data
  6. 6. Big Data https://www.youtube.com/watch?v=1GU4Imbo6R8
  7. 7. Capacity vs Cost
      Year   Capacity (GB)   Cost per GB (USD)
      1990   0.10            $4000
      1997   2               $150
      2002   80              $3.75
      2007   750             $0.35
      2012   3,000           $0.05
      2015   10,000          $0.02
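The figures on slide 7 can be turned into a quick calculation. The sketch below assumes that the Capacity column is the size of a typical drive in that year and that 3.000 and 10.000 mean 3,000 GB and 10,000 GB; the names `ROWS` and `drive_price` are illustrative only.

```python
# Rows from slide 7: (year, capacity in GB, cost per GB in USD).
# Assumption: 3.000 and 10.000 on the slide mean 3,000 GB and 10,000 GB.
ROWS = [
    (1990, 0.10, 4000.00),
    (1997, 2, 150.00),
    (2002, 80, 3.75),
    (2007, 750, 0.35),
    (2012, 3000, 0.05),
    (2015, 10000, 0.02),
]


def drive_price(capacity_gb, cost_per_gb):
    """Approximate price of one drive: capacity times cost per GB."""
    return capacity_gb * cost_per_gb


# Print the approximate price of a drive for each year in the table.
for year, capacity_gb, cost_per_gb in ROWS:
    print(year, round(drive_price(capacity_gb, cost_per_gb), 2))
```

Under these assumptions the price of a drive stays within a few hundred dollars while its capacity grows by five orders of magnitude.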
  8. 8. What is Big Data • Big data is when the volume, velocity, and variety of data reach the point where it is too difficult or too expensive for traditional systems to work with.
  9. 9. 3Vs of Big Data
  10. 10. Problems of Traditional Large-Scale Computing Systems • Computation has been processor-bound • Relatively small amounts of data • Complex processing • Need bigger computers • More memory, more and faster processors
  11. 11. Better Solution • Distributed systems: multiple machines work on a single job • Problem with distributed systems: data is stored in a central location and copied to the processors at runtime
  12. 12. Today • Total data size: petabytes • Daily data: terabytes • We need a new solution: Hadoop
  13. 13. Hadoop • Data is distributed when it is stored • Spark: data is distributed in memory
  14. 14. RDBMS vs HADOOP
  15. 15. Hadoop • Hadoop consists of two components: HDFS and MapReduce • Hadoop ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  16. 16. Traditional ETL: Source Layer, Structured Data, ETL/ELT, DWH, ETL/ELT, Data Mart • Hadoop ETL: Source Layer, Structured Data, Unstructured Data, Hadoop, DWH, Data Mart
  17. 17. HDFS • Hadoop Distributed File System: stores the data • Data is split into blocks. 64 MB… • Each block is replicated, e.g. 3 times; the replicas are stored on different nodes • Based on the Google File System • Runs on top of a native file system (ext3, ext4, xfs) • No random writes allowed; prefers large streaming reads
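To make the block and replica arithmetic of slide 17 concrete, here is a minimal sketch using the figures quoted there: 64 MB blocks and 3 replicas. The function `hdfs_footprint` and the 1 GB example file are hypothetical and only illustrate the counting; this is not an HDFS API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # example replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file.

    The last block may be only partly filled, so raw storage is the
    file size times the replication factor, not blocks times block size.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


# Hypothetical 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
```

For the 1024 MB file this gives 16 blocks, 48 block replicas spread over the cluster, and 3072 MB of raw storage.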
  18. 18. HDFS
  19. 19. HDFS • hadoop fs -ls (user home directory) • hadoop fs -ls / (root directory) • hadoop fs -cat /user/zekeriya/deneme.txt • hadoop fs -mkdir • hadoop fs -rm -r veri
  20. 20. MapReduce • Processes data in the Hadoop cluster • Two stages: Map and Reduce
  21. 21. MapReduce
      map(String input_key, String input_value):
          foreach word w in input_value:
              emit(w, 1)
      reduce(String output_key, Iterator<int> intermediate_vals):
          set count = 0
          foreach v in intermediate_vals:
              count += v
          emit(output_key, count)
      Input: (1000, 'Galatasaray sampiyon olur') (2000, 'beşiktas sampiyon olur') (2200, 'Galatasaray Türkiyedir')
  22. 22. MapReduce Output • Mapper: ('Galatasaray', 1), ('sampiyon', 1), ('olur', 1), ('beşiktas', 1), ('sampiyon', 1), ('olur', 1), ('Galatasaray', 1), ('Türkiyedir', 1) • Intermediate data sent to the Reducer: ('Galatasaray', [1,1]) ('sampiyon', [1,1]) ('olur', [1,1]) ('beşiktas', [1]) ('Türkiyedir', [1]) • Final output of the Reducer: ('Galatasaray', 2) ('sampiyon', 2) ('olur', 2) ('beşiktas', 1) ('Türkiyedir', 1)
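The word count on slides 21 and 22 can be simulated in a single process. The sketch below is plain Python, not Hadoop code; `shuffle` is a hypothetical stand-in for the grouping that the framework performs between the map and reduce stages, and the other names mirror the pseudocode.

```python
from collections import defaultdict

# Input records from slide 21: (key, line of text).
RECORDS = [
    (1000, "Galatasaray sampiyon olur"),
    (2000, "beşiktas sampiyon olur"),
    (2200, "Galatasaray Türkiyedir"),
]


def map_fn(input_key, input_value):
    """Map: emit (word, 1) for every word in the line."""
    for word in input_value.split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key (stand-in for the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(output_key, intermediate_vals):
    """Reduce: sum the counts collected for one word."""
    return output_key, sum(intermediate_vals)


# Run the three stages in order: map, shuffle, reduce.
mapped = [pair for key, value in RECORDS for pair in map_fn(key, value)]
grouped = shuffle(mapped)
counts = dict(reduce_fn(word, ones) for word, ones in grouped.items())
print(counts)
```

On the three input records this yields a count of 2 for 'Galatasaray', 'sampiyon', and 'olur', and 1 for 'beşiktas' and 'Türkiyedir'.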
  23. 23. Hadoop Ecosystem • Hive • SQL-like • Users query data in the Hadoop cluster without knowing Java or MapReduce • Pig • Uses a dataflow scripting language • Impala • Open source project created by Cloudera • Query language very similar to HiveQL; returns results much faster
  24. 24. Hadoop Ecosystem • Flume • Imports data into HDFS as it is generated • e.g. log files from a web server • Sqoop • Imports data from tables in an OLTP database into HDFS • Populates database tables from files in HDFS • Oozie • Developers create a workflow of MapReduce jobs
  25. 25. Hadoop Ecosystem • HBase • The Hadoop database • NoSQL datastore • Huge data store: GB, TB, PB • Query language: get/put/scan • Read/write throughput: millions of queries per second, versus thousands of queries per second for an RDBMS
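The get/put/scan access pattern named on slide 25 can be illustrated with a toy in-memory store. The class `ToyTable` below is hypothetical and is not the HBase client API; it only assumes that rows are addressed by key and that a scan walks a key range in sorted order. The row keys are made up for the example.

```python
import bisect


class ToyTable:
    """Toy key-value table with put, get, and range scan over sorted keys."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        """Insert or overwrite the value stored under row_key."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def get(self, row_key):
        """Return the value for row_key, or None if the row is absent."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield (row_key, value) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put("user#002", "beta")
table.put("user#001", "alpha")
table.put("video#001", "clip")
print(table.get("user#001"))
print(list(table.scan("user#", "user#~")))
```

Because the keys are kept sorted, the scan returns only the two `user#` rows, in key order, without touching the `video#` row.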
  26. 26. Big Data • Finance: fraud detection, customer risk analysis • Retail: product recommendation, purchasing and discounts • Advertising: more effective web ads • Defense • Telco • Healthcare
  27. 27. Analyzing Twitter Data • https://github.com/cloudera/cdh-twitter-example
  28. 28. Career Path • Develop with Hadoop • Hadoop Administration • Hadoop for Data Scientists & Analysts
  29. 29. Zekeriya Beşiroğlu http://zekeriyabesiroglu.com http://twitter.com/zbesiroglu http://bilginc.com http://troug.org mail to:zekeriyab@bilginc.com zekeriyabesiroglu@gmail.com
