Your SlideShare is downloading. ×
Cloud computing era
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Cloud computing era

6,065
views

Published on


0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
6,065
On Slideshare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
41
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Cloud Computing EraTrend Micro
  • 2. Three Major Trends to Chang the World Cloud Computing Big Data Mobile
  • 3. What is Cloud Computing? National Institute of Standards and Technology (NIST) definition : Essential Characteristics Service Models Deployment Models Provide scalable and elastic IT relative functions to users via as-a- service business models and internet technologies
  • 4. It’s About the Ecosystem Structured, Semi-structured Cloud Computing SaaS Enterprise Data Warehouse PaaS IaaS Generate Big Data Lead Business Insights create Competition, Innovation, Productivity
  • 5. What is Big Data?
  • 6. What is the problem• Getting the data to the processors becomes the bottleneck• Quick calculation – Typical disk data transfer rate: • 75MB/sec – Time taken to transfer 100GB of data to the processor: • approx. 22 minutes!
  • 7. The Era of Big Data – Are You Ready Data for business commercial analysis • 2011: multi-terabyte (TB) • 2020: 35.2 ZB (1 ZB = 1 billion TB)
  • 8. Who Needs It? Enterprise Database HadoopWhen to use? When to use?• Ad-hoc Reporting (<1sec) • Affordable Storage/Compute• Multi-step Transactions • Unstructured or Semi-structured• Lots of Inserts/Updates/Deletes • Resilient Auto Scalability
  • 9. Hadoop!
  • 10. – inspired by• Apache Hadoop project – inspired by Googles MapReduce and Google File System papers.• Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware• Open Source Software + Hardware Commodity – IT Costs Reduction
  • 11. Hadoop Core MapReduce HDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 12. HDFS• Hadoop Distributed File System• Redundancy• Fault Tolerant• Scalable• Self Healing• Write Once, Read Many Times• Java API• Command Line Tool © 2011 Cloudera, Inc. All Rights Reserved.
  • 13. MapReduce • Two Phases of Functional Programming • Redundancy • Fault Tolerant • Scalable • Self Healing • Java API13 © 2011 Cloudera, Inc. All Rights Reserved.
  • 14. Hadoop Core JavaJava MapReduce HDFS Java Java 14 © 2011 Cloudera, Inc. All Rights Reserved.
  • 15. Word Count Example Key: offset Value: line Key: word Key: word Value: count Value: sum of count0:The cat sat on the mat22:The aardvark sat on the sofa
  • 16. The Hadoop Ecosystems
  • 17. The Ecosystem is the System• Hadoop has become the kernel of the distributed operating system for Big Data• No one uses the kernel alone• A collection of projects at Apache
  • 18. Relation Map Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 19. Zookeeper – Coordination Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 20. What is ZooKeeper• A centralized service for maintaining – Configuration information – Providing distributed synchronization• A set of tools to build distributed applications that can safely handle partial failures• ZooKeeper was designed to store coordination data – Status information – Configuration – Location information
  • 21. Flume / Sqoop – Data Integration Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 22. What’s the problem for data collection• Data collection is currently a priori and ad hoc• A priori – decide what you want to collect ahead of time• Ad hoc – each kind of data source goes through its own collection path
  • 23. (and how can it help?)• A distributed data collection service• It efficiently collecting, aggregating, and moving large amounts of data• Fault tolerant, many failover and recovery mechanism• One-stop solution for data collection of all formats
  • 24. Flume ArchitectureLog Log ...Flume Node Flume NodeHDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 25. Flume Sources and Sinks• Local Files• HDFS• Stdin, Stdout• Twitter• IRC• IMAP © 2011 Cloudera, Inc. All Rights Reserved.
  • 26. Sqoop• Easy, parallel database import/export• What you want do? – Insert data from RDBMS to HDFS – Export data from HDFS back into RDBMS
  • 27. Sqoop HDFS Sqoop RDBMS28 © 2011 Cloudera, Inc. All Rights Reserved.
  • 28. Sqoop Examples $ sqoop import --connect jdbc:mysql://localhost/world -- username root --table City ... $ hadoop fs -cat City/part-m-00000 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He rat,AFG,Herat,1868004,Mazar-e- Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200 ...29 © 2011 Cloudera, Inc. All Rights Reserved.
  • 29. Pig / Hive – Analytical Language Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 30. Why Hive and Pig?• Although MapReduce is very powerful, it can also be complex to master• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code• Many organizations have programmers who are skilled at writing code in scripting languages• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce – Hive was initially developed at Facebook, Pig at Yahoo!
  • 31. Hive – Developed by• What is Hive? – An SQL-like interface to Hadoop• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop – MapRuduce for execution – HDFS for storage• Hive Query Language – Basic-SQL : Select, From, Join, Group-By – Equi-Join, Muti-Table Insert, Multi-Group-By – Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
  • 32. Hive SQL Hive MapReduce33 © 2011 Cloudera, Inc. All Rights Reserved.
  • 33. Pig – Initiated by• A high-level scripting language (Pig Latin)• Process data one step at a time• Simple to write MapReduce program• Easy understand• Easy debug A = load ‘a.txt’ as (id, name, age, ...) B = load ‘b.txt’ as (id, address, ...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
  • 34. Pig Script Pig MapReduce © 2011 Cloudera, Inc. All Rights Reserved.
  • 35. Hive vs. Pig Hive PigLanguage HiveQL (SQL-like) Pig Latin, a scripting languageSchema Table definitions A schema is optionally defined that are stored in a at runtime metastoreProgrammait Access JDBC, ODBC PigServer
  • 36. WordCount Example• Input Hello World Bye World Hello Hadoop Goodbye Hadoop• For the given sample input the map emits < Hello, 1> < World, 1> < Bye, 1> < World, 1> < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>• the reduce just sums up the values < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
  • 37. WordCount Example In MapReducepublic class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } }}public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); }}public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true);}
  • 38. WordCount Example By PigA = LOAD wordcount/input USING PigStorage as (token:chararray);B = GROUP A BY token;C = FOREACH B GENERATE group, COUNT(A) as count;DUMP C;
  • 39. WordCount Example By HiveCREATE TABLE wordcount (token STRING);LOAD DATA LOCAL INPATH ’wordcount/inputOVERWRITE INTO TABLE wordcount;SELECT count(*) FROM wordcount GROUP BY token;
  • 40. The Story So Far SQL Hive Pig Script Java MapReduce Java HDFS Sqoop Flume SQL RDBMS FS Posix41 © 2011 Cloudera, Inc. All Rights Reserved.
  • 41. Hbase – Column NoSQL DB Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 42. Structured-data vs Raw-data
  • 43. I – Inspired by• Coordinated by Zookeeper• Low Latency• Random Reads And Writes• Distributed Key/Value Store• Simple API – PUT – GET – DELETE – SCAN
  • 44. Hbase – Data Model• Cells are “versioned”• Table rows are sorted by row key• Region – a row range [start-key:end-key]
  • 45. HBase Exampleshbase> create mytable, mycf‘hbase> listhbase> put mytable, row1, mycf:col1, val1‘hbase> put mytable, row1, mycf:col2, val2‘hbase> put mytable, row2, mycf:col1, val3‘hbase> scan mytable‘hbase> disable mytable‘hbase> drop mytable © 2011 Cloudera, Inc. All Rights Reserved.
  • 46. Oozie – Job Workflow & Scheduling Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 47. What is ?• A Java Web Application• Oozie is a workflow scheduler for Hadoop• Crond for Hadoop• Triggered Job 1 Job 2 – Time – Data Job 3 Job 4 Job 5
  • 48. Oozie Features• Component Independent – MapReduce – Hive – Pig – SqoopStreaming © 2011 Cloudera, Inc. All Rights Reserved.
  • 49. Mahout – Data Mining Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 50. What is• Machine-learning tool• Distributed and scalable machine learning algorithms on the Hadoop platform• Building intelligent applications easier and faster
  • 51. Mahout Use Cases• Yahoo: Spam Detection• Foursquare: Recommendations• SpeedDate.com: Recommendations• Adobe: User Targetting• Amazon: Personalization Platform © 2011 Cloudera, Inc. All Rights Reserved.
  • 52. Hue – developed by• Hadoop User Experience• Apache Open source project• HUE is a web UI for Hadoop• Platform for building custom applications with a nice UI library
  • 53. Hue• HUE comes with a suite of applications – File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. – Job Browser: View jobs, tasks, counters, logs, etc. – Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
  • 54. Hue: File Browser UI
  • 55. Hue: Beewax UI
  • 56. Use case Example• Predict what the user likes based on – His/Her historical behavior – Aggregate behavior of people similar to him
  • 57. ConclusionToday, we introduced:• Why Hadoop is needed• The basic concepts of HDFS and MapReduce• What sort of problems can be solved with Hadoop• What other projects are included in the Hadoop ecosystem
  • 58. Recap – Hadoop Ecosystem Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 59. Trend Micro Smart ProtectionNetwork (SPN) Case Study
  • 60. Collaboration in the underground
  • 61. Network Threats Shows Explosive GrowthThreats on the network like variants of the virus, spams, unknown downloadsources escapes the detection of the traditional security system and continues toshow explosive growth. New Unique Malware Discovered 1M unique Malwares every month
  • 62. New Design Concept for Threat Intelligence CDN / xSP Human Intelligence Honeypot Web Crawler Trend Micro Mail Protection Trend Micro Trend Micro Endpoint Protection Web Protection 150M+ Worldwide Endpoints/Sensors
  • 63. Challenges We Are Faced The Concept is Great but …. 6TB of data and 15B lines of logs received daily by It becomes the Big Data Challenge!
  • 64. Issues to Address Threat Raw Data Information Intelligence/Solution Volume: Infinite Time: No Delay Target: Keep Changing Threats
  • 65. SPN Feedback SPAM CDN Log SPN High Level Architecture HTTP POST L4 Log Log Receiver Receiver L4 Log Post Log Post Log Post Processing Processing ProcessingSPN infrastructure Adhoc-Query (Pig) MapReduce HBase Circus Hadoop Distributed File System (Ambari) (HDFS) Feedback Information Message BusApplication Email Reputation Service Web Reputation File Reputation Service Service
  • 66. Trend Micro Big Data process capacityDaily amount of SPN data to be processed• 8.5 billions Web Reputation queries• 3 billions Email Reputation queries• 7 billions File Reputation queries• Process 6 TB worldwide raw logs• 150 millions End-point connections
  • 67. Trend Micro: Web Reputation Services Technology Process Operation Trend Micro User Traffic | HoneypotProducts / Technology 8 billions/day Akamai 40% filteredCDN Cache 4.8 billions/day Rating Server for Known 15 MinutesHigh Throughput Web Service Threats 82% filtered Unknown & PrefilterHadoop Cluster 860 millions/day Page DownloadWeb Crawling 99.98% filtered Threat AnalysisMachine LearningData Mining 25,000 malicious URL /dayBlock malicious URL within 15 minutes once it goes online!
  • 68. Big Data Cases
  • 69. Google vs. Hadoop EcosystemChubby vs. ZookeeperMapReduce vs. MapReduceBigTable vs. HBaseGFS vs. HDFS Ref: Google Cluster
  • 70. Pioneer of Big Data Infrastructure – Google
  • 71. Hbase use Case@Facebook - Messages HBase Use Cases @ Facebook Facebook Insights Operational Data Store Self-service More Analytics/Hashout apps Messages Hashout Site Integrity 2010 2011 2012 2013 Social Graph Search Indexing Realtime Hive Updates Cross-system Tracing … and more
  • 72. Flagship App:Facebook MessagesMonthly data volume prior to launch• Monthly data volume prior to launch 15B x 1,024 bytes = 14TB 120B x 100 bytes = 11TB
  • 73. Facebook Messages Nowbook Messages NOW •StatsQuick Stats Messages Chats – 11B+ messages/day • 90B+ data accesses+ messages/day • Peak:1.5M ops/sec0B+ data~55% Read, 45% Write • accesseseak: 1.5M ops/sec55%Rd, 45% Wr data – 20PB+ of total Emails SMS • Grows 400TB/monthB+ of total datarows 400TB/month
  • 74. Facebook Messages:Requirements• Very High Write Volume – Previously, chat was not persisted to disk• Ever-growing data sets(Old data rarely gets accessed)• Elasticity & automatic failover• Strong consistency within a single data center• Large scans/map-reduce support for migrations & schema conversions• Bulk import data
  • 75. Physical Multi-tenancy• Real-time Ads Insights – Real-time analytics for social plugins on top of Hbase – Publishers get real-time distribution/engagement metrics: • # of impressions, likes • analytics by domain/URL/demographics and time periods – Uses HBase capabilities: • Efficient counters (single-RPC increments) • TTL for purging old data – Needs massive write throughput & low latencies • Billions of URLs • Millions of counter increments/second• Operational Data Store
  • 76. Facebook Open Source Stack• Memcached --> App Server Cache• ZooKeeper --> Small Data Coordination Service• HBase --> Database Storage Engine• HDFS --> Distributed FileSystem• Hadoop --> Asynchronous Map-Reduce Jobs
  • 77. Questions?
  • 78. Thank you!

×