Big Data: an introduction


Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted towards an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.

Published in: Data & Analytics


Transcript

  • 1. Big Data: an introduction. Dr. ir. ing. Bart Vandewoestyne, Sizing Servers Lab, Howest, Kortrijk. March 28, 2014.
  • 2. Outline: 1. Introduction: Big Data? 2. Big Data Technology 3. Big Data in my company? 4. IWT TETRA project 5. Conclusions
  • 3. Outline, part 1: Introduction: Big Data?
  • 4. Exponential growth of data. [Figure, © 2013 International Business Machines Corporation: “Big Data: This is just the beginning”. Data volume in exabytes (axis from 3000 to 9000) from 2010 to 2015, plotted together with the percentage of uncertain data (0 to 100), for Sensors & Devices, VoIP, Enterprise Data and Social Media, with a “You are here” marker.]
  • 5. Big Data definition. The definition of Big Data depends on who you ask: “Multiple terabytes or petabytes.” (according to some professionals); “I don’t know.” (today’s big may be tomorrow’s normal); “Relative to its context.”
  • 6. Quotes on Big Data. “Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs. It’s about recognizing that for some problems other storage solutions are better suited.
  • 7. The Three V’s. Volume: the amount of data is big. Variety: different kinds of data (structured, semi-structured, unstructured). Velocity: speed issues to consider. How fast is the data available for analysis? How fast can we do something with it? Other V’s: Veracity, Variability, Validity, Value, ...
  • 8. Structured data. A pre-defined schema is imposed on the data; the data is highly structured and usually stored in a relational database system. Examples: numbers (20, 3.1415, ...), dates (21/03/1978), strings (“Hello World”), ... Roughly 20% of all data out there is structured.
  • 9. Semi-structured data. Inconsistent structure; cannot be stored in rows and tables in a typical database; information is often self-describing (label/value pairs). Examples: XML, SGML, ...; BibTeX files; logs; tweets; sensor feeds; ...
  • 10. Unstructured data. Definition: data that lacks structure, or parts of which lack structure. Examples: multimedia (videos, photos, audio files, ...), email messages, free-form text, word processing documents, presentations, reports, ... Experts estimate that 80 to 90% of the data in any organization is unstructured.
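The three kinds of data described in slides 8 to 10 can be contrasted in a few lines of Python. This is only an illustrative sketch: the column names, the JSON events and the free-text note are all made up for the example and do not come from the presentation.

```python
import json

# Structured: every record follows one pre-defined schema (hypothetical columns).
SCHEMA = ("id", "name", "birth_date")
rows = [
    (1, "Alice", "1978-03-21"),
    (2, "Bob", "1985-11-02"),
]

# Semi-structured: self-describing label/value pairs; fields can differ per record.
events = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "upload", "size_mb": 12.5}',
]

# Unstructured: free-form text without any labels.
note = "Met Bob on Friday; he will send the report next week."


def fields_per_record(json_lines):
    """Return the sorted field names found in each JSON record."""
    return [sorted(json.loads(line)) for line in json_lines]


if __name__ == "__main__":
    # Every structured row has exactly the columns of the schema.
    print(all(len(row) == len(SCHEMA) for row in rows))
    # The semi-structured records carry their own, varying, field names.
    print(fields_per_record(events))
```

The structured rows only make sense together with the external schema, while each semi-structured record names its own fields, which is why such records do not fit a fixed table without extra work.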
  • 11. Data Storage and Analysis. Storage capacity of hard drives has increased massively over the years; access speeds have not kept up. Example (reading a whole disk):
      Year   Storage capacity   Transfer speed   Time to read
      1990   1370 MB            4.4 MB/s         ≈ 5 minutes
      2010   1 TB               100 MB/s         > 2.5 hours
    Solution: work in parallel! Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes.
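The numbers on slide 11 can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and an idealized parallel read with no seek or coordination overhead; the function name is chosen for illustration only.

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption for this example


def read_time_seconds(capacity_mb, speed_mb_per_s, drives=1):
    """Time to read the full capacity when it is split evenly over
    `drives` disks that are read in parallel (idealized, no overhead)."""
    return (capacity_mb / drives) / speed_mb_per_s


t_1990 = read_time_seconds(1370, 4.4)                        # single 1990 drive
t_2010 = read_time_seconds(MB_PER_TB, 100)                   # single 2010 drive
t_parallel = read_time_seconds(MB_PER_TB, 100, drives=100)   # 100 drives

if __name__ == "__main__":
    print(f"1990: {t_1990 / 60:.1f} minutes")
    print(f"2010: {t_2010 / 3600:.2f} hours")
    print(f"2010, 100 drives: {t_parallel:.0f} seconds")
```

Under these assumptions the single 2010 drive needs 10,000 seconds, and spreading the same terabyte over 100 drives brings that down to 100 seconds, consistent with the "less than 2 minutes" on the slide.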
  • 12. Working in parallel. Problems: (1) hardware failure? (2) combining data from different disks for analysis? Solutions: (1) HDFS: Hadoop Distributed Filesystem; (2) MapReduce: programming model.
  • 13. Outline, part 2: Big Data Technology
  • 14. Big Data Landscape
  • 15. Hadoop. Hadoop is VMware, but the other way around.
  • 16. Hadoop as the opposite of a virtual machine. VMware: (1) take one physical server, (2) split it up, (3) get many small virtual servers. Hadoop: (1) take many physical servers, (2) merge them all together, (3) get one big, massive, virtual server.
  • 17. Hadoop: core functionality. HDFS: self-healing, high-bandwidth, clustered storage. MapReduce: distributed, fault-tolerant resource management, coupled with scalable data processing.
  • 18. HDFS architecture
  • 19. MapReduce
  • 20. MapReduce
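The MapReduce slides are figures whose content did not survive in this transcript. As a rough illustration of the programming model, here is a single-process word count in plain Python. It is a sketch under stated assumptions: it does not use Hadoop or any Hadoop API, all function names are hypothetical, and a real cluster would run the map and reduce steps on many machines with the shuffle happening over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools for big data"]))
```

Because each call to `map_phase` looks at one document only and each call to `reduce_phase` looks at one key only, both steps can be spread over many machines, which is the point of the model.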
  • 21. Apache Hadoop essentials: technology stack
  • 22. Pig. MapReduce requires programmers to think in terms of map and reduce functions and, more than likely, to use the Java language. Pig provides a high-level language (Pig Latin) that can be used by analysts, data scientists, statisticians, etc.
  • 23. Hive. Originated at Facebook to analyze log data. HiveQL: Hive Query Language, similar to standard SQL. Queries are compiled into MapReduce jobs. Has a command-line shell, similar to e.g. the MySQL shell.
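To give a feel for what a SQL-like query buys over hand-written map and reduce functions, the sketch below runs an ordinary SQL aggregation over a small log table. Note the assumptions: it uses Python's built-in `sqlite3` module as a stand-in and is not HiveQL or Hive; the table and column names are invented. In Hive, a query of this general shape would be compiled into MapReduce jobs rather than executed in-process.

```python
import sqlite3


def hits_per_status(log_rows):
    """Count log entries per HTTP status with one declarative SQL query.

    `log_rows` is an iterable of (url, status) tuples; the table name
    `access_log` is a hypothetical example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (url TEXT, status INTEGER)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", log_rows)
    result = conn.execute(
        "SELECT status, COUNT(*) FROM access_log GROUP BY status ORDER BY status"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(hits_per_status([("/", 200), ("/a", 404), ("/b", 200)]))
```

The grouping and counting that took three functions in a map/reduce style is expressed here in a single `GROUP BY`, which is the convenience the Hive slide refers to.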
  • 24. Example Hadoop distributions
  • 25. NoSQL
  • 26. RDBMS: Codd’s 12 rules. A set of rules designed to define what is required from a database management system in order for it to be considered relational. Rule 0: the Foundation rule. Rule 1: the Information rule. Rule 2: the guaranteed access rule. Rule 3: systematic treatment of null values. Rule 4: active online catalog based on the relational model. ...
  • 27. ACID. A set of properties that guarantee that database transactions are processed reliably. Atomicity: a transaction is all or nothing. Consistency: only transactions with valid data. Isolation: simultaneous transactions will not interfere. Durability: written transaction data stays there “forever” (even in case of power loss, crashes, errors, ...).
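Atomicity ("all or nothing") and consistency ("only valid data") from the ACID slide can be demonstrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, the names and the amounts are made up, and the `CHECK` constraint is simply one way to define what "valid data" means here.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    conn = make_db()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

In the second transfer the credit to the destination account has already been executed when the debit fails, yet the balances are unchanged afterwards: the transaction was undone as a whole.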
  • 28. Scaling up. What if you need to scale up your RDBMS in terms of dataset size or read/write concurrency? This usually involves breaking Codd’s rules, loosening ACID restrictions, forgetting conventional DBA wisdom, and losing most of the desirable properties that made RDBMS so convenient in the first place. NoSQL to the rescue!
  • 29. NoSQL. ‘Invented’ by Carlo Strozzi in 1998 (for his file-based database). “Not only SQL”. It’s NOT about saying that SQL should never be used, or saying that SQL is dead.
  • 30. NoSQL databases. Four emerging NoSQL categories (shown in a figure on the slide).
  • 31. Use the right tool for the right job! http://db-engines.com/
  • 32. Outline, part 3: Big Data in my company?
  • 33. Typical RDBMS scaling story. 1. Initial public launch: move from a local workstation to a remotely hosted MySQL instance. 2. Service popularity ↑, too many reads hitting the database: add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire. 3. Popularity ↑↑, too many writes hitting the database: scale MySQL vertically by buying a beefed-up server (16 cores, 128 GB of RAM, banks of 15k RPM hard drives). Costly.
  • 34. Typical RDBMS scaling story (continued). 4. New features → query complexity ↑, now too many joins: denormalize your data to reduce joins. (That’s not what they taught me in DBA school!) 5. Rising popularity swamps the server; things are too slow: stop doing any server-side computations.
  • 35. Typical RDBMS scaling story (continued). 6. Some queries are still too slow: periodically prematerialize the most complex queries, and try to stop joining in most cases. 7. Reads are OK, writes are getting slower and slower: drop secondary indexes and triggers (no indexes?). If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase.
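Step 2 of the scaling story above notes that once a cache sits in front of the database, reads are no longer strictly ACID and cached data must expire. The sketch below shows why, using a tiny in-process cache with a time-to-live. It is an assumption-laden stand-in: it is not memcached and does not use any memcached client API; the class name, the `loader` callback and the injectable `clock` are all invented for the example.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache whose entries expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._store = {}            # key -> (expiry_time, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call `loader(key)` on a miss or expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # hit: may be stale relative to the database
        value = loader(key)         # miss or expired: go to the database
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    database = {"greeting": "hello"}            # hypothetical backing store
    cache = TTLCache(ttl_seconds=0.05)
    print(cache.get_or_load("greeting", database.get))   # loaded from the database
    database["greeting"] = "goodbye"                     # the database changes
    print(cache.get_or_load("greeting", database.get))   # still the old cached value
    time.sleep(0.1)                                      # wait for the entry to expire
    print(cache.get_or_load("greeting", database.get))   # reloaded from the database
```

Between the write and the expiry, readers see the old value. That window is the price paid for taking read load off the database, and the time-to-live bounds how long it can last.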
  • 36. Use-cases of Big Data. ‘Core Big Data’ company: Big Data crunching, hacking, processing, analyzing, ... ‘General Big Data’ company: business analytics (improve decision-making, gain operational insights, increase overall performance, track and analyze shopping patterns, ...). Both: Explore! Discover hidden gems!
  • 37. Some examples. Intrusion detection based on server log data; real-time security analytics; fraud detection; customer behavior based sentiment analysis of social media; campaign analytics.
  • 38. Big Data in your company
  • 39. Outline, part 4: IWT TETRA project
  • 40. IWT TETRA project. Project title: “Data mining: van relationele database naar Big Data” (Data mining: from relational database to Big Data). Dates: submitted 12/03/2014; notification of acceptance July 2014; runs from 01/10/2014 to 01/10/2016. People involved: Wannes De Smet (researcher), Bart Vandewoestyne (researcher), Johan De Gelas (project coordinator). Interested? → Come talk to us!
  • 41. Project plan, work packages. RDBMS vs. Distributed Processing; Technology Choice; MapReduce & Alternatives; Big Data Stack Analysis; BI Optimization; Distributed Processing Optimization; Infrastructure & Cloud Analysis; Dissemination.
  • 42. WP1: RDBMS vs. Distributed Processing. Key question: when to switch from a ‘traditional’ technology to ‘Big Data’ technology? Evaluate traditional database systems (Virtuoso, VoltDB, ...). Find their limitations. Strengths? Weaknesses?
  • 43. WP2: Analyse Big Data technology stack. Key idea: get acquainted with Hadoop and its most important software components. Find the best way to set up, administer and use Hadoop. Get familiar with the most important software components (Pig, Hive, HBase, ...). Find out how easy it is to integrate Hadoop into existing architectures.
  • 44. WP3: Alternatives for MapReduce. Key question: what are valuable alternatives for MapReduce? Faster querying (compared to Pig & Hive); lightning-fast cluster computing; distributed and fault-tolerant realtime computation: Apache Storm.
  • 45. WP4: BI optimization. Key questions: where can existing BI solutions be optimized? How can current BI solutions interact with Big Data technology? Virtuoso, MS SQL Server 2014, VoltDB, ...; Apache Sqoop.
  • 46. WP5: Distributed Processing optimization. Key question: where can Big Data technology be performance tuned? How is the data stored? Optimal settings for Hadoop, MapReduce, ... Benchmarks such as TestDFSIO, TeraSort, NNBench, MRBench, ...
  • 47. WP6: Infrastructure & Cloud analysis. Key question: what hardware best fits the (Big Data) needs? Perform hardware monitoring. Analyze cloud solutions. Formulate best practices. Give advice on hardware choice.
  • 48. WP7: Dissemination & project follow-up. Key idea: spread the message! Document case studies. Prepare for education. Presentations at events. Blogs, articles, ... Workshops.
  • 49. Outline, part 5: Conclusions
  • 50. Conclusions. “Big” can be small too. The Big Data landscape is huge. The right tool for the right job! We can help → advice, case studies. Your company can benefit from Big Data technology. Be brave in your quest...
  • 51. Questions? johan@sizingservers.be, wannes@sizingservers.be, bart@sizingservers.be