Have you ever heard the buzzword "big data"? Big data is briefly described as collecting massive amounts of data, extracting both the small details and the larger trends hidden in it, summarizing the output, and generating important insights about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started shopping for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems on a daily basis? There are a few options, but very few that can match the popularity of Hadoop. Hadoop can store and process large amounts of data. It has a large and diverse toolset for integrations, operations and processing, and it is open source!
2. • Tarjei Romtveit
• Co-founder of Monokkel AS
• Former CTO – Integrasco AS
• My story with Hadoop
www.monokkel.io
3. • Managing Director of Monokkel AS
• Former COO of Integrasco AS
• Persistence, Processing and Presentation of data
Persistence – Processing – Presentation
4. Bombshell
If you work with data today and do not start learning the Hadoop ecosystem, you may be unemployed soon
5. Agenda
• Context – Big Data and how to handle it
• What is Hadoop?
• Demo
• Distributions and/or demo
• “Deep dive” into Hadoop - Architecture
– HDFS
– YARN
– MapReduce
• Languages and ecosystem
6. What we will not cover
• Security
• Integrations with database X or system Y
• Running Hadoop in production
20. How can the CEO manage his problem?
• Get control over the data
• Implement analytical processes to aid sales
22. The data he needs to handle
• Volume – Gigabytes/Terabytes
• Variety – Click stream, voice, emails, sensor data, social data, different languages, timestamp data, transactional data, third-party data
• Variability – Varying quality
• Velocity – MB per second
23. The data he needs to handle
• Veracity / Data quality – Inconsistent data quality
• Complexity – Many legacy domain models
24. How to handle?
[Diagram: Web, Emails, Sensors, Social → Processing → RDBMS, Search]
25. How to understand?
[Diagram: Web, Emails, Sensors, Social → Processing → RDBMS, Search]
31. Distributions
• “Stable” compilation of the Hadoop Ecosystem
• Operational tools
• Integration tools and frameworks
• Data governance and data management tools
• Security
33. HADOOP
An operating system for data
In layman’s terms:
• Store huge (unstructured) files on many machines
• Query and modify data
• Run sophisticated analytics on top
34. How to start:
Alt 1
• https://hadoop.apache.org/
• Getting Started
• Download
• Unzip
• bin/hadoop <command line arguments>
Alt 2
• http://hortonworks.com/products/hortonworks-sandbox/#install
• Install VMWare Player or VirtualBox
• Download image (6 GB)
• Install and run (give it lots of memory)
35. DEMO
– Transform and modify data
– Machine learning with Spark
– Integrate with ElasticSearch (see the sketch below)
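As a taste of the ElasticSearch integration, here is a minimal sketch using the elasticsearch-hadoop connector's Java Spark API. The localhost node, the demo/talks index and the toy records are assumptions for illustration, not part of the demo itself.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsIndexingSketch {
  public static void main(String[] args) {
    // "es.nodes" points the connector at the ElasticSearch cluster
    // (localhost is an assumption for a sandbox setup).
    SparkConf conf = new SparkConf()
        .setAppName("EsIndexingSketch")
        .set("es.nodes", "localhost");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Two toy documents; any Map-like records can be indexed.
    Map<String, Object> doc1 = new HashMap<>();
    doc1.put("title", "hadoop intro");
    doc1.put("views", 42);
    Map<String, Object> doc2 = new HashMap<>();
    doc2.put("title", "yarn deep dive");
    doc2.put("views", 7);
    JavaRDD<Map<String, Object>> docs = sc.parallelize(Arrays.asList(doc1, doc2));

    // Writes each record as a document into the hypothetical demo/talks index.
    JavaEsSpark.saveToEs(docs, "demo/talks");
    sc.stop();
  }
}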
NEXT: ARCHITECTURE AND HOW IT WORKS
48. HDFS Delete/Update
• HDFS blocks are immutable: you cannot change them!
• Deletes and updates are written as new blocks
• The name node takes care of overwriting deleted blocks
• Small files consume a lot of name node memory
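Because blocks are write-once, even a simple "update" through the HDFS Java API amounts to deleting the old file and writing new blocks. A minimal sketch, assuming a reachable HDFS cluster and a hypothetical path /tmp/example.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster configuration on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/example.txt");

    // create() writes fresh blocks; existing blocks are never edited in place.
    try (FSDataOutputStream out = fs.create(p, true /* overwrite */)) {
      out.writeUTF("version 1");
    }

    // An "update" is really: drop the old file, then write new blocks.
    fs.delete(p, false /* not recursive */);
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.writeUTF("version 2");
    }
    fs.close();
  }
}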
54. YARN
[Diagram: ResourceManager (with Scheduler and ApplicationsManager), Node Managers on data nodes D1 and D2, and an Application Master. AM to RM: “document1” is located on D1 and D2 and I need X GB RAM]
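The Node Managers' heartbeats are what give the ResourceManager its global view of the cluster. As a small illustration, this sketch uses the YARN client API to ask the ResourceManager for the resource situation on every running node; it assumes a reachable cluster configuration on the classpath.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Each report reflects what the Node Manager last told the ResourceManager.
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.printf("%s: used=%s of %s%n",
          node.getNodeId(), node.getUsed(), node.getCapability());
    }
    yarnClient.stop();
  }
}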
57. YARN + HDFS
• Blocks are immutable, which enables high write speeds
• Data is schema-free! You can store any data you want
• Data locality is what differentiates HDFS from other data storage
• You can read massive amounts of data, limited only by disk read speeds
70. How to run
$ bin/hadoop jar wc.jar WordCount /hdfs/dir/in /hdfs/dir/out
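For reference, a minimal WordCount class matching the command above; this is essentially the canonical example from the Hadoop MapReduce tutorial:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // /hdfs/dir/in
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // /hdfs/dir/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}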
71. MapReduce
• Mappers and reducers are distributed in YARN containers
• Chaining MapReduce jobs makes them slow
• Easy to scale but difficult to code
• … use the data DSL languages instead
76. Hive/Drill/Spark SQL
• Declarative / SQL-like languages
• Great for
• Column data / database dumps
• Aggregations
• Connecting BI tools and dashboards
• A data warehouse for Hadoop++
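As an illustration of how that BI-style access works, a minimal sketch that queries HiveServer2 over JDBC; the localhost:10000 endpoint and the clicks table are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; assumes a server running on localhost:10000.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // "clicks" is a hypothetical table used for illustration.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS hits FROM clicks GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("hits"));
      }
    }
  }
}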
77. Spark
• Core engine (runs in YARN or standalone)
• Great for
• Anything that MapReduce can do
• Analytics, machine learning
• In-memory, with APIs in Java, Scala and Python
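The same word count from the MapReduce example shrinks to a few lines in Spark. A minimal sketch using the Java API, assuming Spark 2.x (where flatMap returns an iterator):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCountSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCountSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);  // e.g. an HDFS path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);                // kept in memory where possible
    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}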
78. Summary
• Hadoop is designed to handle/process massive amounts of data through HDFS and/or YARN
• The data does not need to be structured before it is stored in HDFS
• Hadoop is an ecosystem and has languages/frameworks for data extraction, data management, data analysis and data integration
• It is most convenient to begin with Hadoop by testing distributions, e.g. Hortonworks, Cloudera, MapR etc.
• Learn MapReduce, and learn to understand the languages and a few integration tools
- Started out building distributed systems for large amounts of data
Jumped on Hadoop when it was supposed to solve every problem, around 2009/2010
Jumped off again
Jumping back on now
How many work with data?
How many have worked with Hadoop?
How many have worked with ElasticSearch?
How many are consultants?
How many consultants/employees in industry/oil/manufacturing?
How many consultants/employees in commerce/trade/service/IT?
How many consultants/employees in government?
Anyone with Hadoop experience?
This is what I associate with Big Data right now
A lot of buzz… but let us look at what it is at its core and where Hadoop comes into this picture
Doug Laney, who defined the three Vs of big data back in 2001
LOTS OF BORING WORDS… LET US TRY TO LOOK AT AN EXAMPLE
Volume –
Variety – Many datasets
Velocity – The speed at which data is generated
Variability – Data can be inconsistent and come in various forms
Veracity – Quality of data
Complexity
Clickstream data
Ratings
External agreements on ratings and traffic
- The housekeeper registers all activities
- IoT
An open source operating system for data
2002: Open source crawler Nutch by Doug Cutting and Mike Cafarella: the internet crawler. The web was at most 1 billion pages. Limited scalability.
2003: Google releases its GFS paper on a massively distributed filesystem. Cutting and Cafarella incorporate the filesystem into Nutch.
2004: Google releases its MapReduce paper on massively parallel computing. This is incorporated into Nutch as well.
2006: Yahoo hires Doug Cutting, and the filesystem and MapReduce components are extracted from the Nutch project into the Hadoop project.
2008: Hadoop was storing all data. Even financial data was entrusted to Hadoop.
2008: Cloudera was the first commercial company to support Hadoop.
2011: 42,000 nodes storing petabytes of data.
2011: Hortonworks was spun out of Yahoo as a Hadoop company. It focuses only on the open source software with its origin at Yahoo.
2011 – First feature-complete 1.0 version of Hadoop. MapReduce and HDFS are tightly integrated in 1.0 and earlier versions.
2013 – First large refactoring of the operating system. MapReduce is detached and Hadoop is generalized to handle different processing paradigms.
Data nodes contain disks only
The Scheduler allocates based on information available from the nodes
The ApplicationsManager tracks the state of all applications (and their masters) in the cluster
Node Managers constantly update the ResourceManager with the current resource situation
Node Managers start the ApplicationMaster and containers
Application Masters negotiate resources and allocate more containers if allowed
Application Masters negotiate resources and allocate more containers if allowed. CPU cores and memory are requested, along with the hint that my file is located on D2
The application started by the Node Manager does not need to be written in Java.
The big companies (Spotify, Google, Netflix) are the disruptors