This document provides an introduction and overview of big data technologies. It begins with an outline that covers introductions to big data, NoSQL databases, MapReduce and Hadoop, and Hive, HBase and Sqoop. It then discusses relational databases and SQL before introducing NoSQL databases. Key reasons for using NoSQL databases are explained, including improved scalability, lower costs, flexibility in data structures, and high availability. Examples of big data applications and the internet of things are also presented.
Big Data: Architectures and Approaches (ThoughtWorks)
ThoughtWorkers David Elliman and Ashok Subramanian present how the big data world is moving quickly with predictions of amazing industry growth. For more information on how the 'Internet of Things' is playing an increasingly larger role, read David's blog post or watch the video from the London-based event. http://www.thoughtworks.com/insights/blog/big-data-and-internet-things
This presentation was delivered by Jony Sugianto at the Seminar & Workshop on the Introduction & Potential of Big Data & Machine Learning, organized by KUDO on 14 May 2016.
A presentation delivered by Mohammed Barakat on the 2nd Jordanian Continuous Improvement Open Day in Amman. The presentation is about Data Science and was delivered on 3rd October 2015.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy forming Big Data Science. It highlights the benefits of investing in it and defines a path to their evolution within most organisations.
"Big Data" is big business, but what does it really mean? How will big data impact industries and consumers? This slide deck goes through some of the high level details of the market and how it is revolutionizing the world.
My keynote talk at San Diego Superdata conference, looking at history and current state of Analytics and Data Mining, and examining the effects of Big Data
Presentation at Data ScienceTech Institute campuses, Paris and Nice, May 2016 , including Intro, Data Science History and Terms; 10 Real-World Data Science Lessons; Data Science Now: Polls & Trends; Data Science Roles; Data Science Job Trends; and Data Science Future
From the webinar presentation "Data Science: Not Just for Big Data", hosted by Kalido and presented by:
David Smith, Data Scientist at Revolution Analytics, and
Gregory Piatetsky, Editor, KDnuggets
These are the slides for David Smith's portion of the presentation.
Watch the full webinar at:
http://www.kalido.com/data-science.htm
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners... (Simplilearn)
This presentation about Big Data will help you understand how Big Data evolved over the years, what Big Data is, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data and how Hadoop solved those challenges. The case study talks about Google File System (GFS), where you'll learn how Google solved its problem of storing increasing user data in the early 2000s. We'll also look at the history of Hadoop, its ecosystem, and a brief introduction to HDFS, which is a distributed file system designed to store large volumes of data, and MapReduce, which allows parallel processing of data. In the end, we'll run through some basic HDFS commands and see how to perform a word count using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
The Reality of a Data Scientist (데이터 과학자의 실체)
Most of the overall analysis process is spent collecting and preparing data.
And when applying data to an application, testing is the most important step.
Because this material was prepared for an audience of ergonomics majors, it focuses on 'testing (online testing)' rather than on 'data collection and cleansing'.
Rating Prediction using Deep Learning and Spark (Jongwook Woo)
Distributed deep learning to predict Amazon review ratings in Spark using Analytics Zoo on AWS, published as "Rating Prediction using Deep Learning and Spark" at The 11th International Conference on Internet (ICONI 2019), Hanoi, Vietnam, Dec 15-18, 2019.
Despite the existence of data analysis tools such as R, SQL, Excel and others, they remain insufficient to cope with today's big data analysis needs.
The author proposes a CUI (Character User Interface) toolset with dozens of functions to neatly handle tabular data in TSV (Tab Separated Values) files.
It implements many basic and useful functions that have not been implemented in existing software with each function borrowing the ideas of Unix philosophy and covering the most frequent pre-analysis tasks during the initial exploratory stage of data analysis projects.
Also, it greatly speeds up basic analysis tasks, such as drawing cross tables, Venn diagrams, etc., while existing software inevitably requires rather complicated programming and debugging processes for even these basic tasks.
Here, tabular data mainly means TSV (Tab-Separated Values) files as well as other CSV (Comma Separated Value)-type files which are all widely used for storing data and suitable for data analysis.
In this talk, we will discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, like Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.
This solution allows the user, with small modifications in the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution including the architecture of the system, the functionality implemented and working examples.
In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems.
Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and able to scale up to the gigabyte level.
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML (Jongwook Woo)
This talk aims at providing insights, performance figures, and architecture for financial fraud detection on mobile money transaction activity in Azure ML and Spark. We predicted and classified each transaction as normal or fraudulent using a small sample and a massive data set in Azure ML and Spark ML, which represent traditional and Big Data systems respectively. I will present predictive analysis with several classification models experimented with in Azure and Spark ML. In addition, the scalability of Spark ML will be presented for the models with different numbers of nodes in Spark clusters on Amazon AWS.
Big Data Analysis: Deciphering the Haystack (Srinath Perera)
A primary outcome of big data is to derive useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like Mean, Max, and Median, to deriving an overall understanding of the data by building models, and finally to deriving predictions from the data. In some cases we can afford to wait while we collect and process the data, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. Other technologies such as Apache Spark and Apache Drill are gaining ground, along with realtime processing technologies like Stream Processing and Complex Event Processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.
Concepts, use cases and principles to build big data systems (1) (Trieu Nguyen)
1) Introduction to the key Big Data concepts
1.1 The Origins of Big Data
1.2 What is Big Data?
1.3 Why is Big Data So Important?
1.4 How Is Big Data Used In Practice?
2) Introduction to the key principles of Big Data Systems
2.1 How to design Data Pipeline in 6 steps
2.2 Using Lambda Architecture for big data processing
3) Practical case study : Chat bot with Video Recommendation Engine
4) FAQ for student
Fixing data science & Accelerating Artificial Super Intelligence Development (ManojKumarR41)
This presentation discusses Challenges, Problems, Issues, Measures, Mistakes, Opportunities, Ideas, Technologies, Research and Visions around Data Science
HashGraph, Data Mesh, Data Trajectories, Citrix HDX and Anonos BigPrivacy
A combination of these five and a few other ideas will ultimately lead to the VGB Platform. A follow-up document will explain the vision and how to work toward it in order to gradually develop this platform, which aims to fix data science efforts globally.
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools... (Artefactual Systems - AtoM)
These slides accompanied a June 4th, 2016 presentation made by Dan Gillean of Artefactual Systems at the Association of Canadian Archivists' 2016 Conference in Montreal, QC, Canada.
This presentation aims to examine several existing or emerging computing paradigms, with specific examples, to imagine how they might inform next-generation archival systems to support digital preservation, description, and access. Topics covered include:
- Distributed Version Control and git
- P2P architectures and the BitTorrent protocol
- Linked Open Data and RDF
- Blockchain technology
The session is part of an attempt by the ACA to create interactive "working sessions" at its conferences. Accompanying notes can be found at: http://bit.ly/tech-Proche
Participants were also asked to use the Twitter hashtag of #techProche for online interaction during the session.
This presentation gives an insight into what big data and data analytics are, the difference between big data and data science, and also salary trends in big data analytics.
Big Data Management: What's New, What's Different, and What You Need To Know (SnapLogic)
This presentation is from a recorded webinar with 451 Research analyst and thought leader Matt Aslett for a discussion about the growing importance of the right data management best practices and techniques for delivering on the promise of big data in the enterprise. Matt reviews the big data landscape, how the data lake complements and competes with the data warehouse, and key takeaways as you move from big data test and development environments to production. You can watch the webinar here: http://bit.ly/25ShiQu
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has been a source of better career, salary and job opportunities for many professionals.
Too often I hear the question "Can you help me with our data strategy?" Unfortunately, for most, this is the wrong request because it focuses on the least valuable component: the data strategy itself. A more useful request is: "Can you help me apply data strategically?" Yes, at early maturity phases the process of developing strategic thinking about data is more important than the actual product! Trying to write a good (much less perfect) data strategy on the first attempt is generally not productive, particularly given the widespread acceptance of Mike Tyson's truism: "Everybody has a plan until they get punched in the face." This program refocuses efforts on learning how to iteratively improve the way data is strategically applied. This will permit data-based strategy components to keep up with agile, evolving organizational strategies. It also contributes to three primary organizational data goals. Learn how to improve the following:
- Your organization’s data
- The way your people use data
- The way your people use data to achieve your organizational strategy
This will help in ways never imagined. Data are your sole non-depletable, non-degradable, durable strategic assets, and they are pervasively shared across every organizational area. Addressing existing challenges programmatically includes overcoming necessary but insufficient prerequisites and developing a disciplined, repeatable means of improving business objectives. This process (based on the theory of constraints) is where the strategic data work really occurs as organizations identify prioritized areas where better assets, literacy, and support (data strategy components) can help an organization better achieve specific strategic objectives. Then the process becomes lather, rinse, and repeat. Several complementary concepts are also covered, including:
- A cohesive argument for why data strategy is necessary for effective data governance
- An overview of prerequisites for effective strategic use of data strategy, as well as common pitfalls
- A repeatable process for identifying and removing data constraints
- The importance of balancing business operation and innovation
General overview of the Big Data Concept.
Presentation of the Hierarchical Linear Subspace Indexing Method to perform exact similarity search in high dimensional data
This slide deck presents data analytics concepts. Topics are levels of analytics, CRISP-DM, and data science use cases, e.g., customer segmentation, churn prediction, product recommendation, and demand forecasting.
These slides present the concepts of Data Mining and Big Data Analytics. The topics are:
- Internet of Things (IoT)
- Data Science/Mining applications
- Data Science/Mining techniques including (1) Association, (2) Clustering, (3) Classification
- CRISP-DM: Cross Industry Standard Process for Data Mining
This presentation describes the Big Data concept and then shows examples of applications in banking. The presenter is Dr. Tuangtong Wattarujeekrit at the Big Data Analytics Day event.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
26. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Big Data consists of 3 Vs
• Volume
• the amount of data is increasing enormously
• Velocity
• data is growing very rapidly
• Variety
• data is becoming more and more varied
26
source: https://upxacademy.com/beginners-guide-to-big-data/
30. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• the data is extremely large, e.g., billions of rows or millions of columns
• Speed of new data creation and growth
• data is created and grows extremely fast
30
32. http://dataminingtrend.com http://facebook.com/datacube.th
What is Big Data?
• Huge volume of data
• the data is extremely large, e.g., billions of rows or millions of columns
• Speed of new data creation and growth
• data is created and grows extremely fast
• Complexity of data types and structures
• the data is varied and not only in tabular form; it may be text, images, or video clips
32
46. http://dataminingtrend.com http://facebook.com/datacube.th
Relational database & SQL
• Databases are made up of tables, and each table is made up of rows and columns
• SQL is a database interaction language that allows you to add, retrieve, edit and delete information stored in databases
46
ID Mark Code Title
S103 72 DBS Database Systems
S103 58 IAI Intro to AI
S104 68 PR1 Programming 1
S104 65 IAI Intro to AI
S106 43 PR2 Programming 2
S107 76 PR1 Programming 1
S107 60 PR2 Programming 2
S107 35 IAI Intro to AI
47. http://dataminingtrend.com http://facebook.com/datacube.th
Relational database & SQL
• SQL primarily works with two types of operations to query data
• Read consists of the SELECT command, which has three common clauses
• SELECT
• FROM
• WHERE
47
image source: https://justbablog.files.wordpress.com/2017/03/sql_beginners.jpg
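To make SELECT, FROM, and WHERE concrete, here is a minimal sketch in Python using the standard-library sqlite3 module and the module-marks table from the previous slide; the table name results is chosen only for illustration.

import sqlite3

# Build an in-memory database holding the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (ID TEXT, Mark INTEGER, Code TEXT, Title TEXT)")
rows = [
    ("S103", 72, "DBS", "Database Systems"),
    ("S103", 58, "IAI", "Intro to AI"),
    ("S104", 68, "PR1", "Programming 1"),
    ("S104", 65, "IAI", "Intro to AI"),
    ("S106", 43, "PR2", "Programming 2"),
    ("S107", 76, "PR1", "Programming 1"),
    ("S107", 60, "PR2", "Programming 2"),
    ("S107", 35, "IAI", "Intro to AI"),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# SELECT ... FROM ... WHERE: students who scored at least 60 in "Intro to AI".
query = "SELECT ID, Mark FROM results WHERE Title = ? AND Mark >= ?"
for student_id, mark in conn.execute(query, ("Intro to AI", 60)):
    print(student_id, mark)   # -> S104 65
conn.close()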
49. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
• Relational databases have been the dominant type of database used for applications for decades.
• With the advent of the Web, however, the limitations of relational
databases became increasingly problematic.
• Companies such as Google, LinkedIn, Yahoo! and Amazon found that
supporting large numbers of users on the Web was different from
supporting much smaller numbers of business users.
49
51. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?
• Web applications needed to support
• Large volumes of read and write operations
• Low latency response times
• High availability
• These requirements were difficult to realise using relational databases.
• There are limits to how many CPUs and memory can be supported in a
single server.
• Another option is to use multiple servers with a relational database.
• operating a single RDBMS over multiple servers is a complex operation
51
53. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Scalability
• Scalability is the ability to efficiently meet the needs for varying
workloads.
• For example, if there is a spike in traffic to a website, additional
servers can be brought online to handle the additional load.
• When the spike subsides and traffic returns to normal, some of
those additional servers can be shut down.
• Adding servers as needed is called scaling out.
53
55. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Scalability
• Scaling out is more flexible than scaling up.
• Servers can be added or removed as needed when scaling out.
• NoSQL databases are designed to utilise servers available in a cluster with minimal intervention by database administrators.
55
56. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Cost
• Commercial software vendors employ a variety of licensing
models that include charging by
• the size of the server running the RDBMS
• the number of concurrent users on the database
• the number of named users allowed to use the software
• The major NoSQL databases are available as open source. It’s free to
use on as many servers of whatever size needed
56
58. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Flexibility
• Database designers expect to know at the start of a project all
the tables and columns that will be needed to support an
application.
• It is also commonly assumed that most of the columns in a table
will be needed by most of the rows.
• Unlike relational databases, some NoSQL databases do not
require a fixed table structure.
• For example, in a document database, a program could
dynamically add new attributes as needed without having to have a
database designer alter the database design.
58
59. http://dataminingtrend.com http://facebook.com/datacube.th
Why NoSQL?: Availability
• Many of us have come to expect websites and web applications
to be available whenever we want to use them.
• NoSQL databases are designed to take advantage of multiple,
low-cost servers.
• When one server fails or is taken out of service for maintenance,
the other servers in the cluster can take on the entire workload.
59
61. http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Key-value databases are the simplest form of NoSQL
databases.
• These databases are modelled on two components:
keys and values
• Data is stored as key-value pairs, where the attribute is the key and the content is the value
• Data can only be queried and retrieved using the key.
61
62. http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• use cases
• caching data from
relational databases to
improve performance
• storing data from
sensors (IoT)
• software
• redis
• Amazon DynamoDB
62
Keys -> Values (example customer record):
1.accountNumber -> 387694
1.Name -> Jane Washington
1.numItems -> 3
1.custType -> Loyalty Member
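As a minimal sketch of how an application would use such a store, the snippet below writes the customer attributes shown above as key-value pairs with the redis-py client; it assumes the redis Python package is installed and a Redis server is listening on localhost:6379.

import redis

# Connect to a local Redis server (assumed to be running on the default port).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store each attribute of customer 1 under its own key, as on the slide.
r.set("1.accountNumber", "387694")
r.set("1.Name", "Jane Washington")
r.set("1.numItems", "3")
r.set("1.custType", "Loyalty Member")

# Values can only be fetched back by their key.
print(r.get("1.Name"))       # Jane Washington
print(r.get("1.numItems"))   # 3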
63. http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Redis example (http://try.redis.io)
• Set or update value against a key:
• SET university "DPU" // set string
• GET university // get string
• HSET student firstName "Manee" // Hash – set field value
• HGET student firstName // Hash – get field value
• LPUSH "alice:sales" "10" "20" // List create/append
• LSET "alice:sales" "0" "4" // List update
• LRANGE "alice:sales" 0 1 // view list
63
64. http://dataminingtrend.com http://facebook.com/datacube.th
Key-Value databases
• Set or update value against a key:
• SET quantities 1
• INCR quantities
• SADD "alice:friends" "f1" "f2" //Set – create/update
• SADD "bob:friends" "f2" "f1" //Set – create/update
• Set operations:
• intersection
• SINTER "alice:friends" "bob:friends"
• union
• SUNION "alice:friends" "bob:friends"
64
66. http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• A document store allows the inserting, retrieving, and
manipulating of semi-structured data.
• Compared to an RDBMS, the documents themselves act as records (or rows); however, the data is semi-structured rather than fitting the rigid RDBMS model.
• It can store records that have different sets of data fields (columns)
• Most of the databases available under this category use XML or JSON
66
69. http://dataminingtrend.com http://facebook.com/datacube.th
Document Databases
• MongoDB examples
• Download MongoDB from https://www.mongodb.com/download-
center?jmp=nav#community
• MongoDB's default data directory path is the absolute path \data\db on the drive from which you start MongoDB
• You can specify an alternate path for data files using the --dbpath option to mongod.exe:
"C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath d:\test\mongodb\data
• Import example data:
mongoimport --db test --collection restaurants --drop --file downloads/primer-dataset.json
69
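As a hedged sketch of working with this data from Python, the snippet below uses pymongo against the restaurants collection imported above; it assumes pymongo is installed, mongod is running locally, and that the field names (borough, cuisine) follow MongoDB's primer dataset.

from pymongo import MongoClient

# Connect to the local MongoDB instance and the "test" database used by mongoimport.
client = MongoClient("mongodb://localhost:27017/")
restaurants = client["test"]["restaurants"]

# Documents are semi-structured: insert one with its own set of fields.
restaurants.insert_one({"name": "Data Cafe", "borough": "Bangkok", "cuisine": "Thai"})

# Query by field value; each result comes back as a Python dict.
for doc in restaurants.find({"cuisine": "Thai"}).limit(3):
    print(doc.get("name"), doc.get("borough"))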
74. http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Store data as columns, as opposed to the rows that are prominent in an RDBMS
• A relational database presents the data as two-dimensional tables comprising rows and columns, but stores, retrieves, and processes it one row at a time
• A column-oriented database stores each column continuously.
i.e. on disk or in-memory each column on the left will be stored
in sequential blocks.
74
76. http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Advantages of column-based tables:
• Faster Data Access:
• Only affected columns have to be read during the selection
process of a query. Any of the columns can serve as an index.
• Better Compression:
• Columnar data storage allows highly efficient compression
because the majority of the columns contain only few distinct
values (compared to number of rows).
76
77. http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Advantages of column-based tables:
• Better parallel Processing:
• In a column store, data is already vertically partitioned. This
means that operations on different columns can easily be
processed in parallel.
• If multiple columns need to be searched or aggregated, each of
these operations can be assigned to a different processor core.
77
78. http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• For analytic applications, where aggregations are used and fast search and processing are required, row-based storage is not a good fit.
• In row-based tables, all data stored in a row has to be read even though only a few columns may actually be needed.
• Hence, these queries over huge amounts of data would take a lot of time.
• In columnar tables, this information is stored physically next to each other, which significantly increases the speed of certain data queries.
78
79. http://dataminingtrend.com http://facebook.com/datacube.th
Column-oriented databases
• Column storage is most useful for OLAP queries (queries using
any SQL aggregate functions). Because, these queries get just
a few attributes from every data entry.
• But for traditional OLTP queries (queries not using any SQL
aggregate functions), it is more advantageous to store all
attributes side-by-side in row tables
79
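A toy in-memory sketch of the difference (an illustration only, not how a real column store is implemented): the row layout keeps whole records together, while the column layout keeps each column contiguous, so a single column can be scanned, and compressed, on its own.

# Row-oriented layout: one record per row, all columns together.
rows = [
    {"id": "S103", "mark": 72, "code": "DBS"},
    {"id": "S104", "mark": 68, "code": "PR1"},
    {"id": "S106", "mark": 43, "code": "PR2"},
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "id":   ["S103", "S104", "S106"],
    "mark": [72, 68, 43],
    "code": ["DBS", "PR1", "PR2"],
}

# An OLAP-style aggregate over one column:
# the row layout touches every record, the column layout scans one list.
avg_from_rows = sum(r["mark"] for r in rows) / len(rows)
avg_from_columns = sum(columns["mark"]) / len(columns["mark"])
print(avg_from_rows, avg_from_columns)   # both 61.0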
85. http://dataminingtrend.com http://facebook.com/datacube.th
Graph databases
• Graph databases are the most specialized of the four NoSQL database types.
• Instead of modelling data using columns and rows, a graph database uses structures called nodes and relationships.
• in more formal discussions, they are called vertices and edges
• A node is an object that has an identifier and a set of attributes
• A relationship is a link between two nodes that contains attributes about that relation.
• Graph databases are designed to model adjacency between objects: every node in the database contains pointers to adjacent objects in the database.
• This allows for fast operations that require following paths through a graph.
85
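A minimal in-memory sketch of the node/relationship model: nodes carry attributes, relationships link two node identifiers, and traversal simply follows adjacent nodes. The names below are invented for illustration; a real graph database provides dedicated storage and a query language for this.

# Nodes: identifier -> attributes.
nodes = {
    "alice": {"label": "Person", "city": "London"},
    "bob":   {"label": "Person", "city": "Paris"},
    "acme":  {"label": "Company"},
}

# Relationships: (from, to, attributes). Each end points at a node id.
relationships = [
    ("alice", "bob",  {"type": "FRIEND_OF", "since": 2015}),
    ("bob",   "acme", {"type": "WORKS_AT"}),
]

# Adjacency: every node keeps pointers to its neighbours,
# which makes path-following operations cheap.
adjacency = {}
for src, dst, attrs in relationships:
    adjacency.setdefault(src, []).append(dst)

def reachable(start):
    """Follow relationships outward from a start node."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen

print(reachable("alice"))   # {'bob', 'acme'} (order may vary)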
89. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Hadoop is composed of two primary components that
implement the basic concepts of distributed storage and
computation: HDFS and YARN
• HDFS (sometimes shortened to DFS) is the Hadoop Distributed
File System, responsible for managing data stored on disks
across the cluster.
• YARN acts as a cluster resource manager, allocating
computational assets (processing availability and memory on
worker nodes) to applications that wish to perform a distributed
computation.
89
91. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• HDFS and YARN work in concert to minimize the amount of
network traffic in the cluster primarily by ensuring that data is
local to the required computation.
• A set of machines that is running HDFS and YARN is known as a
cluster, and the individual machines are called nodes.
• A cluster can have a single node, or many thousands of nodes,
but all clusters scale horizontally, meaning as you add more
nodes, the cluster increases in both capacity and performance
in a linear fashion.
91
92. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Each node in the cluster is identified by the type of process that
it runs:
• Master nodes
• These nodes run coordinating services for Hadoop workers and
are usually the entry points for user access to the cluster.
• Worker nodes
• Worker nodes run services that accept tasks from master nodes
either to store or retrieve data or to run a particular application.
• A distributed computation is run by parallelizing the analysis
across worker nodes.
92
93. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• For HDFS, the master and worker services are as follows:
• NameNode (Master)
• Stores the directory tree of the file system, file metadata, and the
location of each file in the cluster.
• Clients wanting to access HDFS must first locate the appropriate
storage nodes by requesting information from the NameNode.
• DataNode (Worker)
• Stores and manages HDFS blocks on the local disk.
• Reports health and status of individual data stores back to the
NameNode
93
95. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• When data is accessed from HDFS
• a client application must first make a request to the NameNode to
locate the data on disk.
• The NameNode will reply with a list of DataNodes that store the
data.
• the client must then directly request each block of data from the
DataNode.
95
96. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• YARN has multiple master services and a worker service as
follows:
• ResourceManager (Master)
• Allocates and monitors available cluster resources (e.g.,
physical assets like memory and processor cores)
• handling scheduling of jobs on the cluster
• ApplicationMaster (Master)
• Coordinates a particular application being run on the cluster as
scheduled by the ResourceManager
96
99. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Clients that wish to execute a job
• must first request resources from the ResourceManager, which
assigns an application-specific ApplicationMaster for the duration
of the job.
• the ApplicationMaster tracks the execution of the job.
• the ResourceManager tracks the status of the nodes
• each individual NodeManager creates containers and executes
tasks within them
99
100. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop architecture
• Finally, one other type of cluster is important to note: a single node
cluster.
• In “pseudo-distributed mode” a single machine runs all Hadoop
daemons as though it were part of a cluster, but network traffic occurs
through the local loopback network interface.
• Hadoop developers typically work in a pseudo-distributed environment,
usually inside of a virtual machine to which they connect via SSH.
• Cloudera, Hortonworks, and other popular distributions of Hadoop
provide pre-built virtual machine images that you can download and
get started with right away.
100
101. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Distributed File System (HDFS)
• HDFS provides redundant storage for big data by storing that
data across a cluster of cheap, unreliable computers, thus
extending the amount of available storage capacity that a single
machine alone might have.
• HDFS performs best with a modest number of very large files
• millions of large files (100 MB or more) rather than billions of smaller files that might occupy the same volume.
• It is not a good fit as a data backend for applications that require
updates in real-time, interactive data analysis, or record-based
transactional support.
101
102. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Distributed File System (HDFS)
• HDFS files are split into blocks, usually of either 64MB or
128MB.
• Blocks allow very large files to be split across and distributed to
many machines at run time.
• Additionally, blocks will be replicated across the DataNodes.
• by default, the replication factor is three
• Therefore, each block exists on three different machines and three different disks, and even if two nodes fail, the data will not be lost.
102
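A quick back-of-the-envelope sketch of what blocking and replication mean for storage; the 1 GB file size is only an example, and 128 MB and 3 are the commonly cited defaults for block size and replication factor.

import math

file_size_mb = 1024      # example: a 1 GB file
block_size_mb = 128      # common HDFS block size
replication = 3          # HDFS default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
stored_copies = blocks * replication
raw_storage_mb = file_size_mb * replication

print(blocks)            # 8 blocks
print(stored_copies)     # 24 block replicas spread across DataNodes
print(raw_storage_mb)    # 3072 MB of raw disk used for 1024 MB of data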
103. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Interacting with HDFS is primarily performed from the command
line using the script named hdfs. The hdfs script has the
following usage:
• The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that are specified for this option.
• For example, show help
103
$ hadoop fs [-option <arg>]
$ hadoop fs -help
104. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• List directory contents
• use -ls command:
• Running the -ls command on a new cluster will not return any
results. This is because the -ls command, without any
arguments, will attempt to display the contents of the user’s
home directory on HDFS.
• Providing -ls with the forward slash (/) as an argument displays the
contents of the root of HDFS:
104
$ hadoop fs -ls
$ hadoop fs -ls /
105. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Creating a directory
• To create the books directory within HDFS, use the -mkdir
command:
• For example, create books directory in home directory
• Use the -ls command to verify that the previous directories were
created:
105
$ hadoop fs -mkdir [directory name]
$ hadoop fs -mkdir books
$ hadoop fs -ls
106. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Copy Data onto HDFS
• After a directory has been created for the current user, data can
be uploaded to the user’s HDFS home directory with the -put
command:
• For example, copy book file from local to HDFS
• Use the -ls command to verify that pg20417.txt was moved to
HDFS:
106
$ hadoop fs -put [source file] [destination file]
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -ls books
107. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Multiple commands allow data to be retrieved from HDFS.
• To simply view the contents of a file, use the -cat command. -cat
reads a file on HDFS and displays its contents to stdout.
• The following command uses -cat to display the contents of
pg20417.txt
•
107
$ hadoop fs -cat books/pg20417.txt
108. http://dataminingtrend.com http://facebook.com/datacube.th
Interacting with HDFS
• Retrieve (view) Data from HDFS
• Data can also be copied from HDFS to the local filesystem using
the -get command. The -get command is the opposite of the -put
command:
• For example, This command copies pg20417.txt from HDFS to the
local filesystem.
108
$ hadoop fs -get [source file] [destination file]
$ hadoop fs -get pg20417.txt .
109. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• MapReduce is a programming model that enables large volumes of data
to be processed and generated by dividing work into independent tasks
and executing the tasks in parallel across a cluster of machines.
• At a high level, every MapReduce program transforms a list of input data
elements into a list of output data elements twice, once in the map phase
and once in the reduce phase.
• The MapReduce framework is composed of three major phases: map,
shuffle and sort, and reduce.
109
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
110. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Map
• The first phase of a MapReduce application is the map phase.
Within the map phase, a function (called the mapper) processes a
series of key-value pairs.
• The mapper sequentially processes each key-value pair
individually, producing zero or more output key-value pairs
• As an example, consider a mapper whose purpose is to transform
sentences into words.
110
111. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Map
• The input to this mapper would be strings that contain sentences,
and the mapper’s function would be to split the sentences into
words and output the words
111
Image source: "Hadoop with Python", Zachary Radtka and Donald Miner, 2016
112. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Shuffle and Sort
• As the mappers begin completing, the intermediate outputs from
the map phase are moved to the reducers. This process of moving
output from the mappers to the reducers is known as shuffling.
• Shuffling is handled by a partition function, known as the
partitioner. The partitioner ensures that all of the values for the
same key are sent to the same reducer.
• The intermediate keys and values for each partition are sorted by
the Hadoop framework before being presented to the reducer.
112
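Hadoop's default partitioner hashes the intermediate key modulo the number of reducers; a minimal Python sketch of that idea:

def partition(key, num_reducers):
    # Hash the key and map it onto one of the reducers; all occurrences
    # of the same key land on the same reducer.
    return hash(key) % num_reducers

# Example: route word-count keys to 4 reducers.
for word in ["The", "cat", "hat", "cat"]:
    print(word, "->", "reducer", partition(word, 4))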
113. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce
• Reduce
• Within the reducer phase, an iterator of values is provided to a
function known as the reducer. The iterator of values is a non-unique set of values for each unique key from the output of the map phase.
• The reducer aggregates the values for each unique key and
produces zero or more output key-value pairs
• As an example, consider a reducer whose purpose is to sum all of
the values for a key. The input to this reducer is an iterator of all of
the values for a key, and the reducer sums all of the values.
113
117. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• The word-counting application takes as input one or more text
files and produces a list of word and their frequencies as output.
117
Image source: "Data Analytics with Hadoop: An Introduction for Data Scientists", Benjamin Bengfort and Jenny Kim, 2016
118. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Because Hadoop utilizes key/value pairs, the input key is a file ID and line number and the input value is a string, while the output key is a word and the output value is an integer.
• The following Python pseudocode shows how this algorithm is
implemented:
118
# emit is a function that performs hadoop I/O
def map(dockey, line):
    for word in line.split():
        emit(word, 1)

def reduce(word, values):
    count = sum(value for value in values)
    emit(word, count)
126. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: word count
• Example (Map)
Input:
(27183, "The fast cat wears no hat.")
(31416, "The cat in the hat ran fast.")
Mapper 1 output: ("The",1) ("fast",1) ("cat",1) ("wears",1) ("no",1) ("hat",1) (".",1)
Mapper 2 output: ("The",1) ("cat",1) ("in",1) ("the",1) ("hat",1) ("ran",1) ("fast",1) (".",1)
150. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: IoT
• IoT applications create an enormous amount of data that has to be processed. This data is generated by physical sensors that take measurements, like the room temperature at 8:00.
• Every measurement consists of
• a key (the timestamp when the measurement has been taken) and
• a value (the actual value measured by the sensor).
• for example, (2016-05-01 01:02:03, 1).
• The goal of this exercise is to create average daily values of that
sensor’s data.
150
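The mapper and reducer for this exercise can be sketched in the same pseudocode style as the earlier word-count example; emit is again a stand-in for Hadoop I/O, and the tiny driver at the end only illustrates the data flow.

def emit(key, value):
    # stand-in for Hadoop I/O, as in the earlier pseudocode
    print(key, value)

def map(timestamp, value):
    # "2016-05-01 01:02:03" -> day key "2016-05-01"
    day = timestamp.split(" ")[0]
    emit(day, float(value))

def reduce(day, values):
    values = list(values)
    emit(day, sum(values) / len(values))

# tiny local check with two measurements from the same day
map("2016-05-01 01:02:03", 1)
map("2016-05-01 02:02:03", 3)
reduce("2016-05-01", [1.0, 3.0])   # emits ('2016-05-01', 2.0)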
161. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• In the shared friendship task, the goal is to analyze a social
network to see which friend relationships users have in
common.
• Given an input data source where the key is the name of a user
and the value is a comma-separated list of friends.
161
162. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• The following Python pseudocode demonstrates how to perform
this computation:
162
def map(person, friends):
    for friend in friends.split(","):
        pair = sorted([person, friend])
        emit(pair, friends)

def reduce(pair, friends):
    # friends holds the two friend lists emitted for this pair
    shared = set(friends[0].split(","))
    shared = shared.intersection(friends[1].split(","))
    emit(pair, shared)
163. http://dataminingtrend.com http://facebook.com/datacube.th
MapReduce examples: shared friendship
• The mapper creates an intermediate keyspace of all of the possible (friend, friend) tuples that exist in the initial dataset.
• This allows us to analyze the dataset on a per-relationship basis, as the value is the list of associated friends.
• The pair is sorted, which ensures that the inputs ("Mike", "Linda") and ("Linda", "Mike") end up being the same key during aggregation in the reducer.
163
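A small worked example may help; the names and friend lists below are invented for illustration. For a user Alice whose friend list is "Bob,Carol" and a user Bob whose list is "Alice,Carol", the mapper emits the sorted pair ("Alice", "Bob") twice, once with each friend list, and the reducer intersects the two lists:

# Simulate the shuffle for one intermediate key, ("Alice", "Bob").
emitted = [
    (("Alice", "Bob"), "Bob,Carol"),    # from mapping Alice's friend list
    (("Alice", "Bob"), "Alice,Carol"),  # from mapping Bob's friend list
]

pair = ("Alice", "Bob")
friend_lists = [value for key, value in emitted if key == pair]

shared = set(friend_lists[0].split(","))
shared = shared.intersection(friend_lists[1].split(","))
print(pair, shared)   # ('Alice', 'Bob') {'Carol'}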
171. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• Hadoop streaming is a utility that comes packaged with the
Hadoop distribution and allows MapReduce jobs to be created
with any executable as the mapper and/or the reducer.
• The Hadoop streaming utility enables Python, shell scripts, or any
other language to be used as a mapper, reducer, or both.
• The mapper and reducer are both executables that
• read input, line by line, from the standard input (stdin),
• and write output to the standard output (stdout).
• The Hadoop streaming utility creates a MapReduce job, submits the job
to the cluster, and monitors its progress until it is complete.
171
172. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• When the mapper is initialized, each map task launches the
specified executable as a separate process.
• The mapper reads the input file and presents each line to the
executable via stdin. After the executable processes each line
of input, the mapper collects the output from stdout and
converts each line to a key-value pair.
• The key consists of the part of the line before the first tab
character, and the value consists of the part of the line after the
first tab character.
172
173. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming
• When the reducer is initialized, each reduce task launches the
specified executable as a separate process.
• The reducer converts the input key-value pair to lines that are
presented to the executable via stdin.
• The reducer collects the executable's result from stdout and converts each line to a key-value pair.
• Similar to the mapper, the executable specifies key-value pairs
by separating the key and value by a tab character.
173
175. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• The WordCount application can be implemented as two Python
programs: mapper.py and reducer.py.
• mapper.py is the Python program that implements the logic in
the map phase of WordCount.
• It reads data from stdin, splits the lines into words, and outputs
each word with its intermediate count to stdout.
175
176. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• mapper.py
176
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
177. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py is the Python program that implements the logic in
the reduce phase of WordCount.
• It reads the results of mapper.py from stdin, sums the
occurrences of each word, and writes the result to stdout.
• reducer.py
177
#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
178. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py (cont’)
178
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
179. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• reducer.py (cont’)
179
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
180. http://dataminingtrend.com http://facebook.com/datacube.th
Hadoop Streaming example
• Before attempting to execute the code, ensure that the
mapper.py and reducer.py files have execution permission.
• The following command will enable this for both files:
• Also ensure that the first line of each file contains the proper
path to Python. This line enables mapper.py and reducer.py to
execute as standalone executables.
• It is highly recommended to test all programs locally before
running them across a Hadoop cluster.
$ chmod +x mapper.py reducer.py
$ echo 'The fast cat wears no hat' | ./mapper.py | sort -k1,1 | ./reducer.py
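• For the sample sentence above, each word occurs exactly once, so the local test should print one tab-separated line per word (ordering depends on the sort locale), for example:
The	1
cat	1
fast	1
hat	1
no	1
wears	1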
• Download 3 ebooks from Project Gutenberg
• The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson (659 KB)
• The Notebooks of Leonardo Da Vinci (1.4 MB)
• Ulysses by James Joyce (1.5 MB)
• Before we run the actual MapReduce job, we must first copy the
files from our local file system to Hadoop’s HDFS.
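• If the books directory does not already exist in HDFS, it may need to be created first (a setup step not shown on the original slide):
$ hadoop fs -mkdir -p books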
$ hadoop fs -put pg20417.txt books/pg20417.txt
$ hadoop fs -put 5000-8.txt books/5000-8.txt
$ hadoop fs -put 4300-0.txt books/4300-0.txt
$ hadoop fs -ls books
• The mapper and reducer programs can be run as a
MapReduce application using the Hadoop streaming utility.
• The command to run the Python programs mapper.py and
reducer.py on a Hadoop cluster is as follows:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books/output
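• Once the job completes, the word counts can be inspected directly from the output directory (a usage sketch, not from the original slides; part-* is the usual name pattern for streaming output files):
$ hadoop fs -cat /user/hduser/books/output/part-* | head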
• Options for Hadoop streaming
Option    Description
-files    A comma-separated list of files to be copied to the MapReduce cluster
-mapper   The command to be run as the mapper
-reducer  The command to be run as the reducer
-input    The DFS input path for the Map step
-output   The DFS output directory for the Reduce step
Python MapReduce library: mrjob
• mrjob is a Python MapReduce library, created by Yelp, that
wraps Hadoop streaming, allowing MapReduce applications to
be written in a more Pythonic manner.
• mrjob enables multistep MapReduce jobs to be written in pure
Python.
• MapReduce jobs written with mrjob can be tested locally, run on
a Hadoop cluster, or run in the cloud using Amazon Elastic
MapReduce (EMR).
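• mrjob is not bundled with Hadoop; it can typically be installed from PyPI:
$ pip install mrjob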
mrjob example
• word_count.py
• To run the job locally and count the frequency of words within a
file named pg20417.txt, use the following command:
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()
$ python word_count.py books/pg20417.txt
• The MapReduce job is defined as the class, MRWordCount. Within the
mrjob library, the class that inherits from MRJob contains the methods
that define the steps of the MapReduce job.
• The steps within an mrjob application are mapper, combiner, and
reducer. The class inheriting MRJob only needs to define one of these
steps.
• The mapper() method defines the mapper for the MapReduce job. It
takes key and value as arguments and yields tuples of (output_key,
output_value).
• In the WordCount example, the mapper ignored the input key and split
the input value to produce words and counts.
• The combiner is a process that runs after the mapper and before
the reducer.
• It receives as input the data emitted by the mapper, and its output is sent to the reducer. Like the mapper and reducer, the combiner yields tuples of (output_key, output_value).
• The reducer() method defines the reducer for the MapReduce job.
• It takes a key and an iterator of values as arguments and yields
tuples of (output_key, output_value).
• In this example, the reducer sums the values for each key, which represent the frequency of each word in the input.
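• As a rough sketch (not from the original slides), adding a combiner to the word count job lets each mapper pre-aggregate its own counts before they are shuffled to the reducers:
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    # combiner: runs on the map side, summing the partial counts for
    # each word so that less data is shuffled to the reducers
    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordCount.run()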
• The final component of a MapReduce job written with the mrjob
library is the two lines at the end of the file:
if __name__ == '__main__':
MRWordCount.run()
• These lines enable the execution of mrjob; without them, the
application will not work.
• Executing a MapReduce application with mrjob is similar to
executing any other Python program. The command line must
contain the name of the mrjob application and the input file:
$ python mr_job.py input.txt
• By default, mrjob runs locally, allowing code to be developed
and debugged before being submitted to a Hadoop cluster.
• To change how the job is run, specify the -r/--runner option.
$ python word_count.py -r hadoop hdfs:///user/hduser/books/pg20417.txt
Introduction
• The Hadoop ecosystem emerged as a cost-effective way of working with large datasets
• It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity hardware
• Underneath this computation model is a distributed file system called
Hadoop Distributed Filesystem (HDFS)
• However, a challenge remains; how do you move an existing data
infrastructure to Hadoop, when that infrastructure is based on traditional
relational databases and the Structured Query Language (SQL)?
• This is where Hive comes in. Hive provides an SQL dialect, called
Hive Query Language (abbreviated HiveQL or just HQL) for querying
data stored in a Hadoop cluster.
• SQL knowledge is widespread for a reason; it’s an effective,
reasonably intuitive model for organizing and using data.
• Mapping these familiar data operations to the low-level MapReduce
Java API can be daunting, even for experienced Java developers.
• Hive does this dirty work for you, so you can focus on the query itself.
Hive translates most queries to MapReduce jobs, thereby exploiting
the scalability of Hadoop, while presenting a familiar SQL abstraction.
• Hive is most suited for data warehouse applications, where relatively
static data is analyzed, fast response times are not required, and when
the data is not changing rapidly.
• Apache Hive is a “data warehousing” framework built on top of
Hadoop.
• Hive provides data analysts with a familiar SQL-based interface to
Hadoop, which allows them to attach structured schemas to data in
HDFS and access and analyze that data using SQL queries.
• Hive has made it possible for developers who are fluent in SQL to
leverage the scalability and resilience of Hadoop without requiring them
to learn Java or the native MapReduce API.
Hive in the Hadoop Ecosystem
• There are several ways to interact with Hive
• CLI: command-line interface
• GUI: Graphical User Interface
• Karmasphere (http://karmasphere.com)
• Cloudera’s open source Hue (https://github.com/cloudera/hue)
• All commands and queries go to the Driver, which compiles the
input, optimizes the computation required, and executes the
required steps, usually with MapReduce jobs.
• Hive communicates with the JobTracker to initiate the MapReduce job.
• Hive does not have to be running on the same master node with the
JobTracker. In larger clusters, it’s common to have edge nodes where
tools like Hive run.
• They communicate remotely with the JobTracker on the master node
to execute jobs. Usually, the data files to be processed are in HDFS,
which is managed by the NameNode.
• The Metastore is a separate relational database (usually a MySQL
instance) where Hive persists table schemas and other system
metadata.
Structured Data Queries with Hive
• Hive provides its own dialect of SQL called the Hive Query Language,
or HQL.
• HQL supports many commonly used SQL statements, including data definition language (DDL) statements (e.g., CREATE DATABASE/SCHEMA/TABLE), data manipulation language (DML) statements (e.g., INSERT, UPDATE, LOAD), and data retrieval queries (e.g., SELECT).
• Hive commands and HQL queries are compiled into an execution plan or a series of HDFS operations and/or MapReduce jobs, which are then executed on a Hadoop cluster.
• Additionally, Hive queries entail higher latency because of the overhead required to generate and launch the compiled MapReduce jobs on the cluster; even small queries that would complete within a few seconds on a traditional RDBMS may take several minutes to finish in Hive.
• On the plus side, Hive provides the high scalability and high throughput that you would expect from any Hadoop-based application.
• It is very well suited to batch-level workloads for online analytical
processing (OLAP) of very large datasets at the terabyte and petabyte
scale.
The Hive Command-Line Interface (CLI)
• Hive’s installation comes packaged with a handy command-line
interface (CLI), which we will use to interact with Hive and run
our HQL statements.
• This will initiate the CLI and bootstrap the logger (if configured)
and Hive history file, and finally display a Hive CLI prompt:
• You can view the full list of Hive options for the CLI by using the
-H flag:
$ hive
hive>
$ hive -H
Creating a database
• Creating a database in Hive is very similar to creating a
database in a SQL-based RDBMS, by using the CREATE
DATABASE or CREATE SCHEMA statement:
• When Hive creates a new database, the schema definition data
is stored in the Hive metastore.
• Hive will raise an error if the database already exists in the
metastore; we can check for the existence of the database by
using IF NOT EXISTS:
• HQL: CREATE DATABASE IF NOT EXISTS flight_data;
• We can then run SHOW DATABASES to verify that our database has
been created. Hive will return all databases found in the
metastore, along with the default Hive database:
• HQL: SHOW DATABASES;
Creating tables
• Hive provides a SQL-like CREATE TABLE statement, which in its
simplest form takes a table name and column definitions:
• HQL: CREATE TABLE airlines (code INT, description STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       STORED AS TEXTFILE;
• However, because Hive data is stored in the file system (usually HDFS or the local file system), the CREATE TABLE command also takes optional clauses, such as the ROW FORMAT clause above, which tells Hive how to read each row in the file and map its fields to our columns.
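• As a sketch only (the column names below are inferred from the queries later in this deck; the real flights table has many more columns), a minimal flights table could be declared in the same way:
• HQL: CREATE TABLE IF NOT EXISTS flights (
           airline_code STRING,
           depart_delay INT,
           arrive_delay INT,
           is_cancelled BOOLEAN)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       STORED AS TEXTFILE;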
Loading data
• It's important to note one key distinction between Hive and traditional RDBMSs with regard to schema enforcement:
• Traditional relational databases enforce the schema on writes
by rejecting any data that does not conform to the schema as
defined;
• Hive enforces the schema only when data is read (schema-on-read). If, when reading the data file, the file structure does not match the defined schema, Hive will generally return null values for missing fields or type mismatches.
• Data loading in Hive is done in batch-oriented fashion using a bulk LOAD
DATA command or by inserting results from another query with the
INSERT command.
• LOAD DATA is Hive's bulk-loading command; INPATH takes as its argument a path on the default file system (in this case, HDFS).
• We can also specify a path on the local file system by using LOCAL
INPATH instead. Hive proceeds to move the file into the warehouse
location.
• If the OVERWRITE keyword is used, then any existing data in the target
table will be deleted and replaced by the data file input; otherwise, the
new data is added to the table.
• Examples
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/ontime_flights.tsv'
       OVERWRITE INTO TABLE flights;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/airlines.tsv'
       OVERWRITE INTO TABLE airlines;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/carriers.tsv'
       OVERWRITE INTO TABLE carriers;
• HQL: LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/flight_data/cancellation_reasons.tsv'
       OVERWRITE INTO TABLE cancellation_reasons;
Data Analysis with Hive
• Aggregations
• HQL:
SELECT airline_code,
       COUNT(1) AS num_flights,
       SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
       SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
       SUM(IF(is_cancelled, 1, 0)) AS num_cancelled
FROM flights
GROUP BY airline_code;
• Aggregations
• HQL:
SELECT airline_code,
       COUNT(1) AS num_flights,
       SUM(IF(depart_delay > 0, 1, 0)) AS num_depart_delays,
       ROUND(SUM(IF(depart_delay > 0, 1, 0)) / COUNT(1), 2) AS depart_delay_rate,
       SUM(IF(arrive_delay > 0, 1, 0)) AS num_arrive_delays,
       ROUND(SUM(IF(arrive_delay > 0, 1, 0)) / COUNT(1), 2) AS arrive_delay_rate,
       SUM(IF(is_cancelled, 1, 0)) AS num_cancelled,
       ROUND(SUM(IF(is_cancelled, 1, 0)) / COUNT(1), 2) AS cancellation_rate
FROM flights
GROUP BY airline_code
ORDER BY cancellation_rate DESC, arrive_delay_rate DESC, depart_delay_rate DESC;
Introduction to HBase
• While Hive provides a familiar data manipulation paradigm within
Hadoop, it doesn’t change the storage and processing paradigm,
which still utilizes HDFS and MapReduce in a batch-oriented fashion.
• Thus, for use cases that require random, real-time read/write access
to data, we need to look outside of standard MapReduce and Hive for
our data persistence and processing layer.
• The real-time applications need to record high volumes of time-based
events that tend to have many possible structural variations.
• The data may be keyed on a certain value, like User, but the value is
often represented as a collection of arbitrary metadata.
• For example, consider two events, "Like" and "Share", which require different column values.
• In a relational model, rows are sparse but columns are not. That is, upon
inserting a new row to a table, the database allocates storage for every column
regardless of whether a value exists for that field or not.
• However, in applications where data is represented as a collection of arbitrary
fields or sparse columns, each row may use only a subset of available columns,
which can make a standard relational schema both a wasteful and awkward fit.
Column-Oriented Databases
• NoSQL is a broad term that generally refers to non-relational
databases and encompasses a wide collection of data storage
models, including
• graph databases
• document databases
• key/value data stores
• column-family databases
• HBase is classified as a column-family or column-oriented database, modelled on Google's Bigtable architecture.
• HBase organizes data into tables that contain rows. Within a table, rows are identified by their unique row key, which does not have a data type.
• Row keys are similar to primary keys in relational databases, in that they are automatically indexed.
• In HBase, table rows are sorted by their row key, and because row keys are byte arrays, almost anything can serve as a row key, from strings to binary representations of longs or even serialized data structures.
• HBase stores its data as key/value pairs, where all table lookups are performed via the table's row key, the unique identifier for the stored record data.
• Data within a row is grouped into column families, which consist
of related columns.
• Storing data in columns rather than rows has particular benefits for data warehouses and analytical databases, where aggregates are computed over large sets of data with potentially sparse values, in which not all column values are present.
• Another interesting feature of HBase and BigTable-based column-
oriented databases is that the table cells, or the intersection of row and
column coordinates, are versioned by timestamp.
• HBase is thus also described as being a multidimensional map where
time provides the third dimension
• The time dimension is indexed in decreasing order, so that
when reading from an HBase store, the most recent values are
found first.
• The contents of a cell can be referenced by a {rowkey, column, timestamp} tuple, or we can scan for a range of cell values by time range.
Real-Time Analytics with HBase
• For the purposes of this HBase overview, we define and work with the
HBase shell to design a schema for a linkshare tracker that tracks the
number of times a link has been shared.
• Generating a schema
• When designing schemas in HBase, it’s important to think in terms
of the column-family structure of the data model and how it affects
data access patterns.
• Furthermore, because HBase doesn’t support joins and provides
only a single indexed rowkey, we must be careful to ensure that the
schema can fully support all use cases.
• First, we need to declare the table name, and at least one
column-family name at the time of table definition.
• If no namespace is declared, HBase will use the default
namespace
• The command below creates a single table called linkshare in the default namespace with one column-family, named link
• To alter the table after creation, such as changing or adding column
families, we need to first disable the table so that clients will not be able
to access the table during the alter operation:
hbase> create 'linkshare', 'link'
• Good row key design affects not only how we query the table, but the
performance and complexity of data access.
• By default, HBase stores rows in sorted order by row key, so that
similar keys are stored to the same RegionServer.
• Thus, in addition to enabling our data access use cases, we also need
to be mindful to account for row key distribution across regions.
• For the current example, let’s assume that we will use the unique
reversed link URL for the row key.
hbase> disable 'linkshare'
hbase> alter 'linkshare', 'statistics'
hbase> enable 'linkshare'
• In our linkshare application, we want to store descriptive data about
the link, such as its title, while maintaining a frequency counter that
tracks the number of times the link has been shared.
• We can insert, or put, a value in a cell at the specified table/row/column and, optionally, timestamp coordinates.
• To put a cell value into table linkshare at the row with row key org.hbase.www, under column-family link and column title, marked with the current timestamp:
hbase> put 'linkshare', 'org.hbase.www', 'link:title', 'Apache HBase'
hbase> put 'linkshare', 'org.hadoop.www', 'link:title', 'Apache Hadoop'
hbase> put 'linkshare', 'com.oreilly.www', 'link:title', "O'Reilly.com"
• The put operation works great for inserting a value for a single cell, but for
incrementing frequency counters, HBase provides a special mechanism
to treat columns as counters.
• To increment a counter, we use the command incr instead of put.
• The last option passed is the increment value, which in this case is 1.
• Incrementing a counter will return the updated counter value, but you can
also access a counter’s current value any time using the get_counter
command, specifying the table name, row key, and column:
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:like', 1
• HBase provides two general methods to retrieve data from a table:
• the get command performs lookups by row key to retrieve attributes
for a specific row,
• and the scan command, which takes a set of filter specifications and
iterates over multiple rows based on the indicated specifications.
• In its simplest form, the get command accepts the table name
followed by the row key, and returns the most recent version timestamp
and cell value for columns in the row.
hbase> incr 'linkshare', 'org.hbase.www', 'statistics:share', 1
hbase> get_counter 'linkshare', 'org.hbase.www', 'statistics:share'
hbase> get 'linkshare', 'org.hbase.www'
• The get command also accepts an optional dictionary of parameters to specify the column(s), timestamp, timerange, and version of the cell values we want to retrieve. For example, we can specify the column(s) of interest
• A scan operation is akin to database cursors or iterators, and takes
advantage of the underlying sequentially sorted storage mechanism,
iterating through row data to match against the scanner specifications.
• With scan, we can scan an entire HBase table or specify a range of rows
to scan.
hbase> get 'linkshare', 'org.hbase.www', 'link:title'
hbase> get 'linkshare', 'org.hbase.www', 'link:title', 'statistics:share'
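• For instance, the dictionary form mentioned above can restrict a get to one column and ask for several versions (a sketch; it assumes the column family has been configured to keep more than one version):
hbase> get 'linkshare', 'org.hbase.www', {COLUMN => 'link:title', VERSIONS => 3}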
• You can specify an optional STARTROW and/or STOPROW parameter, which can be used to limit the scan to a specific range of rows.
• If neither STARTROW nor STOPROW are provided, the scan
operation will scan through the entire table.
• You can, in fact, call scan with the table name to display all the
contents of a table.
hbase> scan 'linkshare'
hbase> scan 'linkshare', {COLUMNS => ['link:title'], STARTROW => 'org.hbase.www'}
Introduction to Sqoop
• In cases where the input data is already structured because it resides in a relational database, it is convenient to leverage this known schema to import the data into Hadoop in a more efficient manner than uploading CSVs to HDFS and parsing them manually.
• Sqoop (SQL-to-Hadoop) is designed to transfer data between
relational database management systems (RDBMS) and Hadoop.
• It automates most of the data transfer process by reading the schema
information directly from the RDBMS.
• Sqoop then uses MapReduce to import and export the data to and
from Hadoop.
• Sqoop gives us the flexibility to maintain our data in its production
state while copying it into Hadoop to make it available for further
analysis without modifying the production database.
• We’ll walk through a few ways to use Sqoop to import data from a
MySQL database into various Hadoop data stores, including HDFS,
Hive, and HBase.
• We will use MySQL as the source and target RDBMS for these examples, so we also assume that a MySQL database resides on the same host as your Hadoop/Sqoop services and is accessible via localhost and the default port, 3306.
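• Connectivity to that MySQL instance can be verified before importing, for example (a sketch; the username is a placeholder and -P prompts for the password):
$ sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username sqoop_user -P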
Importing from MySQL to HDFS
• When importing data from relational databases like MySQL, Sqoop
reads the source database to gather the necessary metadata for the
data being imported.
• Sqoop then submits a map-only Hadoop job to transfer the actual table
data based on the metadata that was captured in the previous step.
• This job produces a set of serialized files, which may be delimited text
files, binary format, or SequenceFiles containing a copy of the imported
table or datasets.
• By default, the files are saved as comma-separated files to a directory
on HDFS with a name that corresponds to the source table name.
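• As a sketch of what a basic import command looks like (the database, table, credentials, and target directory below are placeholders, not from the original slides):
$ sqoop import --connect jdbc:mysql://localhost:3306/flightdata \
    --username sqoop_user -P \
    --table airlines \
    --target-dir /user/hduser/airlines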