This document outlines steps to analyze tweets using Flume, Hadoop, and Hive. It describes installing Flume and a jar file, creating a Twitter application to get API keys, configuring an Flume agent to fetch tweets from Twitter and store in HDFS, and using Hive to analyze the tweets by user followers count. The workshop instructions provide commands to stream tweets from Twitter to HDFS using Flume, view the stored tweets files, and run queries in Hive to find the user with the most followers.
This workshop is for a "Big Data using Hadoop course" at IMC Institute in March 2015. The workshop is based on Apache Hadoop and using an EC2 server on AWS.
This workshop is for a "Big Data using Hadoop course" at IMC Institute in March 2015. The workshop is based on Apache Hadoop and using an EC2 server on AWS.
Interview questions on Apache spark [part 2]knowbigdata
This is Apache Spark Question & Answer Tutorial.
We provide training on Big Data & Hadoop,Hadoop Admin ,MongoDB,Data Analytics with R, Python..etc
Our Big Data & Hadoop course consists of Introduction of Hadoop and Big Data,HDFS architecture ,MapReduce ,YARN ,PIG Latin ,Hive,HBase,Mahout,Zookeeper,Oozie,Flume,Spark,Nosql with quizzes and assignments.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
API analytics with Redis and Google Bigquery. NoSQL matters editionjavier ramirez
At teowaki we have a system for API use analytics using Redis as a fast intermediate store and bigquery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in a few seconds and we can try and find for usage patterns that wouldn’t be obvious otherwise. In this session I will speak of the alternatives we evaluated and how we are using Redis and Bigquery to solve our problem.
New developments in open source ecosystem spark3.0 koalas delta lakeXiao Li
In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.
Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.
How does that PySpark thing work? And why Arrow makes it faster?Rubén Berenguel
Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data was moving up and down from Python to Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as well as the constant improvement of the optimisers (Catalyst and Tungsten). But, with Spark 2.3 PySpark has speed up tremendously thanks to the (still experimental) addition of the Arrow serialisers.
In this talk we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to find how Python interacts with the Scala core, and some of the internals of Pandas to see how data moves from Python to Scala via Arrow.
https://github.com/rberenguel/pyspark-arrow-pandas
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
Quick overview of programming Apache Hadoop with R. Jonathan Seidman's sample code allows a quick comparison of several packages followed by a real example using RHadoop's rmr package. Our example demonstrates using compound (vs. single-field) keys and values and shows the data coming into and out of our mapper and reducer functions.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on github.
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
Interview questions on Apache spark [part 2]knowbigdata
This is Apache Spark Question & Answer Tutorial.
We provide training on Big Data & Hadoop,Hadoop Admin ,MongoDB,Data Analytics with R, Python..etc
Our Big Data & Hadoop course consists of Introduction of Hadoop and Big Data,HDFS architecture ,MapReduce ,YARN ,PIG Latin ,Hive,HBase,Mahout,Zookeeper,Oozie,Flume,Spark,Nosql with quizzes and assignments.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
API analytics with Redis and Google Bigquery. NoSQL matters editionjavier ramirez
At teowaki we have a system for API use analytics using Redis as a fast intermediate store and bigquery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in a few seconds and we can try and find for usage patterns that wouldn’t be obvious otherwise. In this session I will speak of the alternatives we evaluated and how we are using Redis and Bigquery to solve our problem.
New developments in open source ecosystem spark3.0 koalas delta lakeXiao Li
In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.
Apache Storm is a free and open source, distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Its effective stream processing capabilities are trusted by Twitter and Yahoo for quickly extracting insights from their Big Data.
How does that PySpark thing work? And why Arrow makes it faster?Rubén Berenguel
Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data was moving up and down from Python to Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as well as the constant improvement of the optimisers (Catalyst and Tungsten). But, with Spark 2.3 PySpark has speed up tremendously thanks to the (still experimental) addition of the Arrow serialisers.
In this talk we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to find how Python interacts with the Scala core, and some of the internals of Pandas to see how data moves from Python to Scala via Arrow.
https://github.com/rberenguel/pyspark-arrow-pandas
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
Quick overview of programming Apache Hadoop with R. Jonathan Seidman's sample code allows a quick comparison of several packages followed by a real example using RHadoop's rmr package. Our example demonstrates using compound (vs. single-field) keys and values and shows the data coming into and out of our mapper and reducer functions.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on github.
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
Hadoop or Spark: is it an either-or proposition? An exodus away from Hadoop to Spark is picking up steam in the news headlines and talks! Away from marketing fluff and politics, this talk analyzes such news and claims from a technical perspective.
In practical ways, while referring to components and tools from both Hadoop and Spark ecosystems, this talk will show that the relationship between Hadoop and Spark is not of an either-or type but can take different forms such as: evolution, transition, integration, alternation and complementarity.
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SMEIMC Institute
เอกสารบรรยายงานสัมมนา Cloud Computing
New generation of SMEs Management for ASEAN Economic Community (AEC) by using Cloud Computing Technology วันเสาร์ที่ 28 กุมภาพันธ์ 2558 เวลา 12.30 – 15.45น.
ณ โรงแรม The Emerald Hotel-Bangkok
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
This presentation contains brief description about big data along with that hadoop installation, configuration and MapReduce wordcount program and its explanation.
HDFS tiered storage: mounting object stores in HDFSDataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to HDFS allowing mounting external namespaces, both object stores and other HDFS clusters. Building on this functionality, we can now allow remote namespaces to be synchronized with HDFS, enabling asynchronous writes to the remote storage and the possibility to synchronously and transparently read data back to a local application wanting to access file data which is stored remotely. This talk, which corresponds to the work in progress under HDFS-12090, will present how the Hadoop admin can manage storage tiering between clusters and how that is then handled inside HDFS through the snapshotting mechanism and asynchronously satisfying the storage policy.
Speaker
Thomas Demoor, Object Storage Architect, Western Digital
Ewan Higgs, Software Engineer, Western Digital
Slides for talk presented at Boulder Java User's Group on 9/10/2013, updated and improved for presentation at DOSUG, 3/4/2014
Code is available at https://github.com/jmctee/hadoopTools
This presentation covers web filtering with Squid and DansGuardian, proxy auto-detection, router access control, computer time limits and access control for applications.
Terraform is an infrastructure-as-code software solution that enables businesses to safely and predictably create, change, and improve infrastructure. It is all the rage right now in the DevOps community. In this presentation, experts from HG Data share some hands-on experiences and introduce some practical applications of terraform beyond the traditional “Hello World” example. This presentation covers evolving Terraform projects, implementing behavioral testing, and tips to achieve blue/green deployment zen.
This presentation covers deploying Django in production server. The topics that are mainly covered includes
O Environment setup
O Server configuration
O Deploying in AWS
This Hadoop HDFS Tutorial will unravel the complete Hadoop Distributed File System including HDFS Internals, HDFS Architecture, HDFS Commands & HDFS Components - Name Node & Secondary Node. Not only this, even Mapreduce & practical examples of HDFS Applications are showcased in the presentation. At the end, you'll have a strong knowledge regarding Hadoop HDFS Basics.
Session Agenda:
✓ Introduction to BIG Data & Hadoop
✓ HDFS Internals - Name Node & Secondary Node
✓ MapReduce Architecture & Components
✓ MapReduce Dataflows
----------
What is HDFS? - Introduction to HDFS
The Hadoop Distributed File System provides high-performance access to data across Hadoop clusters. It forms the crux of the entire Hadoop framework.
----------
What are HDFS Internals?
HDFS Internals are:
1. Name Node – This is the master node from where all data is accessed across various directores. When a data file has to be pulled out & manipulated, it is accessed via the name node.
2. Secondary Node – This is the slave node where all data is stored.
----------
What is MapReduce? - Introduction to MapReduce
MapReduce is a programming framework for distributed processing of large data-sets via commodity computing clusters. It is based on the principal of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This ensures a faster, secure & scalable solution. Mapreduce commands are based in Java.
----------
What are HDFS Applications?
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
1. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1
Analyse Tweets using Flume 1.4,
Hadoop 2.7 and Hive
May 2015
Dr.Thanachart Numnonda
Certified Java Programmer
thanachart@imcinstitute.com
Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com
2. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Lecture: Understanding Flume
3. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Introduction
Apache Flume is:
●
A distributed data transport and aggregation system for
event- or log-structured data
●
Principally designed for continuous data ingestion into
Hadoop… But more flexible than that
4. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Architecture Overview
odiago
5. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Flume terminology
●
Every machine in Flume is a node
●
Each node has a source and a sink
●
Some sinks send data to collector nodes, which
aggregate data from many agents before writing to HDFS
●
All Flume nodes heartbeat to/receive config from master
●
Events enter Flume within seconds of generation
Odiago
6. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Flume isn’t an analytic system
●
No ability to inspect
message bodies
●
No notion of aggregates,
rolling counters, etc
Odiago
7. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Hands-On: Loading Twitter Data to
Hadoop HDFS
8. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Exercise Overview
Hive.apache.org
10. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
1. Installing Flume (cont.)
Edit $HOME ./bashrc
$ sudo vi $HOME/.bashrc
$ exec bash
11. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
2. Installing a jar file
$ wget http://files.cloudera.com/samples/flume-sources-1.0-
SNAPSHOT.jar
$ sudo mv flume-sources-1.0-SNAPSHOT.jar
/usr/local/flume/lib/
$ cd /usr/local/flume/conf/
$ sudo cp flume-env.sh.template flume-env.sh
$ sudo vi flume-env.sh
Copy a jar file and edit conf file
12. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App
Login to your Twitter @ twitter.com
13. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Create a new Twitter App @ apps.twitter.com
14. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Enter all the details in the application:
15. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Your application will be created:
16. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Click on Keys and Access Tokens:
17. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Click on Keys and Access Tokens:
18. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
3. Create a new Twitter App (cont.)
Your Access token got created:
19. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
4. Configuring the Flume Agent
Copy the flume.conf file from the following url:
https://github.com/cloudera/cdh-twitter-
example/blob/master/flume-sources/flume.conf
$ vi /usr/local/flume/conf/flume.conf
flume.conf file
20. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Fixing Bug for running Flume1.4 on Hadoop 2.x
Need to remove file guava-10.0.1.jar and
protobuf-java-2.4.1.jar
$ cd /usr/local/flume/rm
$ rm guava-10.0.1.jar
$ rm protobuf-java-2.4.1.jar
21. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
5. Fetching the data from twitter
$ flume-ng agent -n TwitterAgent -c conf -f
/usr/local/flume/conf/flume.conf
Wait for 60-90 seconds and let flume stream the data on
HDFS, then press Ctrl-c to break the command and stop the
streaming. (Ignore the exceptions)
22. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
6. View the straming data
$ hdfs dfs -ls /user/flume/tweets
$ hdfs dfs -cat /user/flume/tweets/FlumeData.1431058050787
$ hdfs dfs -rm /user/flume/tweets/*.tmp
23. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
7. Analyse data using Hive
$ wget
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
$ mv hive-serdes-1.0-SNAPSHOT.jar /usr/local/apache-hive-
1.1.0-bin/lib/
$ hive
hive> ADD JAR /usr/local/apache-hive-1.1.0-bin/lib/hive-
serdes-1.0-SNAPSHOT.jar;
Get a Serde Jar File for parsing JSON file
Register the Jar file.
24. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
7. Analyse data using Hive (cont.)
Running the following hive command
http://www.thecloudavenue.com/2013/03/analyse-tweets-using-flume-hadoop-and.html
25. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
7. Analyse data using Hive (cont)
hive> elect user.screen_name, user.followers_count c from
tweets order by c desc;
Finding user who has the most number of followers
26. Danairat T., , danairat@gmail.com: Thanachart Numnonda, thanachart@imcinstitute.com Apr 2015Big Data using Hadoop workshop
Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute