
Introduction to Hadoop

Introduction to Hadoop, Hive, Spark, HDFS, NiFi, Zeppelin, Ambari and other Hadoop / Apache Big Data Tools. With a focus on pure open source and HDP 2.6 from Hortonworks.

  1. Introduction to Hadoop Tools. Timothy Spann, 2017 Future of Data – Princeton Meetup, May 16, 2017. © Hortonworks Inc. 2011 – 2016. All Rights Reserved.
  2. Agenda
      • 6:00pm – 6:45pm Registration and Food
      • 6:45pm – 7:00pm Introduction and Welcome
      • 7:00pm – 7:45pm Install and Cloud by Milind Pandit
      • 7:45pm – 8:30pm Tools by Tim Spann
      • 8:30pm – 8:45pm Questions
  3. All Your Data Are Belong to Hadoop: CSV, JSON, TSV, TEXT, PDF, XML, HTML, AVRO, PARQUET, ORC, Sequence File, HFile, JPEG, PNG, MP4, RTF, Word, Excel
  4. Apache Big Data Ecosystem
  5. Hadoop Architecture (diagram labels): Data Access Engines; Distributed Reliable Storage; Distributed Compute Framework; Resource Management, Data Locality; Data Operating System; Batch, Interactive, Real-time; Governance & Integration; Security; Applications; Deploy Anywhere
  6. Hortonworks Data Platform: What is Apache Hadoop?
      Ongoing innovation in Apache Hadoop core (HDFS, YARN, MapReduce; Data Mgmt). Timeline labels:
      • Google publishes GFS & MapReduce papers: 2004-2005
      • Yahoo! 2006
      • Yahoo! start focus on multiple Hadoop apps & clusters
      • Contributes Hadoop to Apache 2008
      • Hortonworks: Oct 2011
      • HDP 1.0: Oct 2012
      • HDP 2.0: Oct 2013 (Apache Hadoop 2.2.0; Apache Hadoop v2, YARN)
      • HDP 2.1: April 2014 (Apache Hadoop 2.4.0)
      • HDP 2.2: Dec 2014 (Apache Hadoop 2.6.0)
      • HDP 2.3: July 2015 (Apache Hadoop 2.7.1)
      • HDP 2.4: March 2016 (Apache Hadoop 2.7.1)
      • HDP 2.5: Aug 2016 (Apache Hadoop 2.7.3)
  7. Apache Hadoop = Storage + Compute
      • Storage: Hadoop Distributed File System (HDFS)
      • CPU, RAM: Yet Another Resource Negotiator (YARN)
  8. Core: NameNode (NN, HDFS) holds the directory structure in memory (/directory/structure/in/memory.txt); ResourceManager (RM, YARN) handles resource management + scheduling. Worker Node: DataNode (HDFS) and NodeManager (YARN) provide disk, CPU, memory. Legend: Hadoop daemon, User application.
  9. Hadoop Distributed File System (HDFS): Fault Tolerant Distributed Storage
      • Divide files into big blocks and distribute 3 copies randomly across the cluster
      • Processing Data Locality
      • Not just storage but computation
      Figure: a logical file is split into blocks 1 to 4, and each block appears three times across the cluster.
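The block-and-replica idea on this slide can be sketched in plain Python. This is a toy model under stated assumptions, not HDFS code: the block size, the node names, and the `split_into_blocks` and `place_replicas` helpers are hypothetical, and real HDFS placement takes racks into account rather than sampling uniformly at random.

```python
import random

def split_into_blocks(data, block_size):
    """Cut a logical file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3, seed=None):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}

# Hypothetical 1000-byte file and five hypothetical worker nodes.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes, replication=3, seed=42)

for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on three different nodes, losing one node still leaves two copies of each block it held, which is the fault tolerance the slide refers to.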
  10. HDFS Storage Architecture - Before
      • DataNode is a single storage
      • Storage is uniform: the only storage type is Disk
      • Storage types hidden from the file system
      Diagram: all disks as a single storage
  11. HDFS Storage Architecture - Now (New Architecture)
      • DataNode is a collection of storages
      • Supports different types of storage: Disk, SSDs, Memory
      • Block Storage Policies: describe how to store data blocks in HDFS
      Diagram: collection of tiered storage; Cloud Storage
  12. It Looks Like a File System
  13. WebHDFS
      https://community.hortonworks.com/articles/60480/using-images-stored-in-hdfs-for-web-pages.html
      http://princeton.server.com:50070/webhdfs/v1/demo/clickstream/weblogs/clickstream-feed-generated.tsv?OP=open
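A small sketch of how a WebHDFS URL like the one on this slide is put together. The `webhdfs_url` helper is hypothetical; the host, port, and path are taken from the slide, and the `/webhdfs/v1` prefix with an `op` query parameter follows the WebHDFS REST convention.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, path, op="OPEN", port=50070, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1<path>?op=..."""
    query = urlencode({"op": op, **params})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)

url = webhdfs_url(
    "princeton.server.com",
    "/demo/clickstream/weblogs/clickstream-feed-generated.tsv",
)
print(url)
```

Fetching the file would then be a plain HTTP GET on that URL (for example with `urllib.request.urlopen`), which only works against a reachable cluster with WebHDFS enabled.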
  14. Hadoop emerged as foundation of new data architecture
      Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data.
      • Built by Yahoo! to be the heartbeat of its ad & search business
      • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises
      • Incredibly disruptive to current platform economics
      Traditional Hadoop advantages: manages new data paradigm; handles data at scale; cost effective; open source
      Traditional Hadoop had limitations: batch-only architecture; single-purpose clusters, specific data sets; difficult to integrate with existing investments; not enterprise-grade
      Diagram: Application; Storage (HDFS); Batch Processing (MapReduce)
  15. YARN extends Hadoop into data center leaders
      YARN: The Architectural Center of Hadoop
      • Common data platform, many applications
      • Support multi-tenant access & processing
      • Batch, interactive & real-time use cases
      • Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
      YARN Ready Applications: facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing "YARN Ready" solutions
      Diagram: Batch, Interactive & Real-Time Data Access on YARN (Data Operating System) over HDFS (Hadoop Distributed File System, Data Management, nodes 1 to N). Engines: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines); Tez and Slider appear as shared layers.
  16. Overview of SQL on Hadoop Solutions
      • Spark SQL: Spark's module for working with structured data. Run SQL queries alongside complex analytic algorithms.
      • Apache Hive: a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
      • Apache Phoenix: high performance relational database layer over HBase for low latency applications.
      • Traditional MPP on Hadoop: many traditionally architected MPP solutions have been ported to Hadoop and some new ones have been developed from scratch.
  17. SQL on Hadoop: Vitals
      Project | First GA Release | Lines of Code (June 2015) (*) | Most Typical Use
      Apache Hive | April, 2009 (7 Years) | 1 Million | EDW / ETL Offload
      SparkSQL | March, 2015 (4 Months) | 56.6k | Exploratory Analytics
      Apache Phoenix | March, 2014 (2 Year) | 200k | Low-Latency Dashboards
  18. Apache Hive: Fast Facts
      • Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
      • Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
      • Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
      • Largest Cluster: 4,500+ Nodes (Yahoo)
  19. Phoenix and HBase: Fast Facts
      • Largest Database: 5 Petabytes (Flurry)
      • Best Known App: Facebook Messages (Facebook)
      • Fastest Ingestion: 10 Million Events/s (Yahoo)
      • Biggest SQL App: Real-Time SQL on 140m+ Records (PubMatic)
  20. Apache Hive: Journey to SQL:2011 Analytics (legend: Hive 1.0, Hive 1.2, Future)
      Data Types
      • Numeric: FLOAT/DOUBLE, DECIMAL, INT/TINYINT/SMALLINT/BIGINT, BOOLEAN
      • String: CHAR / VARCHAR, STRING, BINARY
      • Date, Time: DATE, TIMESTAMP, Interval Types
      • Complex Types: ARRAY, MAP, STRUCT, UNION
      SQL Features
      • Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated and Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT; INTERSECT, EXCEPT; Non-Equality Joins
      • Advanced Analytics: OLAP and Windowing Functions; CUBE and Grouping Sets
      • Nested Data Analytics: Nested Data Traversal; Lateral Views
      • ACID Transactions: INSERT / UPDATE / DELETE; MERGE
      File Formats
      • Columnar: ORCFile, Parquet
      • Text: CSV, Logfile
      • Nested / Complex: Avro, JSON, XML
      • Custom Formats
      Other Features
      • XPath Analytics
      • Procedural Extensions (PL/Hive)
  21. SQL with Hive (Data Access)
      Apache Hive facilitates querying and managing large datasets. Hive provides SQL on Hadoop. Data analysts use Hive to explore, structure, and analyze that data using familiar SQL syntax. Hive also comes with HCatalog, a global metadata management layer that exposes Hive table metadata to all other Hadoop applications.
      Diagram labels: Operations, Security, Governance, Storage
  22. Zeppelin Notebook (Tools)
      Apache Zeppelin is a Web-based notebook that enables interactive data analytics. Data scientists and end users alike can make beautiful data-driven, interactive, and collaborative documents with SparkSQL, Scala, Python, JDBC connections, files, and more. Notebooks contain code samples, source data, descriptive markup, result sets, and rich visualizations.
  23. (No text on this slide.)
  24. Ambari Views (Tools)
      Ambari Views are a built-in set of Views that are pre-deployed for you to use with your cluster. These GUI components increase ease of use for end users. Current Ambari Views include Hive, Pig, Tez, Capacity Scheduler, File, HDFS. The Ambari Views Framework allows developers to create new user interface components that plug into the Ambari web interface.
  25. Ambari Views (Tools)
  26. Apache Hive Architecture (diagram labels)
      • Client: beeline, Ambari Hive View, Zeppelin, BI of choice; SQL over JDBC/ODBC
      • Interpreter: Hive Server 2 (compile, optimize, execute)
      • Schema definitions: Hive MetaStore (database; Table 1, Partition 1; Table 2, Partition 2)
      • Distribution engine: TEZ / MR
      • Data storage: HDFS; data in HDFS can be structured, unstructured, or semi-structured
  27. Submitting Hive Queries
      • Hive CLI: traditional Hive client that connects to a HiveServer instance
          $ hive
          hive>
      • Beeline: a new command line client that connects to a HiveServer2 instance
          $ beeline
          beeline> !connect jdbc:hive2://hostname:10000 username password org.apache.hive.jdbc.HiveDriver
  28. Best Practices for Data Loading - Create External Table, Create ORC Table
      hdfs dfs -copyFromLocal cars.csv /visdata

      CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
        Name STRING, Miles_per_Gallon INT, Cylinders INT, Displacement INT,
        Horsepower INT, Weight_in_lbs INT, Acceleration DECIMAL, Year DATE, Origin CHAR(1))
      COMMENT 'Data'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/visdata';

      CREATE TABLE IF NOT EXISTS mycars(
        Name STRING, Miles_per_Gallon INT, Cylinders INT, Displacement INT,
        Horsepower INT, Weight_in_lbs INT, Acceleration DECIMAL, Year DATE, Origin CHAR(1))
      COMMENT 'Data'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS ORC;

      INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
  29. Data Formats (diagram axis: More Flexible to Better Storage and Performance)
  30. What Is Apache Spark?
      • Apache open source project originally developed at AMPLab (University of California Berkeley)
      • Unified data processing engine that operates across varied data workloads and platforms
  31. Why Apache Spark?
      • Elegant Developer APIs: single environment for data munging, data wrangling, and Machine Learning (ML)
      • In-memory computation model: fast, and effective for iterative computations and ML
      • Machine Learning: implementation of distributed ML algorithms; Pipeline API (Spark ML)
  32. Spark components: Spark SQL (Structured Data), Spark Streaming (Near Real-time), Spark MLlib (Machine Learning), GraphX (Graph Analysis)
  33. DataFrames
      • Distributed collection of data organized into named columns
      • Conceptually equivalent to a table in a relational DB or a data frame in R/Python
      • API available in Scala, Java, Python, and R
      Diagram: a DataFrame with columns Col1, Col2, … ColN and rows. Data is described as a DataFrame with rows, columns, and a schema.
  34. Sources: CSV, Avro, Hive, and JSON feed Spark SQL, which produces a DataFrame (columns Col1, Col2, … ColN; rows)
  35. Create a DataFrame (example)
      val path = "examples/flights.json"
      val flights = spark.read.json(path)
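To make "rows, columns, and a schema" concrete, here is a plain-Python sketch of what loading line-delimited JSON into named columns involves. It is not the Spark API: `read_json_lines` and `infer_schema` are hypothetical helpers, and the sample records are invented stand-ins for examples/flights.json, whose contents the slides do not show.

```python
import json

def read_json_lines(text):
    """Parse JSON Lines text (one JSON object per line) into a list of row dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def infer_schema(rows):
    """Map each column name to the sorted Python type names seen in that column."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in schema.items()}

# Hypothetical sample records; the field names are assumptions.
sample = '''
{"origin": "EWR", "dest": "SFO", "delay": 12}
{"origin": "JFK", "dest": "LAX", "delay": 0}
'''
rows = read_json_lines(sample)
schema = infer_schema(rows)
print(schema)
print([row["origin"] for row in rows])  # select one named column
```

Spark does the same kind of schema inference when reading JSON, but distributes the rows across the cluster instead of holding them in one Python list.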
  36. Rapid Ecosystem Adoption: 210+ Processors
      HTTP, Syslog, Email, HTML, Image, Hash, Encrypt, Extract, Tail, Merge, Evaluate, Duplicate, Execute, Scan, GeoEnrich, Replace, Convert, Split, Translate, HL7, FTP, UDP, XML, SFTP, Route Content, Route Context, Route Text, Control Rate, Distribute Load, AMQP
  37. FlowFiles are like HTTP data
      HTTP Data
        Header:
          HTTP/1.1 200 OK
          Date: Sun, 10 Oct 2010 23:26:07 GMT
          Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
          Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
          ETag: "45b6-834-49130cc1182c0"
          Accept-Ranges: bytes
          Content-Length: 13
          Connection: close
          Content-Type: text/html
        Content:
          Hello world!
      FlowFile
        Header, Standard FlowFile Attributes:
          Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
          Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
          Key: 'fileSize' Value: '23609'
        Header, FlowFile Attribute Map Content:
          Key: 'filename' Value: '15650246997242'
          Key: 'path' Value: './'
        Content:
          Binary Content *
  38. http://hortonworks.com/blog/hdf-2-0-flow-processing-real-time-tweets-strata-hadoop-slack-tensorflow-phoenix-zeppelin/
  39. Quick Terms and Reference
      FlowFile: Each piece of "User Data" (i.e., data that the user brings into NiFi for processing and distribution) is referred to as a FlowFile. A FlowFile is made up of two parts: Attributes and Content. The Content is the User Data itself. Attributes are key-value pairs that are associated with the User Data.
      Processor: The Processor is the NiFi component that is responsible for creating, sending, receiving, transforming, routing, splitting, merging, and processing FlowFiles. It is the most important building block available to NiFi users to build their dataflows.
      https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
      https://nifi.apache.org/docs/nifi-docs/html/overview.html
      http://www.slideshare.net/aldrinpiri/apache-nifi-crash-course-san-jose-hadoop-summit-66967077
      https://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi/
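The two definitions above can be sketched as a toy data structure and a toy processor. This is an illustration only, not NiFi's Java API: the `FlowFile` class and `update_attribute` function are hypothetical, and the attribute values are borrowed from the FlowFile slide.

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Toy FlowFile: the content bytes plus key-value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)

def update_attribute(flowfile, **new_attributes):
    """Toy processor: return a copy with extra attributes; content is untouched."""
    merged = {**flowfile.attributes, **new_attributes}
    return FlowFile(content=flowfile.content, attributes=merged)

ff = FlowFile(b"Hello world!", {"filename": "15650246997242", "path": "./"})
ff2 = update_attribute(ff, fileSize=str(len(ff.content)))
print(ff2.attributes)
```

Keeping attributes separate from content is what lets a flow make routing and metadata decisions without reading or copying the payload.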
  40. Input: InvokeHttp, GetTwitter
  41. Processing: RouteOnAttribute, ExecuteStreamCommand, UpdateAttribute
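As a rough picture of what attribute-based routing does, the sketch below sends a set of attributes to a named relationship. It is a simplification under stated assumptions: the real RouteOnAttribute processor evaluates NiFi Expression Language rules rather than Python functions, this version stops at the first matching rule, and the rule names and attribute values are hypothetical.

```python
def route_on_attribute(attributes, rules):
    """Return the name of the first rule whose predicate matches, else 'unmatched'."""
    for relationship, predicate in rules.items():
        if predicate(attributes):
            return relationship
    return "unmatched"

# Hypothetical rules and attribute names, for illustration only.
rules = {
    "images": lambda a: a.get("filename", "").endswith(".jpg"),
    "large": lambda a: int(a.get("fileSize", "0")) > 10000,
}

print(route_on_attribute({"filename": "drone.jpg", "fileSize": "23609"}, rules))
print(route_on_attribute({"filename": "notes.txt", "fileSize": "23609"}, rules))
print(route_on_attribute({"filename": "notes.txt", "fileSize": "12"}, rules))
```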
  42. Output
      PutHDFS: The NiFi box needs access to your Hadoop HDFS and this configuration file: /etc/hadoop/conf/core-site.xml. Also create a directory to use, for example: hdfs dfs -mkdir /nifi-place
      PutSQL: Create a connection pool; know your JDBC information.
        1. Phoenix: URL = jdbc:phoenix:clusterzookeeper:2181:/hbase-unsecure, driver = org.apache.phoenix.jdbc.PhoenixDriver, JAR = file:///opt/demo/phoenix-client.jar, user = root, pool = 2. You will need the JDBC JAR on the local file system.
        2. MySQL: URL = jdbc:mysql://tspanndev11.field.hortonworks.com:3306/data, driver = com.mysql.jdbc.Driver, JAR = /usr/share/java/mysql-connector-java.jar
      PutSlack: Get your webhook URL from your Slack site. You can go to slack.com and get your own free room to test with. https://api.slack.com/incoming-webhooks
  43. Local TensorFlow via Python or C++ Binary
      python classify_image.py --image_file /opt/demo/dronedataold/Bebop2_20160920083655-0400.jpg
        solar dish, solar collector, solar furnace (score = 0.98316)
        window screen (score = 0.00196)
        manhole cover (score = 0.00070)
        radiator (score = 0.00041)
        doormat, welcome mat (score = 0.00041)
      bazel-bin/tensorflow/examples/label_image/label_image --image=/opt/demo/dronedataold/Bebop2_20160920083655-0400.jpg
        tensorflow/examples/label_image/main.cc:204] solar dish (577): 0.983162
        I tensorflow/examples/label_image/main.cc:204] window screen (912): 0.00196204
        I tensorflow/examples/label_image/main.cc:204] manhole cover (763): 0.000704005
        I tensorflow/examples/label_image/main.cc:204] radiator (571): 0.000408321
        I tensorflow/examples/label_image/main.cc:204] doormat (972): 0.000406186
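If the Python classifier's text output needs to be turned into structured values (for example, to keep only the top label), a small parser along these lines would do. The `parse_classification` helper and its regular expression are hypothetical; they assume the exact `label, label (score = x)` line format shown on the slide.

```python
import re

# Matches lines such as: window screen (score = 0.00196)
LINE = re.compile(r"^(?P<labels>.+?) \(score = (?P<score>[0-9.]+)\)$")

def parse_classification(output):
    """Turn classifier output lines into (list_of_labels, score) pairs."""
    results = []
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            labels = [name.strip() for name in match.group("labels").split(",")]
            results.append((labels, float(match.group("score"))))
    return results

sample = """solar dish, solar collector, solar furnace (score = 0.98316)
window screen (score = 0.00196)
manhole cover (score = 0.00070)"""
results = parse_classification(sample)
best_labels, best_score = max(results, key=lambda pair: pair[1])
print(best_labels[0], best_score)
```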
  44. Local Sentiment Analysis via Python
      /opt/demo/sentiment/run.sh:
        python /opt/demo/sentiment/sentiment.py "$@"
      /opt/demo/sentiment/sentiment.py:
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        import sys
        sid = SentimentIntensityAnalyzer()
        ss = sid.polarity_scores(sys.argv[1])
        print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(
            ss['compound'], ss['neg'], ss['neu'], ss['pos']))
      or, to print a single label instead of the scores:
        if ss['compound'] == 0.00:
            print('Neutral')
        elif ss['compound'] < 0.00:
            print('Negative')
        else:
            print('Positive')
  45. Installing NLTK for Python 2.7
      https://pip.pypa.io/en/latest/installing/
      http://www.nltk.org/install.html
        wget https://bootstrap.pypa.io/get-pip.py
        python get-pip.py
        sudo pip install -U nltk
        sudo pip install -U numpy
      Installing TensorFlow for Python 2.7
      Installing TensorFlow is a difficult exercise; once NLTK is installed you can start the process. You will need most of the development tools for Python, C, and C++, plus Bazel, pip, and more. A beefy machine with a lot of RAM, CPUs, and GPUs would be useful. Check out my install article for a guide: https://dzone.com/articles/deep-learning-resources
  46. Results
  47. Results
  48. Installation
      Download the binary from http://hortonworks.com/downloads/#dataflow or https://nifi.apache.org/download.html, or on Mac: brew install nifi
      https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#starting-nifi
        bin/nifi.sh start
        bin/nifi.sh install (now it's installed as a service on Linux)
  49. Contact: Timothy Spann, @PaaSDeV
      www.meetup.com/futureofdata-princeton
      community.hortonworks.com/users/9304/tspann.html
      http://www.coreservlets.com/hadoop-tutorial/
      https://www.slideshare.net/HadoopSummit/hadoop-crash-course
  50. Hortonworks Community Connection: read access for everyone, join to participate and be recognized
      • Full Q&A Platform (like StackOverflow)
      • Knowledge Base Articles
      • Code Samples and Repositories
  51. Community Engagement: participate now at community.hortonworks.com. 4,000+ Registered Users, 10,000+ Answers, 15,000+ Technical Assets, One Website!