FI-WARE Cosmos

Webinar documentation on Cosmos, the implementation of the Big Data Generic Enabler of FI-WARE.

Transcript of "FI-WARE Cosmos"

  1. Open APIs for Open Minds. Building your first application using FI-WARE: Cosmos, the Big Data GE implementation.
  2. Big Data and Open Data: what they are and how much data there is.
  3. Big Data and Open Data: open data and big data (image: http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg).
  4. How much data is there?
  5. Data growth forecast (source: http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast):

                                            2012   2017
     Global users (billions)                2.3    3.6
     Global networked devices (billions)    12     19
     Global broadband speed (Mbps)          11.3   39
     Global traffic (zettabytes)            0.5    1.4
  6. How to deal with it: the Hadoop reference.
  7. Hadoop was created by Doug Cutting at Yahoo!… based on the MapReduce patent by Google.
  8. Well, MapReduce was really invented by Julius Caesar: divide et impera* (* divide and conquer).
  9. An example: how many pages are written in Latin among the books in the Ancient Library of Alexandria? Three mappers each pick a book from the shelf (LATIN REF1 P45, GREEK REF2 P128, EGYPTIAN REF3 P12); five books are still queued (LATIN REF4 P73, LATIN REF5 P34, EGYPTIAN REF6 P10, GREEK REF7 P20, GREEK REF8 P230). The first mapper finds a Latin book and emits "LATIN pages 45"; the reducer records 45 (ref 1) while the other mappers are still reading.
  10. The mappers holding Greek and Egyptian books emit nothing, since only Latin pages count; the reducer still holds 45 (ref 1).
  11. The mappers move on to the queued books and emit "LATIN pages 73" and "LATIN pages 34"; the reducer accumulates 45 (ref 1) + 73 (ref 4) + 34 (ref 5).
  12. Only Greek books (REF7 P20, REF8 P230) remain; their mappers emit nothing, the rest go idle, and the reducer keeps 45 + 73 + 34.
  13. All mappers are idle; the reducer emits the total: 45 + 73 + 34 = 152 Latin pages.
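  Not part of the original deck: a plain-Java sketch of the same computation over the eight books shown above, mimicking the map phase (emit (language, pages) pairs) and the reduce phase (sum per language) with the Streams API. Class and record names are illustrative, and it needs Java 16+ for the record syntax.

     import java.util.List;
     import java.util.Map;
     import java.util.stream.Collectors;

     public class AlexandriaExample {
         /* one record per book: its language tag and its page count */
         record Book(String language, int pages) {}

         public static void main(String[] args) {
             List<Book> library = List.of(
                 new Book("LATIN", 45), new Book("GREEK", 128),
                 new Book("EGYPTIAN", 12), new Book("LATIN", 73),
                 new Book("LATIN", 34), new Book("EGYPTIAN", 10),
                 new Book("GREEK", 20), new Book("GREEK", 230));

             /* "map": key each book by its language; "reduce": sum pages per key */
             Map<String, Integer> pagesByLanguage = library.stream()
                 .collect(Collectors.groupingBy(Book::language,
                          Collectors.summingInt(Book::pages)));

             System.out.println("Latin pages: " + pagesByLanguage.get("LATIN")); // 45+73+34 = 152
         } // main
     } // AlexandriaExample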
  14. Hadoop architecture (diagram showing the head node).
  15. FI-WARE proposal: Cosmos Big Data.
  16. What is Cosmos?
     • Cosmos is Telefónica's Big Data and Open Data asset.
     • Cosmos is Hadoop ecosystem-based:
        • HDFS as its distributed file system
        • Hadoop core as its MapReduce engine
        • HiveQL and Pig for querying the data
        • Oozie as remote MapReduce jobs and Hive launcher
     • Plus other proprietary features:
        • Dynamic creation of private computing clusters as a service
        • Infinity, a cluster for persistent storage
        • Infinity protocol (secure WebHDFS)
        • Cygnus, an injector for context data coming from Orion CB
  17. Cosmos architecture (diagram).
  18. Cluster services: from WebHDFS to Cygnus.
  19. Storage services within the Infinity cluster (diagram).
  20. Computing services within a private cluster (diagram).
  21. Cygnus, or how to inject context data from Orion CB:
     • https://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/How_to_persist_Orion_data_in_Cosmos
     • https://github.com/telefonicaid/fiware-connectors/tree/develop/flume
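  As a sketch (not in the original slides) of the kind of subscription that makes Orion CB notify Cygnus, assuming the NGSIv1 subscribeContext operation and a Cygnus listener on port 5050; the entity, attribute, hosts and notification path are placeholders, so follow the wiki page above for the authoritative steps:

     POST http://<ORION_HOST>:1026/v1/subscribeContext
     Content-Type: application/json
     Accept: application/json

     {
       "entities": [ { "type": "Room", "isPattern": "false", "id": "Room1" } ],
       "attributes": [ "temperature" ],
       "reference": "http://<CYGNUS_HOST>:5050/notify",
       "duration": "P1M",
       "notifyConditions": [ { "type": "ONCHANGE", "condValues": [ "temperature" ] } ]
     }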
  22. Cosmos open datasets: powered by Smart Cities.
  23. Open Datasets in Cosmos (http://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/FI-WARE_open_datasets_central):

     Source        Origin      Dataset                  Data type   Notes
     SmartCities   Málaga      Plagues tracking         Historical
                   Santander   Smart Santander          Sensoring   Data coming through Orion Context Broker
                   Santander   Parque de las Llamas     Sensoring
                   Sevilla     Bikes renting            Historical
                   Sevilla     Water metering           Historical
                   Sevilla     Census                   Historical
                   Sevilla     Infrastructures          Historical
                   Zaragoza    Air quality              Historical
     Other         Twitter     FI-WARE-related tweets   Streaming
  24. How to create clusters: getting your Roman legion.
  25. Using the RESTful API (1).
  26. Using the RESTful API (2).
  27. Using the RESTful API (3).
  28. Using the CLI
     • Creating a cluster
       $ cosmos create --name <STRING> --size <INT>
     • Listing all the clusters
       $ cosmos list
     • Showing a cluster's details
       $ cosmos show <CLUSTER_ID>
     • Connecting to the Head Node of a cluster
       $ cosmos ssh <CLUSTER_ID>
     • Terminating a cluster
       $ cosmos terminate <CLUSTER_ID>
     • Listing available services
       $ cosmos list-services
     • Creating a cluster with specific services
       $ cosmos create --name <STRING> --size <INT> --services <SERVICES_LIST>
  29. How to exploit the data: an incremental approach.
  30. Let's go step by step…
     1. Familiarize yourself with the Hadoop file system commands
     2. Learn how to use the WebHDFS/HttpFS REST API
     3. Play with the local Hive CLI
     4. Write your own remote Hive client
     5. Write your first MapReduce applications
     6. Use Oozie to remotely launch MR and Hive tasks
  31. 1. Hadoop filesystem commands
     • Hadoop general command
       $ hadoop
     • Hadoop file system subcommand
       $ hadoop fs
     • Hadoop file system options
       $ hadoop fs -ls
       $ hadoop fs -mkdir <hdfs-dir>
       $ hadoop fs -rmr <hdfs-file>
       $ hadoop fs -cat <hdfs-file>
       $ hadoop fs -put <local-file> <hdfs-dir>
       $ hadoop fs -get <hdfs-file> <local-dir>
     • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
  32. 2. WebHDFS/HttpFS REST API
     • List a directory
       GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS
     • Create a new directory
       PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]
     • Delete a file or directory
       DELETE http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE[&recursive=<true|false>]
     • Rename a file or directory
       PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PATH>
     • Concat files
       POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CONCAT&sources=<PATHS>
     • Set permission
       PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION[&permission=<OCTAL>]
     • Set owner
       PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<USER>][&group=<GROUP>]
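  Not in the original deck: a minimal Java sketch of the LISTSTATUS call above, assuming plain WebHDFS/HttpFS with user.name pseudo-authentication; the host, port (14000 is only the usual HttpFS default), path and user are placeholders.

     import java.io.BufferedReader;
     import java.io.InputStreamReader;
     import java.net.HttpURLConnection;
     import java.net.URL;

     public class WebHdfsList {
         public static void main(String[] args) throws Exception {
             // build the LISTSTATUS request for the given HDFS path
             URL url = new URL("http://cosmos.example.org:14000/webhdfs/v1/user/myuser?op=LISTSTATUS&user.name=myuser");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             conn.setRequestMethod("GET");

             // the answer is a JSON document with one FileStatus object per entry
             try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                 String line;
                 while ((line = in.readLine()) != null) {
                     System.out.println(line);
                 } // while
             } // try
         } // main
     } // WebHdfsList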
  33. 2. WebHDFS/HttpFS REST API (cont.)
     • Create a new file with initial content (2-step operation)
       PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>]
       HTTP/1.1 307 TEMPORARY_REDIRECT
       Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
       Content-Length: 0
       PUT -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
     • Append to a file (2-step operation)
       POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersize=<INT>]
       HTTP/1.1 307 TEMPORARY_REDIRECT
       Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
       Content-Length: 0
       POST -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
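  Not in the original deck: a Java sketch of the two-step CREATE above, assuming plain WebHDFS with user.name pseudo-authentication; host, port, paths and file names are placeholders. The point is that the first PUT carries no data and must not be followed automatically, since the body goes to the URL returned in the Location header.

     import java.io.FileInputStream;
     import java.io.OutputStream;
     import java.net.HttpURLConnection;
     import java.net.URL;

     public class WebHdfsCreate {
         public static void main(String[] args) throws Exception {
             // step 1: ask where to write; handle the 307 redirect manually
             URL url = new URL("http://cosmos.example.org:14000/webhdfs/v1/user/myuser/data.txt?op=CREATE&user.name=myuser");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             conn.setInstanceFollowRedirects(false);
             conn.setRequestMethod("PUT");
             conn.connect();
             String location = conn.getHeaderField("Location"); // Datanode (or HttpFS) URL
             conn.disconnect();

             // step 2: send the actual bytes to the redirected location
             HttpURLConnection put = (HttpURLConnection) new URL(location).openConnection();
             put.setRequestMethod("PUT");
             put.setDoOutput(true);
             put.setRequestProperty("Content-Type", "application/octet-stream");
             try (OutputStream out = put.getOutputStream();
                  FileInputStream in = new FileInputStream("local.txt")) {
                 byte[] buf = new byte[4096];
                 int n;
                 while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
             } // try
             System.out.println("HTTP " + put.getResponseCode()); // 201 Created on success
         } // main
     } // WebHdfsCreate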
  34. 2. WebHDFS/HttpFS REST API (cont.)
     • Open and read a file (2-step operation)
       GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]
       HTTP/1.1 307 TEMPORARY_REDIRECT
       Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
       Content-Length: 0
       GET http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
     • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
     • HttpFS does not redirect to the Datanode but to the HttpFS server, hiding the Datanodes (and saving tens of public IP addresses)
     • The API is the same
     • http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
  35. 3. Local Hive CLI
     • Hive is a querying tool
     • Queries are expressed in HiveQL, a SQL-like language
     • https://cwiki.apache.org/confluence/display/Hive/LanguageManual
     • Hive uses pre-defined MapReduce jobs for
        • Column selection
        • Fields grouping
        • Table joining
        • …
     • All the data is loaded into Hive tables
  36. 3. Local Hive CLI (cont.)
     • Log on to the Master node
     • Run the hive command
     • Type your SQL-like sentence!

       $ hive
       Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
       hive> select column1,column2,otherColumns from mytable where column1='whatever' and columns2 like '%whatever%';
       Total MapReduce jobs = 1
       Launching Job 1 out of 1
       Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
       Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
       2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
       2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
       2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
       2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
       …
  37. 4. Remote Hive client
     • Hive CLI is OK for human-driven testing purposes
     • But it is not usable by remote applications
        • Hive has no REST API
     • Hive has several drivers and libraries
        • JDBC for Java
        • Python
        • PHP
        • ODBC for C/C++
        • Thrift for Java and C++
     • https://cwiki.apache.org/confluence/display/Hive/HiveClient
     • A remote Hive client usually performs:
        • A connection to the Hive server
        • The query execution
  38. 4. Remote Hive client – Get a connection

     private Connection getConnection(
             String ip, String port, String user, String password) {
         try {
             // dynamically load the Hive JDBC driver
             Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
         } catch (ClassNotFoundException e) {
             System.out.println(e.getMessage());
             return null;
         } // try catch

         try {
             // return a connection based on the Hive JDBC driver, default DB
             return DriverManager.getConnection("jdbc:hive://" + ip + ":" + port
                 + "/default?user=" + user + "&password=" + password);
         } catch (SQLException e) {
             System.out.println(e.getMessage());
             return null;
         } // try catch
     } // getConnection

     https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
  39. 4. Remote Hive client – Do the query

     private void doQuery() {
         try {
             // from here on, everything is SQL!
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery("select column1,column2,"
                 + "otherColumns from mytable where column1='whatever' and "
                 + "columns2 like '%whatever%'");

             // iterate on the result
             while (res.next()) {
                 String column1 = res.getString(1);
                 int column2 = res.getInt(2); // note: JDBC ResultSet has getInt(), not getInteger()
                 // whatever you want to do with this row, here
             } // while

             // close everything
             res.close();
             stmt.close();
             con.close();
         } catch (SQLException ex) {
             System.exit(0);
         } // try catch
     } // doQuery

     https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
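  Not in the original deck: a minimal sketch of how the two methods above could be wired together. The class name, host and credentials are placeholders, and port 10000 is only the usual HiveServer default.

     public class HiveBasicClient {
         private Connection con;

         // getConnection() and doQuery() as listed on the two previous slides

         public static void main(String[] args) {
             HiveBasicClient client = new HiveBasicClient();
             // connect to the Hive server and, if that worked, run the query
             client.con = client.getConnection("cosmos.example.org", "10000", "myuser", "mypasswd");

             if (client.con != null) {
                 client.doQuery();
             } // if
         } // main
     } // HiveBasicClient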
  40. 4. Remote Hive client – Plague Tracker demo: https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/plague-tracker
  41. 5. MapReduce applications
     • MapReduce applications are commonly written in Java
        • Can be written in other languages through Hadoop Streaming
     • They are executed in the command line
       $ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>
     • A MapReduce job consists of:
        • A driver, a piece of software where to define inputs, outputs, formats, etc. and the entry point for launching the job
        • A set of Mappers, given by a piece of software defining their behaviour
        • A set of Reducers, given by a piece of software defining their behaviour
     • There are 2 APIs
        • org.apache.mapred → the old one
        • org.apache.mapreduce → the new one
     • Hadoop is distributed with MapReduce examples
        • [HADOOP_HOME]/hadoop-examples.jar
  42. 5. MapReduce applications – Map

     /* org.apache.mapred example */
     public static class MapClass extends MapReduceBase
             implements Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(LongWritable key, Text value,
                 OutputCollector<Text, IntWritable> output,
                 Reporter reporter) throws IOException {
             /* use the input value; the input key is the offset within the
                file and it is not necessary in this example */
             String line = value.toString();
             StringTokenizer tokenizer = new StringTokenizer(line);

             /* iterate on the string, getting each word */
             while (tokenizer.hasMoreTokens()) {
                 word.set(tokenizer.nextToken());
                 /* emit an output (key,value) pair based on the word and 1 */
                 output.collect(word, one);
             } // while
         } // map
     } // MapClass
  43. 5. MapReduce applications – Reduce

     /* org.apache.mapred example */
     public static class ReduceClass extends MapReduceBase
             implements Reducer<Text, IntWritable, Text, IntWritable> {

         public void reduce(Text key, Iterator<IntWritable> values,
                 OutputCollector<Text, IntWritable> output,
                 Reporter reporter) throws IOException {
             int sum = 0;

             /* iterate on all the values and add them */
             while (values.hasNext()) {
                 sum += values.next().get();
             } // while

             /* emit an output (key,value) pair based on the word and its count */
             output.collect(key, new IntWritable(sum));
         } // reduce
     } // ReduceClass
  44. 5. MapReduce applications – Driver

     /* org.apache.mapred example */
     package my.org;

     import java.io.IOException;
     import java.util.*;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.conf.*;
     import org.apache.hadoop.io.*;
     import org.apache.hadoop.mapred.*;
     import org.apache.hadoop.util.*;

     public class WordCount {
         public static void main(String[] args) throws Exception {
             JobConf conf = new JobConf(WordCount.class);
             conf.setJobName("wordcount");
             conf.setOutputKeyClass(Text.class);
             conf.setOutputValueClass(IntWritable.class);
             conf.setMapperClass(MapClass.class);
             conf.setCombinerClass(ReduceClass.class);
             conf.setReducerClass(ReduceClass.class);
             conf.setInputFormat(TextInputFormat.class);
             conf.setOutputFormat(TextOutputFormat.class);
             FileInputFormat.setInputPaths(conf, new Path(args[0]));
             FileOutputFormat.setOutputPath(conf, new Path(args[1]));
             JobClient.runJob(conf);
         } // main
     } // WordCount
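  Not part of the original deck: for comparison, a sketch of the same WordCount written against the newer org.apache.hadoop.mapreduce API mentioned on slide 41, assuming Hadoop 2.x (Job.getInstance); class names are illustrative.

     package my.org;

     import java.io.IOException;
     import java.util.StringTokenizer;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class NewApiWordCount {

         /* the new API uses abstract classes instead of interfaces */
         public static class TokenizerMapper
                 extends Mapper<LongWritable, Text, Text, IntWritable> {
             private final static IntWritable one = new IntWritable(1);
             private Text word = new Text();

             @Override
             protected void map(LongWritable key, Text value, Context context)
                     throws IOException, InterruptedException {
                 StringTokenizer tokenizer = new StringTokenizer(value.toString());
                 while (tokenizer.hasMoreTokens()) {
                     word.set(tokenizer.nextToken());
                     context.write(word, one); /* Context replaces OutputCollector */
                 } // while
             } // map
         } // TokenizerMapper

         public static class IntSumReducer
                 extends Reducer<Text, IntWritable, Text, IntWritable> {
             @Override
             protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                     throws IOException, InterruptedException {
                 int sum = 0;
                 for (IntWritable value : values) {
                     sum += value.get();
                 } // for
                 context.write(key, new IntWritable(sum));
             } // reduce
         } // IntSumReducer

         public static void main(String[] args) throws Exception {
             Job job = Job.getInstance(new Configuration(), "wordcount");
             job.setJarByClass(NewApiWordCount.class);
             job.setMapperClass(TokenizerMapper.class);
             job.setCombinerClass(IntSumReducer.class);
             job.setReducerClass(IntSumReducer.class);
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             System.exit(job.waitForCompletion(true) ? 0 : 1);
         } // main
     } // NewApiWordCount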
  45. 6. Launching tasks with Oozie
     • Oozie is a workflow scheduler system to manage Hadoop jobs
        • Java map-reduce
        • Pig and Hive
        • Sqoop
        • System specific jobs (such as Java programs and shell scripts)
     • Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
     • Writing Oozie applications is about including in a package
        • The MapReduce jobs, Hive/Pig scripts, etc. (executable code)
        • A Workflow
        • Parameters for the Workflow
     • Oozie can be used locally or remotely
     • https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation
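  Not in the original slides: a minimal sketch of the workflow.xml such a package might contain, assuming a single map-reduce action; the action name, schema version and property values are illustrative. The ${jobTracker}, ${nameNode}, ${inputDir} and ${outputDir} parameters are the ones the Java client on the next slide sets.

     <workflow-app name="mrjobs-wf" xmlns="uri:oozie:workflow:0.2">
         <start to="mr-node"/>
         <action name="mr-node">
             <map-reduce>
                 <job-tracker>${jobTracker}</job-tracker>
                 <name-node>${nameNode}</name-node>
                 <configuration>
                     <!-- old-API input/output directories, resolved from the job properties -->
                     <property>
                         <name>mapred.input.dir</name>
                         <value>${inputDir}</value>
                     </property>
                     <property>
                         <name>mapred.output.dir</name>
                         <value>${outputDir}</value>
                     </property>
                 </configuration>
             </map-reduce>
             <ok to="end"/>
             <error to="fail"/>
         </action>
         <kill name="fail">
             <message>Map/Reduce action failed</message>
         </kill>
         <end name="end"/>
     </workflow-app>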
  46. 6. Launching tasks with Oozie – Java client

     OozieClient client = new OozieClient("http://130.206.80.46:11000/oozie/");

     // create a workflow job configuration and set the workflow application path
     Properties conf = client.createConfiguration();
     conf.setProperty(OozieClient.APP_PATH, "hdfs://cosmosmaster-gi:8020/user/frb/mrjobs");
     conf.setProperty("nameNode", "hdfs://cosmosmaster-gi:8020");
     conf.setProperty("jobTracker", "cosmosmaster-gi:8021");
     conf.setProperty("outputDir", "output");
     conf.setProperty("inputDir", "input");
     conf.setProperty("examplesRoot", "mrjobs");
     conf.setProperty("queueName", "default");

     // submit and start the workflow job
     String jobId = client.run(conf);

     // wait until the workflow job finishes, printing the status every 10 seconds
     while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
         System.out.println("Workflow job running ...");
         Thread.sleep(10 * 1000);
     } // while

     System.out.println("Workflow job completed");
  47. Further reading
     • The datasets are described at:
        • http://tinyurl.com/cosmos-datasets
     • Hive remote basic client:
        • https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
     • Plague Tracker demo:
        • https://github.com/telefonicaid/fiware-livedemoapp/tree/master/cosmos/plague-tracker
        • http://130.206.81.65/plague-tracker/
     • More detailed information can be found here:
        • http://tinyurl.com/cosmos-programmer-guide
        • http://tinyurl.com/cosmos-apis
        • http://tinyurl.com/cosmos-architecture
  48. Contact: fiware-lab-help@lists.fi-ware.org, frb@tid.es
  49. http://fi-ppp.eu | http://fi-ware.eu | Follow @Fiware on Twitter! Thanks!