Hadoop and its Ecosystem Components in Action

Published on: SQL Server Live! Orlando 2012

Transcript

  • 1. SQL Server Live! Orlando 2012
       Hadoop and its Ecosystem Components in Action
       Andrew Brust, CEO and Founder, Blue Badge Insights
       Level: Intermediate

       Meet Andrew
       • CEO and Founder, Blue Badge Insights
       • Big Data blogger for ZDNet
       • Microsoft Regional Director, MVP
       • Co-chair VSLive! and 17 years as a speaker
       • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com
       • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com
       • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
       • brustblog.com, Twitter: @andrewbrust
  • 2. My New Blog (bit.ly/bigondata)

       MapReduce, in a Diagram
       [Diagram: inputs fan out to many mappers running in parallel; mapper outputs are partitioned by key (K1, K2, K3) and routed to reducers, which produce the final outputs]
  • 3. A MapReduce Example
       • Count by suite, on each floor
       • Send per-suite, per-platform totals to lobby
       • Sort totals by platform
       • Send two platform packets to 10th, 20th, 30th floor
       • Tally up each platform
       • Collect the tallies
       • Merge tallies into one spreadsheet

       What’s a Distributed File System?
       • One where data gets distributed over commodity drives on commodity servers
       • Data is replicated
       • If one box goes down, no data lost
         – Except the name node = SPOF!
       • BUT: HDFS is immutable
         – Files can only be written to once
         – So updates require drop + re-write (slow)
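To make the map and reduce roles concrete in code, here is a minimal sketch of the classic word-count job using the standard org.apache.hadoop.mapreduce API. The class names, job name, and whitespace tokenization are illustrative, not taken from the session.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map step: runs in parallel over input splits, emitting (word, 1) pairs
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // shuffled and grouped by key before reduce
                }
            }
        }
    }

    // Reduce step: receives all counts for one word and sums them
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires mapper and reducer into a job; args are HDFS input and output paths
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Hadoop 1.x: new Job(conf, "word count")
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, this is the kind of job the hadoop jar command shown later runs over data already sitting in HDFS.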
  • 4. Hadoop = MapReduce + HDFS
       • Modeled after Google MapReduce + GFS
       • Have more data? Just add more nodes to cluster.
         – Mappers execute in parallel
         – Hardware is commodity
         – “Scaling out”
       • Use of HDFS means data may well be local to mapper processing
       • So, not just parallel, but minimal data movement, which avoids network bottlenecks

       The Hadoop Stack
       [Diagram: log file integration, machine learning/data mining, RDBMS import/export, query (HiveQL and Pig Latin), and database layers, all on top of MapReduce and HDFS]
  • 5. Ways to work
       • Amazon Web Services Elastic MapReduce
         – Create AWS account
         – Select Elastic MapReduce in Dashboard
       • Microsoft Hadoop on Azure
         – Visit www.hadooponazure.com
         – Request invite
       • Cloudera CDH VM image
         – Download
         – Run via VMware Player

       Amazon Elastic MapReduce
       • Lots of steps!
       • At a high level:
         – Setup AWS account and S3 “buckets”
         – Generate Key Pair and PEM file
         – Install Ruby and EMR Command Line Interface
         – Provision the cluster using CLI
         – Setup and run SSH/PuTTY
         – Work interactively at command line
  • 6. Amazon EMR – Prep Steps
       • Create an AWS account
       • Create an S3 bucket for log storage
         – with list permissions for authenticated users
       • Create a Key Pair and save PEM file
       • Install Ruby
       • Install Amazon Web Services Elastic MapReduce Command Line Interface
         – aka AWS EMR CLI
       • Create credentials.json in EMR CLI folder
         – Associate with same region as where key pair created

       Amazon – Security and Startup
       • Security
         – Download PuTTYgen and run it
         – Click Load and browse to PEM file
         – Save it in PPK format
         – Exit PuTTYgen
       • In a command window, navigate to EMR CLI folder and enter command (see the example invocation below):
         – ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large]
       • In AWS Console, go to EC2 Dashboard and click Instances on left nav bar
       • Wait until instance is running and get its Public DNS name
         – Use Compatibility View in IE or copy may not work
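For instance, a small interactive cluster with Hive might be started like this (the instance count and option mix are illustrative; the flags come from the command template above):

  ruby elastic-mapreduce --create --alive --num-instances 3 --hive-interactive

The --alive flag keeps the cluster running rather than terminating it when its initial steps finish, so you can connect over SSH and work interactively.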
  • 7. Connect!
       • Download and run PuTTY
       • Paste DNS name of EC2 instance into hostname field
       • In Treeview, drill down and navigate to Connection > SSH > Auth, browse to PPK file
       • Once EC2 instance(s) running, click Open
       • Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert
       • When prompted for user name, type “hadoop” and hit Enter
       • cd bin, then hive, pig, hbase shell
       • Right-click to paste from clipboard; option to go full-screen
       • (Kill EC2 instance(s) from Dashboard when done)

       Amazon Elastic MapReduce
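As an aside not covered in the deck: on a machine with an OpenSSH client you can skip the PuTTY/PPK conversion and connect with the PEM file directly (key file and host name below are hypothetical):

  ssh -i MyKeyPair.pem hadoop@ec2-12-34-56-78.compute-1.amazonaws.com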
  • 8. Microsoft Hadoop on Azure
       • Much simpler
       • Browser-based portal
         – Provisioning cluster, managing ports, MapReduce jobs
         – External data from Azure BLOB storage
       • Interactive JavaScript console
         – HDFS, Pig, light data visualization
       • Interactive Hive console
         – Hive commands and metadata discovery
       • From Portal page you can RDP directly to Hadoop head node
         – Double-click desktop shortcut for CLI access
         – Certain environment variables may need to be set

       Microsoft HDInsight
  • 9. Hadoop commands
       • HDFS
         – hadoop fs -<command>
         – Create and remove directories: mkdir, rm, rmr
         – Upload and download files to/from HDFS: put, get
         – View directory contents: ls, lsr
         – Copy, move, view files: cp, mv, cat
       • MapReduce
         – Run a Java jar-file based job: hadoop jar <jarname> <params>

       Hadoop (directly)
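A hypothetical end-to-end session (directory, file, and jar names are made up for illustration) ties these commands together with the word-count job sketched earlier:

  hadoop fs -mkdir /user/demo/input
  hadoop fs -put sales.log /user/demo/input/
  hadoop fs -ls /user/demo/input
  hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output
  hadoop fs -cat /user/demo/output/part-r-00000

The part-r-00000 file is the output of the job’s single reducer; a job with more reducers produces one such file per reducer.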
  • 10. HBase
       • Concepts:
         – Tables, column families
         – Columns, rows
         – Keys, values
       • Commands:
         – Definition: create, alter, drop, truncate
         – Manipulation: get, put, delete, deleteall, scan
         – Discovery: list, exists, describe, count
         – Enablement: disable, enable
         – Utilities: version, status, shutdown, exit
         – Reference: http://wiki.apache.org/hadoop/Hbase/Shell

       HBase Examples
       • create 't1', 'f1', 'f2', 'f3'
       • describe 't1'
       • alter 't1', {NAME => 'f1', VERSIONS => 5}
       • put 't1', 'r1', 'c1', 'value', ts1
       • get 't1', 'r1'
       • count 't1'
  • 11. HBase

       Hive
       • Used by most BI products which connect to Hadoop
       • Provides a SQL-like abstraction over Hadoop
         – Officially HiveQL, or HQL
       • Works on own tables, but also on HBase
       • Query generates MapReduce job, output of which becomes result set
       • Microsoft has Hive ODBC driver
         – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
  • 12. Hive, Continued
       • Load data from flat HDFS files
         – LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
       • SQL Queries
         – CREATE, ALTER, DROP
         – INSERT OVERWRITE (creates whole tables)
         – SELECT, JOIN, WHERE, GROUP BY
         – SORT BY, but ordering data is tricky!
         – USING allows for streaming on map, reduce steps

       Hive
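A short illustrative HiveQL sequence (table, column, and path names are hypothetical) showing the create/load/query flow; note that SORT BY only orders rows within each reducer, which is the “ordering is tricky” caveat above:

  CREATE TABLE weblogs (ip STRING, url STRING, hits INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA INPATH '/user/demo/weblogs.tsv' OVERWRITE INTO TABLE weblogs;
  SELECT url, SUM(hits) AS total_hits
  FROM weblogs
  GROUP BY url
  SORT BY total_hits DESC;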
  • 13. Pig
       • Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
         – Do a combo of Query and ETL
       • “10 lines of Pig Latin ≈ 200 lines of Java.”
       • Works with structured or unstructured data
       • Operations
         – As with Hive, a MapReduce job is generated
         – Unlike Hive, output is only flat file to HDFS
         – With MS Hadoop, can easily convert to JavaScript array, then manipulate
       • Use command line (“Grunt”) or build scripts

       Example
       • A = LOAD 'myfile' AS (x, y, z);
         B = FILTER A BY x > 0;
         C = GROUP B BY x;
         D = FOREACH C GENERATE group, COUNT(B);
         STORE D INTO 'output';
  • 14. Pig Latin Examples
       • Imperative, file system commands
         – LOAD, STORE
         – Schema specified on LOAD
       • Declarative, query commands (SQL-like)
         – xxx = table/file
         – FOREACH xxx GENERATE (SELECT…FROM xxx)
         – JOIN (WHERE/INNER JOIN)
         – FILTER xxx BY (WHERE)
         – ORDER xxx BY (ORDER BY)
         – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
         – DISTINCT (SELECT DISTINCT)
       • Syntax is assignment statement-based:
         – MyCusts = FILTER Custs BY SalesPerson eq 15
       • COGROUP, UDFs

       Pig
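Putting those mappings into one sketch of a script (file name, fields, and the comma delimiter are hypothetical; current Pig uses == where the slide’s syntax sample shows eq):

  Custs   = LOAD 'custs.csv' USING PigStorage(',')
            AS (custid:int, salesperson:int, region:chararray, sales:double);
  MyCusts = FILTER Custs BY salesperson == 15;          -- WHERE
  Grouped = GROUP MyCusts BY region;                    -- GROUP BY
  Totals  = FOREACH Grouped GENERATE group AS region,
                COUNT(MyCusts) AS custs,                -- COUNT(*)
                SUM(MyCusts.sales) AS total;            -- SUM()
  Sorted  = ORDER Totals BY total DESC;                 -- ORDER BY
  STORE Sorted INTO 'region_totals';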
  • 15. Sqoop
       sqoop import
         --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
         --table <from_table>
         --target-dir <to_hdfs_folder>
         --split-by <from_table_column>

       Sqoop
       sqoop export
         --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
         --table <to_table>
         --export-dir <from_hdfs_folder>
         --input-fields-terminated-by "<delimiter>"
  • 16. Flume NG
       • Source
         – Avro (data serialization system – can read JSON-encoded data files, and can work over RPC)
         – Exec (reads from stdout of long-running process)
       • Sinks
         – HDFS, HBase, Avro
       • Channels
         – Memory, JDBC, file

       Flume NG
       • Setup conf/flume.conf:

         # Define a memory channel called ch1 on agent1
         agent1.channels.ch1.type = memory

         # Define an Avro source called avro-source1 on agent1 and tell it
         # to bind to 0.0.0.0:41414. Connect it to channel ch1.
         agent1.sources.avro-source1.channels = ch1
         agent1.sources.avro-source1.type = avro
         agent1.sources.avro-source1.bind = 0.0.0.0
         agent1.sources.avro-source1.port = 41414

         # Define a logger sink that simply logs all events it receives
         # and connect it to the other end of the same channel.
         agent1.sinks.log-sink1.channel = ch1
         agent1.sinks.log-sink1.type = logger

         # Finally, now that we've defined all of our components, tell
         # agent1 which ones we want to activate.
         agent1.channels = ch1
         agent1.sources = avro-source1
         agent1.sinks = log-sink1

       • From the command line:
         flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
  • 17. Mahout Algorithms
       • Recommendation
         – Your info + community info
         – Give users/items/ratings; get user-user/item-item
         – itemsimilarity
       • Classification/Categorization
         – Drop into buckets
         – Naïve Bayes, Complementary Naïve Bayes, Decision Forests
       • Clustering
         – Like classification, but with categories unknown
         – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift

       Workflow, Syntax
       • Workflow
         – Run the job
         – Dump the output
         – Visualize, predict
       • mahout <algorithm> --input <folderspec> --output <folderspec> --param1 value1 --param2 value2 …
       • Example:
         – mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
  • 18. Resources
       • Big On Data blog
         – http://www.zdnet.com/blog/big-data
       • Apache Hadoop home page
         – http://hadoop.apache.org/
       • Hive & Pig home pages
         – http://hive.apache.org/
         – http://pig.apache.org/
       • Hadoop on Azure home page
         – https://www.hadooponazure.com/
       • SQL Server 2012 Big Data
         – http://bit.ly/sql2012bigdata

       Thank you
       • andrew.brust@bluebadgeinsights.com
       • @andrewbrust on twitter
       • Get Blue Badge’s free briefings
         – Text “bluebadge” to 22828