Learn how to get started with Big Data using a platform based on Apache Hadoop, Apache Spark, and IBM BigInsights technologies. The emphasis here is on free or low-cost options that require modest technical skills.
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
1. Get Started with Big Data Analytics
Cynthia Saracco (Saracco@us.ibm.com), Session #1031
2. Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole
discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in
making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any
material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual
throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
3. Executive summary
• Big Data analytics growing rapidly across geographies and industries
• Core open source technologies often include Hadoop, Spark
• Common challenges
– Getting started and growing skills
– Demonstrating value quickly
• Focus of this talk:
– Using open source and IBM technologies for Big Data analytics
– Emphasis on free services and software with modest skill requirements,
available on cloud or in on-premise installations
4. • The big picture on Big Data: growth projections, applications, challenges
• Understanding IBM’s approach: combining open source and IBM-specific technologies
with IBM BigInsights
• How to get started?
– Cloud, on-premise installation options
– Managing your cluster
– Storing, querying, and analyzing your data with IBM Big SQL (easy on-ramp to Hadoop for SQL
professionals)
– Exploring your data without writing code using IBM BigSheets
– . . .
• Summary / resources
Agenda
5. The big picture on Big Data
Opportunities, requirements, applications . . . .
• 1 in 3 business leaders frequently make decisions based on information
they don’t trust, or don’t have
• 1 in 2 business leaders say they don’t have access to the information
they need to do their jobs
• 83% of CIOs cited “business intelligence and analytics” as part of their
visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding
information rapidly in order to make swift business decisions
• 2.5 million items per minute
• 300,000 tweets per minute
• 200 million emails per minute
• 220,000 photos per minute
• 5 TB per flight
• > 1 PB per day from gas turbines
Big Data presents new opportunities for insights . . .
7. What we hear from customers . . . .
• Lots of potentially valuable data is dormant or discarded
due to size/performance considerations
• Large volume of unstructured or semi-structured data is not
worth integrating fully (e.g. Tweets, logs, . . .)
• Not clear what should be analyzed (exploratory, iterative)
• Information distributed across multiple systems and/or
Internet
• Some information has a short useful lifespan
• Volumes can be extremely high
• Analysis needed in the context of existing information (not
stand alone)
9. IBM Big Data customer scenarios (Hadoop-based)
• Applications
– Data warehouse integration
– Cloud-based analytics
– Telematics
– Targeted marketing campaigns
– Optimization of capital investments
– . . .
• Industries
– Insurance
– Travel
– Entertainment
– Energy
– Technology
– Banking
– . . . .
http://www.ibm.com/analytics/us/en/case-studies.html#topic=hadoop
https://developer.ibm.com/hadoop/blog/2015/11/03/biginsights-and-big-sql-customer-use-cases/
10. IBM’s Big Data approach
Leveraging Hadoop, Spark, and IBM technologies
11. IBM analytics platform strategy for Big Data
• Integrate and manage
the full variety, velocity
and volume of Big Data
• Apply advanced analytics
• Visualize all available
data for ad-hoc analysis
• Support workload
optimization and
scheduling
• Provide for security and
governance
• Integrate with enterprise
software
[Diagram: IBM Analytics Platform – “Built on Spark. Hybrid. Trusted.”
Analytics layer: Discovery & Exploration; Prescriptive, Predictive, and
Content Analytics; Business Intelligence. Data layer: Data Mgmt; Hadoop &
NoSQL; Content Mgmt; Data Warehouse. Foundation: Information Integration &
Governance; Spark Analytics Operating System; Machine Learning. On
premises or on cloud, over data at rest and in motion, inside and outside
the firewall, structured and unstructured.]
12. Hadoop and the enterprise
[Diagram: Ingestion and Real-time Analytic Zone (Streams, connectors);
Landing and Analytics Sandbox Zone (Hadoop, MapReduce, Hive/HBase, column
stores, documents in a variety of formats); Warehousing Zone (enterprise
warehouse, data marts); Analytics and Reporting Zone (BI & reporting,
predictive analytics, visualization & discovery); Metadata and Governance
Zone (ETL, MDM, data governance)]
13. An open portfolio of self-service, composable data and analytic services for the
developer, data science professional, and analytic architect, helping
businesses and organizations build applications and gain new insights
better and faster.
Comprehensive
• Broadest selection of data and analytic services available on multiple
cloud platforms
• Pre-built integrations across the portfolio
• Integrated with open data to gain deeper insights
Trusted
• Fully managed: 24 x 7
• Secure infrastructure
• Mitigate risk and lower costs
Flexible
• Open-source-driven innovation
• Industry-leading support for hybrid deployments
• Bare metal, virtual, pay-as-you-go, and reserved
Cloud Data Services is Open For Data
14. IBM BigInsights for Apache Hadoop
[Diagram: IBM Analytics Platform (“Built on Spark. Hybrid. Trusted.”), as
on slide 11]
§ Analytical platform for persistent Big Data
– 100% open source core with IBM add-ons for analysts, data scientists,
and admins
– Includes Hadoop and Spark
– On premise or cloud
§ Distinguishing characteristics
– Built-in analytics . . . enhances business knowledge
– Enterprise software integration . . . complements and extends existing
capabilities
– Production-ready . . . speeds time-to-value
§ IBM advantage
– Combination of software, hardware, services and research
15. Overview of BigInsights
[Diagram: BigInsights stack]
• IBM Open Platform with Apache Hadoop (HDFS, YARN, MapReduce, Ambari,
Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop,
Zookeeper)
• IBM BigInsights Analyst: industry-standard SQL (Big SQL),
spreadsheet-style tool (BigSheets)
• IBM BigInsights Data Scientist: Big R (R support), Machine Learning on
Big R, Text Analytics
• IBM BigInsights Enterprise Management: POSIX distributed filesystem,
multi-workload, multi-tenant scheduling
Free Quick Start (non production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist features
• Community support
. . .
16. How to get started
Acquiring an environment: cloud, on premise options
17. Options for accessing IBM BigInsights
• Cloud options
– Bluemix: http://bluemix.net
– IMDemo cloud (technical previews): http://bigsql.imdemocloud.com
• Installations
– VMware image
– Docker image
– Install image for your own cluster
• Options differ somewhat in breadth of features, privileges available
– Check documentation available from the download site
– Support available via forums
18. Cloud options (Bluemix)
http://bluemix.net
Developer sandbox – Analytics for Hadoop (http://www.bluemix.net)
• Prototype, demo, trial in the cloud
• Empowers developers to rapidly drive insight from all data
• Adds Hadoop-based analytics to your application
• Enterprise features – BigSheets, Big SQL, text analytics, HiveQL, HttpFS
• Delivered via IBM Bluemix; to be decommissioned shortly
Production environment – BigInsights for Apache Hadoop
(http://www.ibm.com/cloud)
• Production deployments at scale in the cloud
• Delivers flexibility and efficiency with subscription pricing
• Scales to meet spikes in demand without on-premise infrastructure
• Drives enterprise-class, complex analytics on Big Data sets
• Available via the IBM Cloud Marketplace and Bluemix
20. IBM BigInsights on Cloud (Bluemix subscription)
Secure, dedicated bare-metal infrastructure running the IBM Open Platform
• Small nodes – basic data extraction, transformation, file processing,
search: 20 cores, 64 GB RAM, 20 TB raw data disks (~6 TB usable), 8 TB OS
disks, 10 Gb network
• Medium nodes – data warehouse optimization (store new data or extend the
warehouse): 20 cores, 128 GB RAM, 28 TB raw data disks (~9 TB usable),
8 TB OS disks, 10 Gb network
• Large nodes – advanced analytics, intensive data processing: 24 cores,
256 GB RAM, 32 TB raw data disks (~10 TB usable), 8 TB OS disks, 10 Gb
network
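The raw-to-usable ratios above (20 TB → ~6 TB, 28 TB → ~9 TB, 32 TB → ~10 TB) are consistent with HDFS's default replication factor of 3, an assumption the slide does not state; a quick sanity check:

```python
# Sanity check (assumption: HDFS default replication factor of 3, which
# the slide does not state): usable capacity is roughly raw / 3.
raw_tb = [20, 28, 32]
usable_tb = [round(r / 3, 1) for r in raw_tb]
print(usable_tb)  # [6.7, 9.3, 10.7], matching the ~6/~9/~10 TB figures
```

In practice usable space is a bit lower still, since some capacity goes to temporary and intermediate files.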
21. IMDemo Cloud sandbox with technical previews
To register for free use, visit http://bigsql.imdemocloud.com
22. On-premise options: native install, VMware, Docker
[Diagram: IBM BigInsights for Apache Hadoop stack, as on slide 15]
Free Quick Start (non-production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist features
• Community support
. . .
23. Where to download images
• Download Quick Start offering
• Links available from HadoopDev (“try it for free”)
– https://developer.ibm.com/hadoop/
24. Looking for data? Look to Bluemix . . . .
• IBM Analytics Exchange: new publicly accessible catalog with > 150
data sets.
• Part of IBM’s Open for Data initiative
26. Ambari console
• Inspect status, start/stop services, etc.
• Launch via Web browser (or cloud-specific link), e.g.
http://myhost.ibm.com:8080
27. How to get started
Storing, querying, and analyzing data with Big SQL
28. Overview of SQL for Hadoop (Big SQL)
[Diagram: SQL-based application → IBM data server client → Big SQL engine
(SQL MPP run-time) → data storage (DFS) within BigInsights]
§ Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries . . .
– GRANT/REVOKE, INSERT … INTO
– Procedural logic in SQL
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuous running daemons (no start up latency)
– Message passing allows data to flow between nodes without
persisting intermediate results
– In-memory operations with ability to spill to disk (useful for
aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query federation
29. • Command-line interface: Java SQL Shell (JSqsh)
• Web tooling (Data Server Manager)
• Tools that support IBM JDBC/ODBC driver
Invocation options
30. Creating a Big SQL table
• Standard CREATE TABLE DDL with extensions
create hadoop table users
(
  id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null
)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in DFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
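Because the table above is stored as delimited text with `|` as the field terminator, its files in DFS are plain pipe-separated lines. A small Python sketch of that on-disk layout (the sample rows and values are made up for illustration):

```python
import csv
import io

# Rows matching the users table above: id, office_id, fname, lname.
# Sample values are hypothetical.
rows = [(1, 101, "Ann", "Lee"), (2, 102, "Raj", "Patel")]

# Write them the way Big SQL stores a delimited textfile table:
# one record per line, fields terminated by '|'.
buf = io.StringIO()
csv.writer(buf, delimiter="|", lineterminator="\n").writerows(rows)
text = buf.getvalue()
print(text)
# 1|101|Ann|Lee
# 2|102|Raj|Patel

# Reading a line back splits on the same delimiter:
first = text.splitlines()[0].split("|")  # ['1', '101', 'Ann', 'Lee']
```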
31. Results from previous CREATE TABLE . . .
• Data stored in subdirectory of Hive warehouse
. . . /hive/warehouse/myid.db/users
– Default schema is user ID. Can create new schemas
– “Table” is just a subdirectory under schema.db
– Table’s data are files within table subdirectory
• Metadata collected (Big SQL & Hive)
– SYSCAT.* and SYSHADOOP.* views
• Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL
schema over existing DFS directory contents
– Useful if table contents already in DFS
– Avoids need to LOAD data
32. Populating Tables via LOAD
• Typically best runtime performance
• Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
• Load data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
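Loads like the two above are often scripted. Below is a hypothetical Python helper that assembles the file-URL form of the LOAD HADOOP statement; the helper and its parameter names are illustrative, and only the SQL keywords mirror the slide's example:

```python
# Hypothetical helper to build a Big SQL LOAD HADOOP statement of the
# file-URL form shown above. The function itself is illustrative.
def build_load_stmt(file_url, delimiter, table, overwrite=True):
    mode = "overwrite" if overwrite else "append"
    return (
        f"load hadoop using file url '{file_url}' "
        f"with SOURCE PROPERTIES ('field.delimiter'='{delimiter}') "
        f"INTO TABLE {table} {mode}"
    )

stmt = build_load_stmt(
    "sftp://myID:myPassword@myServer.ibm.com:22/data/GO_REGION_DIM.txt",
    "\\t",  # tab delimiter, written as an escape sequence
    "gosalesdw.GO_REGION_DIM",
)
```

The generated string can then be submitted through the IBM JDBC/ODBC drivers like any other Big SQL statement.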
33. Querying your Big SQL tables
• Same as ISO-compliant RDBMS
• No special query syntax for Hadoop tables
– Projections, restrictions
– UNION, INTERSECT, EXCEPT
– Wide range of built-in functions (e.g. OLAP)
– Full support for subqueries
– All standard join operations
– . . .
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
34. Accessing Big SQL data from Spark shell
// based on BigInsights 4.1, which includes Spark 1.5.1
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// establish a Hive context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// query some Big SQL data
val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")
// action on the data – count # of rows
saleFacts.count()
. . .
// transform the data as needed (create a Vector with data from 2 cols)
val subset = saleFacts.map {row =>
Vectors.dense(row.getDouble(16),row.getDouble(17))}
// invoke basic Spark MLlib statistical function over the data
val stats = Statistics.colStats(subset)
// print one of the statistics collected
println(stats.mean)
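For intuition, here is the column-mean part of what Statistics.colStats reports, sketched in plain Python; the sample vectors are hypothetical stand-ins for the two columns pulled from bigsql.sls_sales_fact above:

```python
# Plain-Python sketch of the per-column mean that Statistics.colStats
# computes. Sample vectors are made-up stand-ins for the two columns
# selected from bigsql.sls_sales_fact in the Scala example above.
vectors = [(3.0, 10.0), (5.0, 20.0), (7.0, 30.0)]

def col_means(vecs):
    n = len(vecs)
    dims = len(vecs[0])
    return [sum(v[i] for v in vecs) / n for i in range(dims)]

means = col_means(vectors)
print(means)  # [5.0, 20.0]
```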
35. A word about . . . SerDes
• Custom serializers / deserializers (SerDes)
– Read / write complex or “unusual” data formats (e.g., JSON)
– Commonly used by Hadoop community
– Developed by user or available publicly
• Add SerDe JARs to the appropriate directories; reference SerDe when creating table
-- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
-- Location clause points to DFS dir containing JSON data
-- External clause means DFS dir & data won’t be dropped after DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';
select * from socialmedia-json;
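Conceptually, a JSON SerDe maps each JSON record to the table's columns at read time. A small Python illustration of that mapping (the sample records are hypothetical; only the Country column comes from the CREATE TABLE above):

```python
import json

# Hypothetical JSON records of the kind the SerDe would read; only the
# Country field mirrors the socialmedia-json table definition above.
raw_lines = [
    '{"Country": "Canada", "Keywords": "thanksgiving"}',
    '{"Country": "Mexico", "Keywords": "boxing day"}',
]

# What the SerDe does per record: parse the JSON, then project the
# fields that correspond to table columns.
rows = [(json.loads(line)["Country"],) for line in raw_lines]
print(rows)  # [('Canada',), ('Mexico',)]
```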
36. Sample JSON input for previous example
JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe
37. Sample Big SQL query output for JSON data
Sample output: Select * from socialmedia-json
38. How to get started
Exploring your data without writing code using BigSheets
39. Spreadsheet-style analysis (BigSheets)
• Web-based analysis and
visualization
• Spreadsheet-like interface
– Explore, manipulate data
without writing code
– Invoke pre-built functions
– Generate charts
– Export results of analysis
– Create custom plug-ins
– . . .
40. Working with BigSheets
• Create workbook for data in DFS
• Customize workbook through graphical
editor and built-in functions
– Filter data
– Apply functions / macros / formulas
– Combine data from multiple workbooks
• “Run” workbook: apply work to full data
set
• Explore results in spreadsheet format
and/or create charts
• Optionally, export your data
[Diagram: BigSheets architecture – builder front end and evaluation
service; simulation over a sample produces results for the model; “Run”
performs full execution of the model (via Pig) against the data]
42. Summary
• Big Data analytics in high demand
– Open source technologies (e.g., Apache Hadoop, Spark)
– Vendor-specific analytic tools, engines, and applications
• Multiple options to build Big Data skills with IBM BigInsights
– Cloud: Bluemix, IMDemo cloud (tech previews)
– VMWare / Docker images for your laptop (free download)
– IBM BigInsights Quick Start edition native installation (free download)
43. Hadoop Dev: developer site for IBM BigInsights
Downloads, forums, labs, papers, and more on Hadoop Dev
https://developer.ibm.com/hadoop/
44. Thank You
Your Feedback is Important!
Access the InterConnect 2016 Conference Attendee Portal to complete your
session surveys from your smartphone, laptop, or conference kiosk.