Get Started with Big Data Analytics
Cynthia Saracco (Saracco@us.ibm.com), Session #1031
Please Note:
•  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole
discretion.
•  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in
making a purchasing decision.
•  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any
material, code or functionality. Information about potential future products may not be incorporated into any contract.
•  The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
•  Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual
throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Executive summary
•  Big Data analytics is growing rapidly across geographies and industries
•  Core open source technologies often include Hadoop, Spark
•  Common challenges
– Getting started and growing skills
– Demonstrating value quickly
•  Focus of this talk:
– Using open source and IBM technologies for Big Data analytics
– Emphasis on free services and software with modest skill requirements,
available on cloud or on-premise installations
Agenda
•  The big picture on Big Data: growth projections, applications, challenges
•  Understanding IBM’s approach: combining open source and IBM-specific technologies with IBM BigInsights
•  How to get started?
–  Cloud, on-premise installation options
–  Managing your cluster
–  Storing, querying, and analyzing your data with IBM Big SQL (easy on-ramp to Hadoop for SQL professionals)
–  Exploring your data without writing code using IBM BigSheets
–  . . .
•  Summary / resources
The big picture on Big Data
Opportunities, requirements, applications . . . .
1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
1 in 2 business leaders say they don’t have access to the information they need to do their jobs
83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
Big Data presents new opportunities for insights . . .
•  2.5 million items per minute
•  300,000 tweets per minute
•  200 million emails per minute
•  220,000 photos per minute
•  5 TB per flight
•  > 1 PB per day (gas turbines)
What we hear from customers . . . .
•  Lots of potentially valuable data is dormant or discarded
due to size/performance considerations
•  Large volume of unstructured or semi-structured data is not
worth integrating fully (e.g. Tweets, logs, . . .)
•  Not clear what should be analyzed (exploratory, iterative)
•  Information distributed across multiple systems and/or
Internet
•  Some information has a short useful lifespan
•  Volumes can be extremely high
•  Analysis needed in the context of existing information (not
stand alone)
Big Data in practice
IBM Big Data customer scenarios (Hadoop-based)
•  Applications
–  Data warehouse integration
–  Cloud-based analytics
–  Telematics
–  Targeted marketing campaigns
–  Optimization of capital investments
–  . . .
•  Industries
–  Insurance
–  Travel
–  Entertainment
–  Energy
–  Technology
–  Banking
–  . . . .
http://www.ibm.com/analytics/us/en/case-studies.html#topic=hadoop
https://developer.ibm.com/hadoop/blog/2015/11/03/biginsights-and-big-sql-customer-use-cases/
IBM’s Big Data approach
Leveraging Hadoop, Spark, and IBM technologies
IBM analytics platform strategy for Big Data
•  Integrate and manage
the full variety, velocity
and volume of Big Data
•  Apply advanced analytics
•  Visualize all available
data for ad-hoc analysis
•  Support workload
optimization and
scheduling
•  Provide for security and
governance
•  Integrate with enterprise
software
[Figure: IBM Analytics Platform (built on Spark; hybrid; trusted). Analytics capabilities: Discovery & Exploration, Prescriptive Analytics, Predictive Analytics, Content Analytics, Business Intelligence. Data stores: Data Management, Hadoop & NoSQL, Content Management, Data Warehouse. Foundation: Information Integration & Governance; Spark analytics operating system; machine learning. Deployed on premises or on cloud, over data at rest and in motion, inside and outside the firewall, structured and unstructured.]
Hadoop and the enterprise
[Figure: enterprise data zones. Ingestion and Real-time Analytic Zone: Streams, Connectors. Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase column stores, documents in a variety of formats. Warehousing Zone: Enterprise Warehouse, Data Marts. Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery. Metadata and Governance Zone: ETL, MDM, Data Governance.]
Cloud Data Services is Open For Data
An open portfolio of self-service, composable data and analytic services for the developer, data science professional, and analytic architect. We help transform businesses and organizations to build applications and gain new insights better and faster.
Comprehensive
•  Broadest selection of data and analytic services available on multiple cloud platforms
•  Pre-built integrations across the portfolio
•  Integrated with open data to gain deeper insights
Trusted
•  Fully managed: 24 x 7
•  Secure infrastructure
•  Mitigate risk and lower costs
Flexible
•  Open source-driven innovation
•  Industry-leading support for hybrid deployments
•  Bare metal, virtual, pay-as-you-go and reserved
IBM BigInsights for Apache Hadoop
§  Analytical platform for
persistent Big Data
–  100% open source core with
IBM add-ons for analysts, data
scientists, and admins
–  Includes Hadoop and Spark
–  On premise or cloud
§  Distinguishing
characteristics
–  Built-in analytics . . . .
Enhances business
knowledge
–  Enterprise software
integration . . . . Complements
and extends existing
capabilities
–  Production-ready . . . .
Speeds time-to-value
§  IBM advantage
–  Combination of software,
hardware, services and
research
Overview of BigInsights
•  IBM Open Platform with Apache Hadoop
–  HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper
•  IBM BigInsights Analyst
–  Big SQL: industry standard SQL
–  BigSheets: spreadsheet-style tool
•  IBM BigInsights Data Scientist
–  Text Analytics
–  Big R (R support)
–  Machine Learning on Big R
•  IBM BigInsights Enterprise Management
–  POSIX Distributed Filesystem
–  Multi-workload, multi-tenant scheduling
•  Free Quick Start (non production)
–  IBM Open Platform
–  BigInsights Analyst, Data Scientist features
–  Community support
How to get started
Acquiring an environment: cloud, on premise options
Options for accessing IBM BigInsights
•  Cloud options
– Bluemix: http://bluemix.net
– IMDemo cloud (technical previews): http://bigsql.imdemocloud.com
•  Installations
– VMware image
– Docker image
– Install image for your own cluster
•  Options differ somewhat in breadth of features, privileges available
– Check documentation available from the download site
– Support available via forums
Cloud options (Bluemix)
http://www.ibm.com/cloud
http://www.bluemix.net
Developer sandbox: Analytics for Hadoop
•  Prototype, demo, trial in the cloud
•  Empowers developers to rapidly drive insight from all data
•  Adds Hadoop-based analytics to your application
•  Enterprise features: BigSheets, Big SQL, Text analytics, HiveQL, HttpFS
•  Delivered via IBM Bluemix. To be decommissioned shortly.
Production environment: BigInsights for Apache Hadoop
•  Production deployments at scale in the cloud
•  Delivers flexibility and efficiency with subscription pricing
•  Scales to meet spikes in demand without on-premise infrastructure
•  Drives enterprise-class, complex analytics on Big Data sets
•  Available via the IBM Cloud Marketplace and Bluemix
Bluemix sandbox service: to be decommissioned shortly; pay-as-you-go offering in closed beta
IBM BigInsights on Cloud (Bluemix subscription)
Secure, Dedicated Bare-metal Infrastructure
IBM Open Platform
•  Small nodes: basic data extraction, transformation, file processing, search
–  20 cores, 64 GB RAM, 20 TB raw data disks (~6 TB usable), 8 TB OS disks, 10 Gb network
•  Medium nodes: data warehouse optimization (store new data or extend warehouse)
–  20 cores, 128 GB RAM, 28 TB raw data disks (~9 TB usable), 8 TB OS disks, 10 Gb network
•  Large nodes: advanced analytics, intensive data processing
–  24 cores, 256 GB RAM, 32 TB raw data disks (~10 TB usable), 8 TB OS disks, 10 Gb network
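The gap between raw and usable disk in the node sizes above is roughly what block replication would predict. A rough sketch of that arithmetic, assuming a replication factor of 3 (the usual HDFS default; the slide does not say which factor was used) and a hypothetical helper name:

```python
# Rough estimate of usable capacity per node under block replication.
# ASSUMPTION: replication factor 3 (the HDFS default); not stated on the slide.

def usable_tb(raw_tb, replication=3):
    """Return raw capacity divided by the number of replicas kept."""
    return raw_tb / replication

# Raw data-disk sizes quoted for the small, medium, and large nodes.
node_raw_tb = {"small": 20, "medium": 28, "large": 32}

estimates = {name: round(usable_tb(raw), 1) for name, raw in node_raw_tb.items()}

for name, tb in estimates.items():
    print(f"{name}: about {tb} TB before filesystem overhead")
```

The quoted usable figures (~6, ~9, and ~10 TB) sit slightly below these estimates, which would be consistent with some space being held back for temporary and system data.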
IMDemo Cloud sandbox with technical previews
To register for free use, visit http://bigsql.imdemocloud.com
On-premise options: native install, VMware, Docker
IBM BigInsights for Apache Hadoop
•  IBM Open Platform with Apache Hadoop*
–  HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper
•  IBM BigInsights Analyst
–  Big SQL: industry standard SQL
–  BigSheets: spreadsheet-style tool
•  IBM BigInsights Data Scientist
–  Text Analytics
–  Big R (R support)
–  Machine Learning on Big R
•  IBM BigInsights Enterprise Management
–  POSIX Distributed Filesystem
–  Multi-workload, multi-tenant scheduling
•  Free Quick Start (non production)
–  IBM Open Platform
–  BigInsights Analyst, Data Scientist features
–  Community support
Where to download images
•  Download Quick Start offering
•  Links available from HadoopDev (“try it for free”)
–  https://developer.ibm.com/hadoop/
Looking for data? Look to Bluemix . . . .
•  IBM Analytics Exchange: new publicly accessible catalog with > 150
data sets.
•  Part of IBM’s Open for Data initiative
How to get started
Managing your cluster
Ambari console
•  Inspect status, start/stop services, etc.
•  Launch via Web browser (or cloud-specific link), e.g. http://myhost.ibm.com:8080
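Service status can also be read without the browser, since Ambari exposes a REST API under /api/v1 on the same port. A minimal sketch that only builds the request, assuming basic authentication; the host, cluster name, and credentials are placeholders, and the helper name is hypothetical:

```python
import base64
import urllib.request

def service_status_request(host, cluster, service, user, password, port=8080):
    """Build (but do not send) a GET request for one service's status."""
    url = f"http://{host}:{port}/api/v1/clusters/{cluster}/services/{service}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

# Placeholder host, cluster name, and credentials; substitute your own.
req = service_status_request("myhost.ibm.com", "mycluster", "HDFS", "admin", "secret")
print(req.full_url)

# Sending it requires a reachable Ambari server, for example:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode())
```

The response is JSON; in the Ambari versions I am aware of, the current state appears under the ServiceInfo section, but check the API documentation for your release.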
How to get started
Storing, querying, and analyzing data with Big SQL
Overview of SQL for Hadoop (Big SQL)
[Figure: Big SQL architecture. Labels: SQL-based Application, IBM data server client, Big SQL Engine, SQL MPP Run-time, Data Storage, DFS.]
§  Comprehensive, standard SQL
–  SELECT: joins, unions, aggregates, subqueries . . .
–  GRANT/REVOKE, INSERT … INTO
–  Procedural logic in SQL
–  Stored procs, user-defined functions
–  IBM data server JDBC and ODBC drivers
§  Optimization and performance
–  IBM MPP engine (C++) replaces Java MapReduce layer
–  Continuous running daemons (no start up latency)
–  Message passing allows data to flow between nodes without
persisting intermediate results
–  In-memory operations with ability to spill to disk (useful for
aggregations, sorts that exceed available RAM)
–  Cost-based query optimization with 140+ rewrite rules
§  Various storage formats supported
–  Data persisted in DFS, Hive, HBase
–  No IBM proprietary format required
§  Integration with RDBMSs via LOAD, query federation
Invocation options
•  Command-line interface: Java SQL Shell (JSqsh)
•  Web tooling (Data Server Manager)
•  Tools that support IBM JDBC/ODBC driver
Creating a Big SQL table
•  Standard CREATE TABLE DDL with extensions
create hadoop table users
(
  id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null
)
row format delimited
  fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in DFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
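For illustration, data for the users table above would be plain text with one row per line and | between fields. A minimal sketch that builds such text; the sample rows and the helper name are made up for this example:

```python
import csv
import io

# Hypothetical rows for the users table: id, office_id, fname, lname.
rows = [
    (1, 10, "Ana", "Lopez"),
    (2, None, "Raj", "Patel"),   # office_id is nullable in the DDL
]

def to_delimited(records, delimiter="|"):
    """Serialize records as delimited text, one row per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for rec in records:
        # Write None as an empty field; how NULL is encoded in practice
        # depends on the table's settings, so treat this as an assumption.
        writer.writerow(["" if v is None else v for v in rec])
    return buf.getvalue()

text = to_delimited(rows)
print(text, end="")
```

Because constraints are not enforced, nothing stops a file like this from containing duplicate id values; keeping the data clean is up to whoever produces it.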
Results from previous CREATE TABLE . . .
•  Data stored in subdirectory of Hive warehouse
. . . /hive/warehouse/myid.db/users
– Default schema is user ID. Can create new schemas
– “Table” is just a subdirectory under schema.db
– Table’s data are files within table subdirectory
•  Metadata collected (Big SQL & Hive)
– SYSCAT.* and SYSHADOOP.* views
•  Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL
schema over existing DFS directory contents
– Useful if table contents already in DFS
– Avoids need to LOAD data
Populating Tables via LOAD
•  Typically best runtime performance
•  Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
•  Load data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
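Note the doubled single quotes in the WHERE clause of the JDBC example: the predicate is itself passed as a SQL string literal, so each quote inside it is written twice. A small sketch of that escaping rule (the helper name is hypothetical):

```python
def sql_literal(text):
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

# The predicate as you would write it in an ordinary query.
predicate = "CONTACTDATE < '2012-02-01'"

# The same predicate as it appears inside the LOAD statement.
quoted = sql_literal(predicate)
print(quoted)
```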
Querying your Big SQL tables
•  Same as ISO-compliant RDBMS
•  No special query syntax for Hadoop tables
– Projections, restrictions
– UNION, INTERSECT, EXCEPT
– Wide range of built-in functions (e.g. OLAP)
– Full support for subqueries
– All standard join operations
– . . .
SELECT
    s_name,
    count(*) AS numwait
FROM
    supplier,
    lineitem l1,
    orders,
    nation
WHERE
    s_suppkey = l1.l_suppkey
    AND o_orderkey = l1.l_orderkey
    AND o_orderstatus = 'F'
    AND l1.l_receiptdate > l1.l_commitdate
    AND EXISTS (
        SELECT *
        FROM lineitem l2
        WHERE l2.l_orderkey = l1.l_orderkey
          AND l2.l_suppkey <> l1.l_suppkey
    )
    AND NOT EXISTS (
        SELECT *
        FROM lineitem l3
        WHERE l3.l_orderkey = l1.l_orderkey
          AND l3.l_suppkey <> l1.l_suppkey
          AND l3.l_receiptdate > l3.l_commitdate
    )
    AND s_nationkey = n_nationkey
    AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
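To see what the EXISTS / NOT EXISTS pair in the query above selects, here is a cut-down version run on a few invented rows. SQLite stands in for Big SQL purely to show the logic; the table contents are made up, and the supplier, orders, and nation joins are left out to keep the sketch short:

```python
import sqlite3

# SQLite is used only as a convenient SQL engine; this is not Big SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (
        l_orderkey INTEGER, l_suppkey INTEGER,
        l_receiptdate TEXT, l_commitdate TEXT);
    INSERT INTO lineitem VALUES
        (1, 100, '2016-01-10', '2016-01-05'),  -- supplier 100 late
        (1, 200, '2016-01-03', '2016-01-05'),  -- supplier 200 on time
        (2, 100, '2016-02-10', '2016-02-05'),  -- supplier 100 late
        (2, 300, '2016-02-09', '2016-02-05');  -- supplier 300 also late
""")

# Count, per supplier, the late line items on multi-supplier orders
# where no other supplier on the same order was late.
query = """
    SELECT l1.l_suppkey, count(*) AS numwait
    FROM lineitem l1
    WHERE l1.l_receiptdate > l1.l_commitdate
      AND EXISTS (
          SELECT * FROM lineitem l2
          WHERE l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey)
      AND NOT EXISTS (
          SELECT * FROM lineitem l3
          WHERE l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate)
    GROUP BY l1.l_suppkey
    ORDER BY numwait DESC, l1.l_suppkey
"""
result = conn.execute(query).fetchall()
print(result)
```

Only supplier 100 on order 1 qualifies: on order 2 another supplier was also late, so the NOT EXISTS condition rejects both rows.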
Accessing Big SQL data from Spark shell
// based on BigInsights 4.1, which includes Spark 1.5.1
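// import the MLlib classes used below
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// establish a Hive context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)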
// query some Big SQL data
val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")
// action on the data – count # of rows
saleFacts.count()
. . .
// transform the data as needed (create a Vector with data from 2 cols)
val subset = saleFacts.map {row =>
Vectors.dense(row.getDouble(16),row.getDouble(17))}
// invoke basic Spark MLlib statistical function over the data
val stats = Statistics.colStats(subset)
// print one of the statistics collected
println(stats.mean)
A word about . . . SerDes
•  Custom serializers / deserializers (SerDes)
– Read / write complex or “unusual” data formats (e.g., JSON)
– Commonly used by Hadoop community
– Developed by user or available publicly
•  Add SerDes to directories; reference SerDe when creating table
-- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
-- Location clause points to DFS dir containing JSON data
-- External clause means DFS dir & data won't be dropped by a DROP TABLE command
create external hadoop table socialmedia_json (Country varchar(20), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';
select * from socialmedia_json;
Sample JSON input for previous example
JSON-based social media data to load into Big SQL table socialmedia_json defined with SerDe
Sample Big SQL query output for JSON data
Sample output: select * from socialmedia_json
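Conceptually, a JSON SerDe turns each JSON record into one row by picking out the fields named in the table definition. The sketch below imitates that mapping in plain Python. It is not how the SerDe is implemented, and every field other than Country is a hypothetical stand-in for the columns elided in the CREATE TABLE statement:

```python
import json

# Columns the table declares; only Country comes from the slide,
# FeedInfo is a made-up second column for illustration.
columns = ["Country", "FeedInfo"]

# Two made-up JSON records, one per line, as they might sit in a DFS file.
raw_lines = [
    '{"Country": "US", "FeedInfo": "blog", "Extra": 1}',
    '{"Country": "DE"}',
]

def to_row(line, cols):
    """Project one JSON record onto the declared columns (missing -> None)."""
    record = json.loads(line)
    return tuple(record.get(c) for c in cols)

rows = [to_row(line, columns) for line in raw_lines]
for row in rows:
    print(row)
```

Fields that the table does not declare are ignored, and declared fields that a record lacks come back empty, which is why one table definition can sit over loosely structured input.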
How to get started
Exploring your data without writing code using BigSheets
Spreadsheet-style analysis (BigSheets)
•  Web-based analysis and
visualization
•  Spreadsheet-like interface
– Explore, manipulate data
without writing code
– Invoke pre-built functions
– Generate charts
– Export results of analysis
– Create custom plug-ins
– . . .
Working with BigSheets
•  Create workbook for data in DFS
•  Customize workbook through graphical
editor and built-in functions
– Filter data
– Apply functions / macros / formulas
– Combine data from multiple workbooks
•  “Run” workbook: apply work to full data
set
•  Explore results in spreadsheet format
and/or create charts
•  Optionally, export your data
[Figure: BigSheets processing flow. Labels: Builder, Front End, Evaluation Service, Simulation, Full Execution, Pig, Model, Model with Data, Results.]
Summary and resources
Discover how you can take the next step with Big Data
Summary
•  Big Data analytics in high demand
– Open source technologies (e.g., Apache Hadoop, Spark)
– Vendor-specific analytic tools, engines, and applications
•  Multiple options to build Big Data skills with IBM BigInsights
– Cloud: Bluemix, IMDemo cloud (tech previews)
– VMware / Docker images for your laptop (free download)
– IBM BigInsights Quick Start edition native installation (free download)
Hadoop Dev: developer site for IBM BigInsights
Downloads, forums, labs, papers, etc on Hadoop Dev
https://developer.ibm.com/hadoop/
Thank You
Your Feedback is Important!
Access the InterConnect 2016 Conference Attendee
Portal to complete your session surveys from your
smartphone,
laptop or conference kiosk.

More Related Content

What's hot

Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Nicolas Morales
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark Cynthia Saracco
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Nicolas Morales
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Nicolas Morales
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guideCynthia Saracco
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab Cynthia Saracco
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love ItIBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love ItIBM Analytics
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
Big Data: Get started with SQL on Hadoop self-study lab
Big Data:  Get started with SQL on Hadoop self-study lab Big Data:  Get started with SQL on Hadoop self-study lab
Big Data: Get started with SQL on Hadoop self-study lab Cynthia Saracco
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdIBM Analytics
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKhuguk
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteAmr Awadallah
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase Cynthia Saracco
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 

What's hot (18)

Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
Big SQL 3.0: Datawarehouse-grade Performance on Hadoop - At last!
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guide
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love ItIBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
Big Data: Get started with SQL on Hadoop self-study lab
Big Data:  Get started with SQL on Hadoop self-study lab Big Data:  Get started with SQL on Hadoop self-study lab
Big Data: Get started with SQL on Hadoop self-study lab
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 

Viewers also liked

Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 
Generating Big Value from Big Data
Generating Big Value from Big DataGenerating Big Value from Big Data
Generating Big Value from Big DataBrendan Aldrich
 
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)Kevin Sutter
 
IBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platformsIBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platformsMarkTaylorIBM
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsightsCynthia Saracco
 
MQ Security Overview
MQ Security OverviewMQ Security Overview
MQ Security OverviewMarkTaylorIBM
 
IBM MQ - Monitoring and Managing Hybrid Messaging Environments
IBM MQ - Monitoring and Managing Hybrid Messaging EnvironmentsIBM MQ - Monitoring and Managing Hybrid Messaging Environments
IBM MQ - Monitoring and Managing Hybrid Messaging EnvironmentsMarkTaylorIBM
 
Data-Ed Webinar: Demystifying Big Data
Data-Ed Webinar: Demystifying Big Data Data-Ed Webinar: Demystifying Big Data
Data-Ed Webinar: Demystifying Big Data DATAVERSITY
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...Romeo Kienzler
 
IBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platformsIBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platformsMarkTaylorIBM
 
IBM MQ - better application performance
IBM MQ - better application performanceIBM MQ - better application performance
IBM MQ - better application performanceMarkTaylorIBM
 
MQ What's New Beyond V8 - V8003 level
MQ What's New Beyond V8 - V8003 levelMQ What's New Beyond V8 - V8003 level
MQ What's New Beyond V8 - V8003 levelMarkTaylorIBM
 
What's new in IBM MQ Messaging
What's new in IBM MQ MessagingWhat's new in IBM MQ Messaging
What's new in IBM MQ MessagingMarkTaylorIBM
 
Understanding mq deployment choices and use cases
Understanding mq deployment choices and use casesUnderstanding mq deployment choices and use cases
Understanding mq deployment choices and use casesLeif Davidsen
 
IBM InterConnect 2016: Security for DevOps in an Enterprise
IBM InterConnect 2016: Security for DevOps in an Enterprise IBM InterConnect 2016: Security for DevOps in an Enterprise
IBM InterConnect 2016: Security for DevOps in an Enterprise Sanjeev Sharma
 
Iib v10 performance problem determination examples
Iib v10 performance problem determination examplesIib v10 performance problem determination examples
Iib v10 performance problem determination examplesMartinRoss_IBM
 
IBM MQ - High Availability and Disaster Recovery
IBM MQ - High Availability and Disaster RecoveryIBM MQ - High Availability and Disaster Recovery
IBM MQ - High Availability and Disaster RecoveryMarkTaylorIBM
 
Expanding your options with the IBM MQ Appliance - IBM InterConnect 2016
Expanding your options with the IBM MQ Appliance - IBM InterConnect 2016Expanding your options with the IBM MQ Appliance - IBM InterConnect 2016
Expanding your options with the IBM MQ Appliance - IBM InterConnect 2016Leif Davidsen
 
DevOps & Continuous Test for IIB and IBM MQ
DevOps & Continuous Test for IIB and IBM MQDevOps & Continuous Test for IIB and IBM MQ
DevOps & Continuous Test for IIB and IBM MQStuart Feasey
 
Le big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entrepriseLe big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entrepriseRubedo, a WebTales solution
 

Viewers also liked (20)

Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
Generating Big Value from Big Data
Generating Big Value from Big DataGenerating Big Value from Big Data
Generating Big Value from Big Data
 
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)
InterConnect 2016, OpenJPA and EclipseLink Usage Scenarios (PEJ-5303)
 
IBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platformsIBM MQ - Comparing Distributed and z/OS platforms
IBM MQ - Comparing Distributed and z/OS platforms
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
 
MQ Security Overview
MQ Security OverviewMQ Security Overview
MQ Security Overview
 
IBM MQ - Monitoring and Managing Hybrid Messaging Environments

Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics

  • 1. Get Started with Big Data Analytics Cynthia Saracco (Saracco@us.ibm.com), Session #1031
  • 2. Please Note: •  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. •  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. •  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. •  The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. •  Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. Executive summary •  Big Data analytics growing rapidly across geo’s, industries •  Core open source technologies often include Hadoop, Spark •  Common challenges – Getting started and growing skills – Demonstrating value quickly •  Focus of this talk: – Using open source and IBM technologies for Big Data analytics (on cloud or on premise installations) – Emphasis on free services and software with modest skill requirements – available on cloud or on-premise installations
  • 4. Agenda •  The big picture on Big Data: growth projections, applications, challenges •  Understanding IBM’s approach: combining open source and IBM-specific technologies with IBM BigInsights •  How to get started? –  Cloud, on-premise installation options –  Managing your cluster –  Storing, querying, and analyzing your data with IBM Big SQL (easy on-ramp to Hadoop for SQL professionals) –  Exploring your data without writing code using IBM BigSheets –  . . . •  Summary / resources
  • 5. The big picture on Big Data Opportunities, requirements, applications . . . .
  • 6. Big Data presents new opportunities for insights . . . •  1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have •  1 in 2 business leaders say they don’t have access to the information they need to do their jobs •  83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness •  60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions •  Data volumes: 2.5 million items per minute; 300,000 tweets per minute; 200 million emails per minute; 220,000 photos per minute; 5 TB per flight; > 1 PB per day (gas turbines)
  • 7. What we hear from customers . . . . •  Lots of potentially valuable data is dormant or discarded due to size/performance considerations •  Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .) •  Not clear what should be analyzed (exploratory, iterative) •  Information distributed across multiple systems and/or Internet •  Some information has a short useful lifespan •  Volumes can be extremely high •  Analysis needed in the context of existing information (not stand alone)
  • 8. Big Data in practice
  • 9. IBM Big Data customer scenarios (Hadoop-based) •  Applications –  Data warehouse integration –  Cloud-based analytics –  Telematics –  Targeted marketing campaigns –  Optimization of capital investments –  . . . •  Industries –  Insurance –  Travel –  Entertainment –  Energy –  Technology –  Banking –  . . . . •  http://www.ibm.com/analytics/us/en/case-studies.html#topic=hadoop •  https://developer.ibm.com/hadoop/blog/2015/11/03/biginsights-and-big-sql-customer-use-cases/
  • 10. IBM’s Big Data approach Leveraging Hadoop, Spark, and IBM technologies
  • 11. IBM analytics platform strategy for Big Data •  Integrate and manage the full variety, velocity and volume of Big Data •  Apply advanced analytics •  Visualize all available data for ad-hoc analysis •  Support workload optimization and scheduling •  Provide for security and governance •  Integrate with enterprise software •  [Diagram: IBM ANALYTICS PLATFORM. Built on Spark. Hybrid. Trusted. Components: Discovery & Exploration, Prescriptive Analytics, Predictive Analytics, Content Analytics, Business Intelligence, Data Mgmt, Hadoop & NoSQL, Content Mgmt, Data Warehouse, Information Integration & Governance, Spark Analytics Operating System, Machine Learning. On premises, on cloud. Data at Rest & In-motion, Inside & Outside the Firewall, Structured & Unstructured]
  • 12. Hadoop and the enterprise •  Ingestion and Real-time Analytic Zone: Streams, Connectors •  Warehousing Zone: Enterprise Warehouse, Data Marts •  Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery •  Landing and Analytics Sandbox Zone: Hive/HBase, Col Stores, Documents in variety of formats, MapReduce, Hadoop •  Metadata and Governance Zone: ETL, MDM, Data Governance
  • 13. Cloud Data Services is Open For Data •  An open portfolio of self-service, composable data and analytic services for the developer, data science professional, and analytic architect. We help transform businesses and organizations to build applications and gain new insights better and faster. •  Comprehensive –  Broadest selection of data and analytic services available on multiple cloud platforms –  Pre-built integrations across the portfolio –  Integrated with open data to gain deeper insights •  Trusted –  Fully managed: 24 x 7 –  Secure infrastructure –  Mitigate risk and lower costs •  Flexible –  Open-sourced driven innovation –  Industry leading support for hybrid deployments –  Bare metal, virtual, pay-as-you-go and reserved
  • 14. IBM BigInsights for Apache Hadoop §  Analytical platform for persistent Big Data –  100% open source core with IBM add-ons for analysts, data scientists, and admins –  Includes Hadoop and Spark –  On premise or cloud §  Distinguishing characteristics –  Built-in analytics . . . . Enhances business knowledge –  Enterprise software integration . . . . Complements and extends existing capabilities –  Production-ready . . . . Speeds time-to-value §  IBM advantage –  Combination of software, hardware, services and research §  [Diagram: IBM ANALYTICS PLATFORM, same components as on slide 11]
  • 15. Overview of BigInsights •  IBM Open Platform with Apache Hadoop (HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper) •  IBM BigInsights Analyst: Big SQL (industry standard SQL), BigSheets (spreadsheet-style tool) •  IBM BigInsights Data Scientist: Text Analytics, Big R (R support), Machine Learning on Big R •  IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; multi-workload, multi-tenant scheduling •  Free Quick Start (non production): IBM Open Platform; BigInsights Analyst, Data Scientist features; community support •  . . .
  • 16. How to get started Acquiring an environment: cloud, on premise options
  • 17. Options for accessing IBM BigInsights •  Cloud options – Bluemix: http://bluemix.net – IMDemo cloud (technical previews): http://bigsql.imdemocloud.com •  Installations – VMware image – Docker image – Install image for your own cluster •  Options differ somewhat in breadth of features, privileges available – Check documentation available from the download site – Support available via forums
  • 18. Cloud options (Bluemix) •  Developer sandbox (Analytics for Hadoop) –  Prototype, demo, trial in the cloud –  Empowers developers to rapidly drive insight from all data –  Adds Hadoop-based analytics to your application –  Enterprise features: BigSheets, Big SQL, Text analytics, HiveQL, HttpFS –  Delivered via IBM Bluemix. To be decommissioned shortly. •  Production environment (BigInsights for Apache Hadoop) –  Production deployments at scale in the cloud –  Delivers flexibility and efficiency with subscription pricing –  Scales to meet spikes in demand without on-premise infrastructure –  Drives enterprise-class, complex analytics on Big Data sets –  Available via the IBM Cloud Marketplace and Bluemix •  Links: http://bluemix.net http://www.ibm.com/cloud http://www.bluemix.net
  • 19. Bluemix sandbox service To be decommissioned shortly; pay-as-you-go offering in closed beta
  • 20. IBM BigInsights on Cloud (Bluemix subscription) •  Secure, Dedicated Bare-metal Infrastructure; IBM Open Platform •  Small Nodes (basic data extraction, transformation, file processing, search): 20 cores, 64 GB RAM, 20 TB raw data disks (~6 TB usable), 8 TB OS disks, 10 Gb network •  Medium Nodes (data warehouse optimization: store new data or extend warehouse): 20 cores, 128 GB RAM, 28 TB raw data disks (~9 TB usable), 8 TB OS disks, 10 Gb network •  Large Nodes (advanced analytics: intensive data processing): 24 cores, 256 GB RAM, 32 TB raw data disks (~10 TB usable), 8 TB OS disks, 10 Gb network
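The raw and usable disk figures quoted for the three node sizes imply that only about 30% of raw data-disk capacity ends up usable. The sketch below just redoes that arithmetic in Python. The `nodes_needed` helper and the 40 TB example data set are hypothetical illustrations, not IBM sizing guidance. The roughly one-third ratio would be consistent with three-way replication plus working space, but the slide does not state the reason.

```python
import math

# Raw vs. approximate usable data-disk capacity per node (TB), as quoted on the slide.
NODE_SIZES = {
    "small": {"raw_tb": 20, "usable_tb": 6},
    "medium": {"raw_tb": 28, "usable_tb": 9},
    "large": {"raw_tb": 32, "usable_tb": 10},
}


def usable_ratio(size: str) -> float:
    """Fraction of raw data-disk capacity that is usable for data."""
    node = NODE_SIZES[size]
    return node["usable_tb"] / node["raw_tb"]


def nodes_needed(data_tb: float, size: str) -> int:
    """Hypothetical helper: nodes needed to hold data_tb, ignoring growth headroom."""
    return math.ceil(data_tb / NODE_SIZES[size]["usable_tb"])


for name in NODE_SIZES:
    print(f"{name}: {usable_ratio(name):.0%} of raw capacity usable")

# Illustrative only: a 40 TB data set on medium nodes.
print(nodes_needed(40, "medium"))
```

A real sizing exercise would also need to account for growth, temporary space, and compute requirements, none of which this sketch models.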
  • 21. IMDemo Cloud sandbox with technical previews To register for free use, visit http://bigsql.imdemocloud.com
  • 22. On-premise options: native install, VMware, Docker •  IBM BigInsights for Apache Hadoop •  IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper) •  IBM BigInsights Analyst: Big SQL (industry standard SQL), BigSheets (spreadsheet-style tool) •  IBM BigInsights Data Scientist: Text Analytics, Big R (R support), Machine Learning on Big R •  IBM BigInsights Enterprise Management: POSIX Distributed Filesystem; multi-workload, multi-tenant scheduling •  Free Quick Start (non production): IBM Open Platform; BigInsights Analyst, Data Scientist features; community support •  . . .
  • 23. Where to download images •  Download Quick Start offering •  Links available from HadoopDev (“try it for free”) –  https://developer.ibm.com/hadoop/
  • 24. Looking for data? Look to Bluemix . . . . •  IBM Analytics Exchange: new publicly accessible catalog with > 150 data sets. •  Part of IBM’s Open for Data initiative
  • 25. How to get started Managing your cluster
  • 26. Ambari console •  Inspect status, start/stop services, etc. •  Launch via Web browser (or cloud-specific link), e.g. http://myhost.ibm.com:8080
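Besides the web console, Ambari exposes a REST interface on the same port, which can be handy for scripted status checks. The sketch below only builds a status request and does not send it. The cluster name `mycluster`, the credentials, and the `ambari_service_request` helper are hypothetical. The `/api/v1/clusters/...` path follows the Apache Ambari REST API as commonly documented, so verify it against your Ambari version before relying on it.

```python
import base64
import urllib.request


def ambari_service_request(base_url, cluster, service, user, password):
    """Build (but do not send) a GET request for one service's state."""
    url = (
        f"{base_url}/api/v1/clusters/{cluster}"
        f"/services/{service}?fields=ServiceInfo/state"
    )
    # Ambari accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Host taken from the slide; cluster name and credentials are placeholders.
req = ambari_service_request(
    "http://myhost.ibm.com:8080", "mycluster", "HDFS", "admin", "admin-password"
)
print(req.full_url)

# To query a running Ambari server, send the request (assumed response shape):
# import json
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["ServiceInfo"]["state"])
```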
  • 27. How to get started Storing, querying, and analyzing data with Big SQL
  • 28. Overview of SQL for Hadoop (Big SQL) §  Comprehensive, standard SQL –  SELECT: joins, unions, aggregates, subqueries . . . –  GRANT/REVOKE, INSERT … INTO –  Procedural logic in SQL –  Stored procs, user-defined functions –  IBM data server JDBC and ODBC drivers §  Optimization and performance –  IBM MPP engine (C++) replaces Java MapReduce layer –  Continuously running daemons (no start-up latency) –  Message passing allows data to flow between nodes without persisting intermediate results –  In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) –  Cost-based query optimization with 140+ rewrite rules §  Various storage formats supported –  Data persisted in DFS, Hive, HBase –  No IBM proprietary format required §  Integration with RDBMSs via LOAD, query federation §  [Diagram: SQL-based Application (IBM data server client); Big SQL Engine (SQL MPP Run-time); Data Storage (DFS); BigInsights]
  • 29. Invocation options •  Command-line interface: Java SQL Shell (JSqsh) •  Web tooling (Data Server Manager) •  Tools that support IBM JDBC/ODBC driver
  • 30. Creating a Big SQL table •  Standard CREATE TABLE DDL with extensions
      create hadoop table users (
        id int not null primary key,
        office_id int null,
        fname varchar(30) not null,
        lname varchar(30) not null)
      row format delimited fields terminated by '|'
      stored as textfile;
    Worth noting: •  “Hadoop” keyword creates table in DFS •  Row format delimited and textfile formats are default •  Constraints not enforced (but useful for query optimization)
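To make the `row format delimited fields terminated by '|'` clause concrete, here is a small Python sketch that writes and re-reads rows in the layout the `users` table expects. The sample names and values are invented for illustration, and the sketch only shows the text layout. It does not talk to Big SQL or HDFS.

```python
import csv
import io

# Column order follows the CREATE TABLE statement: id, office_id, fname, lname.
# The rows themselves are made-up sample data.
rows = [
    (1, 10, "Ada", "Lovelace"),
    (2, 20, "Grace", "Hopper"),
]

# Write one line per row, with fields separated by the '|' delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue(), end="")

# Reading the text back splits each line on '|' again; all values arrive as strings.
parsed = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="|"))
print(parsed)
```

Files of this shape are what a delimited text table stores in its directory, which is why the delimiter in the DDL has to match the data.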
  • 31. Results from previous CREATE TABLE . . . •  Data stored in subdirectory of Hive warehouse . . . /hive/warehouse/myid.db/users – Default schema is user ID. Can create new schemas – “Table” is just a subdirectory under schema.db – Table’s data are files within table subdirectory •  Metadata collected (Big SQL & Hive) – SYSCAT.* and SYSHADOOP.* views •  Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL schema over existing DFS directory contents – Useful if table contents already in DFS – Avoids need to LOAD data
  • 32. Populating Tables via LOAD •  Typically best runtime performance •  Load data from local or remote file system
      load hadoop using file url
        'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
        with SOURCE PROPERTIES ('field.delimiter'='\t')
        INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
    •  Load data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection
      load hadoop using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
        with parameters (user='myID', password='myPassword')
        from table MEDIA columns (ID, NAME)
        where 'CONTACTDATE < ''2012-02-01'''
        into table media_db2table_jan overwrite
        with load properties ('num.map.tasks' = 10);
  • 33. Querying your Big SQL tables •  Same as ISO-compliant RDBMS •  No special query syntax for Hadoop tables – Projections, restrictions – UNION, INTERSECT, EXCEPT – Wide range of built-in functions (e.g. OLAP) – Full support for subqueries – All standard join operations – . . .
      SELECT s_name, count(*) AS numwait
      FROM supplier, lineitem l1, orders, nation
      WHERE s_suppkey = l1.l_suppkey
        AND o_orderkey = l1.l_orderkey
        AND o_orderstatus = 'F'
        AND l1.l_receiptdate > l1.l_commitdate
        AND EXISTS (
          SELECT * FROM lineitem l2
          WHERE l2.l_orderkey = l1.l_orderkey
            AND l2.l_suppkey <> l1.l_suppkey )
        AND NOT EXISTS (
          SELECT * FROM lineitem l3
          WHERE l3.l_orderkey = l1.l_orderkey
            AND l3.l_suppkey <> l1.l_suppkey
            AND l3.l_receiptdate > l3.l_commitdate )
        AND s_nationkey = n_nationkey
        AND n_name = ':1'
      GROUP BY s_name
      ORDER BY numwait desc, s_name;
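Because the query above is standard SQL, its logic can be tried on a laptop before running it against a cluster. The sketch below runs the same statement with Python's built-in sqlite3 module. SQLite stands in for Big SQL only because it ships with Python; the trimmed-down table definitions, the sample rows, and the 'CANADA' parameter value are all made up for illustration.

```python
import sqlite3

# Hypothetical, trimmed-down versions of the tables the query references.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nation   (n_nationkey INTEGER, n_name TEXT);
CREATE TABLE supplier (s_suppkey INTEGER, s_name TEXT, s_nationkey INTEGER);
CREATE TABLE orders   (o_orderkey INTEGER, o_orderstatus TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_suppkey INTEGER,
                       l_receiptdate TEXT, l_commitdate TEXT);
""")
conn.execute("INSERT INTO nation VALUES (1, 'CANADA')")
conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)",
                 [(1, "Supplier#1", 1), (2, "Supplier#2", 1), (3, "Supplier#3", 1)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100, "F"), (200, "F"), (300, "O"), (400, "F"), (500, "F")])
# A line item is "late" when its receipt date is after its commit date.
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", [
    (100, 1, "2016-01-10", "2016-01-05"),  # late
    (100, 2, "2016-01-03", "2016-01-05"),  # on time
    (200, 2, "2016-01-10", "2016-01-05"),  # late
    (200, 3, "2016-01-11", "2016-01-05"),  # late
    (300, 3, "2016-01-10", "2016-01-05"),  # late, but order status is 'O'
    (300, 1, "2016-01-03", "2016-01-05"),  # on time
    (400, 1, "2016-01-10", "2016-01-05"),  # late
    (400, 3, "2016-01-03", "2016-01-05"),  # on time
    (500, 3, "2016-01-10", "2016-01-05"),  # late
    (500, 2, "2016-01-03", "2016-01-05"),  # on time
])

# Same query as on the slide; the ':1' placeholder becomes a bound parameter.
QUERY = """
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (SELECT * FROM lineitem l2
              WHERE l2.l_orderkey = l1.l_orderkey
                AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (SELECT * FROM lineitem l3
                  WHERE l3.l_orderkey = l1.l_orderkey
                    AND l3.l_suppkey <> l1.l_suppkey
                    AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ?
GROUP BY s_name
ORDER BY numwait DESC, s_name
"""
results = conn.execute(QUERY, ("CANADA",)).fetchall()
print(results)
```

With this sample data, only suppliers that were the sole late supplier on a multi-supplier order with status 'F' are counted, so Supplier#2 does not appear in the result.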
  • 34. Accessing Big SQL data from Spark shell
      // based on BigInsights 4.1, which includes Spark 1.5.1
      // imports needed for Vectors and Statistics used below
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.Statistics
      // establish a Hive context
      val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
      // query some Big SQL data
      val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")
      // action on the data – count # of rows
      saleFacts.count()
      . . .
      // transform the data as needed (create a Vector with data from 2 cols)
      val subset = saleFacts.map {row => Vectors.dense(row.getDouble(16),row.getDouble(17))}
      // invoke basic Spark MLlib statistical function over the data
      val stats = Statistics.colStats(subset)
      // print one of the statistics collected
      println(stats.mean)
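For readers without a Spark cluster at hand, the following plain-Python sketch mirrors the last three steps of the Spark example: select two numeric columns from each row, then compute the per-column mean. It is not Spark code and does not reproduce everything colStats returns; the sample rows, the column positions, and the function name column_means are hypothetical.

```python
from statistics import mean

# Hypothetical stand-in for rows fetched from a sales fact table.
# Columns: (quantity, unit_cost, unit_price). All values are made up.
rows = [
    (10, 2.0, 3.0),
    (20, 4.0, 5.0),
    (30, 6.0, 10.0),
]


def column_means(rows, col_indexes):
    """Return the mean of each selected column, in the order given."""
    # Keep only the requested columns, as the Spark example does with
    # row.getDouble(16) and row.getDouble(17).
    subset = [[float(row[i]) for i in col_indexes] for row in rows]
    # zip(*subset) turns the list of rows into a sequence of columns.
    return [mean(column) for column in zip(*subset)]


row_count = len(rows)               # analogous to saleFacts.count()
means = column_means(rows, [1, 2])  # analogous to stats.mean
print(row_count, means)
```

On a real cluster the Spark version distributes this work across nodes; the sketch only shows the shape of the computation.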
  • 35. A word about . . . SerDes •  Custom serializers / deserializers (SerDes) – Read / write complex or “unusual” data formats (e.g., JSON) – Commonly used by Hadoop community – Developed by user or available publicly •  Add SerDes to directories; reference SerDe when creating table
      -- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
      -- Location clause points to DFS dir containing JSON data
      -- External clause means DFS dir & data won’t be dropped after DROP TABLE command
      create external hadoop table socialmedia-json (Country varchar(20), . . . )
      row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
      location '</hdfs_path>/myJSON';
      select * from socialmedia-json;
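Conceptually, a SerDe converts between a stored record and a table row. The Python sketch below illustrates that idea for JSON, with one JSON object per line. It is not the Hive SerDe interface or the JsonSerde implementation. Only the Country column comes from the slide; the Author column, the function names, and the handling of missing keys (returned as None) are assumptions for illustration.

```python
import json

# "Country" appears on the slide; "Author" is a hypothetical extra column.
COLUMNS = ["Country", "Author"]


def deserialize(line, columns=COLUMNS):
    """Turn one JSON record into a row tuple ordered by the table's columns.

    Assumption for this sketch: keys missing from the record become None,
    and keys that are not table columns are ignored.
    """
    record = json.loads(line)
    return tuple(record.get(column) for column in columns)


def serialize(row, columns=COLUMNS):
    """Turn a row tuple back into a JSON record."""
    return json.dumps(dict(zip(columns, row)))


# Made-up sample records.
json_lines = [
    '{"Country": "CA", "Author": "jdoe", "Extra": 1}',
    '{"Country": "US"}',
]
table_rows = [deserialize(line) for line in json_lines]
print(table_rows)
print(serialize(table_rows[0]))
```

The table definition supplies the column list, and the SerDe decides how each stored record is mapped onto it.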
  • 36. Sample JSON input for previous example JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe
  • 37. Sample Big SQL query output for JSON data Sample output: Select * from socialmedia-json
  • 38. How to get started Exploring your data without writing code using BigSheets
  • 39. Spreadsheet-style analysis (BigSheets) •  Web-based analysis and visualization •  Spreadsheet-like interface – Explore, manipulate data without writing code – Invoke pre-built functions – Generate charts – Export results of analysis – Create custom plug-ins – . . .
  • 40. Working with BigSheets •  Create workbook for data in DFS •  Customize workbook through graphical editor and built-in functions – Filter data – Apply functions / macros / formulas – Combine data from multiple workbooks •  “Run” workbook: apply work to full data set •  Explore results in spreadsheet format and/or create charts •  Optionally, export your data [Diagram labels: Builder, Front End, Evaluation Service, Simulation, PIG, Results, Model, Model w/ Data, Full Execution]
  • 41. Summary and resources Discover how you can take the next step with Big Data
  • 42. Summary •  Big Data analytics in high demand – Open source technologies (e.g., Apache Hadoop, Spark) – Vendor-specific analytic tools, engines, and applications •  Multiple options to build Big Data skills with IBM BigInsights – Cloud: Bluemix, IMDemo cloud (tech previews) – VMware / Docker images for your laptop (free download) – IBM BigInsights Quick Start edition native installation (free download)
  • 43. Hadoop Dev: developer site for IBM BigInsights •  Downloads, forums, labs, papers, etc. on Hadoop Dev: https://developer.ibm.com/hadoop/
  • 44. Thank You Your Feedback is Important! Access the InterConnect 2016 Conference Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.