How Concur uses Big Data to get you to Tableau Conference On Time

Denny Lee
Senior Director, Data Sciences Engineering

About Concur
What do we do?
• Leading provider of spend management solutions and
services (Travel, Invoice, TripIt, etc.) in the world
• Global customer base of 20,000 clients and 25 million
users
• Processing more than $50 Billion in Travel & Expense
(T&E) spend each year

About the Speaker
Who Am I?
• Long time SQL Server BI
guy (24TB Yahoo! Cube)
• Project Isotope (Hadoop
on Windows and Azure)
• At Concur, helping with
Big Data and Data
Sciences

Is Big Data ….
The most overused buzzword today?
An actual useful framework?
Yes!

Themes: Consolidate, Visualize, Insight, Recommend

[Diagram of data sources: BTS, Invoice, Web Analytics, Expense, Travel, Weather]

A long time ago…
• We started using Hadoop because
• It was free
• i.e., we didn’t want to pay for a big data warehouse
• Could slowly extract from hundreds of relational data
sources, consolidate it, and query it
• We were not thinking about advanced analytics
• We were thinking …. “cheaper reporting”
• We have some hardware lying around … let’s cobble it
together and now we have reports

But why Hadoop?
• Even with primarily relational systems, it involved
hundreds of sources
• Getting Tableau or any BI tool to connect to so many
sources is … not fun
• More often than not, we needed to understand a subset
or aggregate of this data, not all of the data!
• Can use Pig to process, extract, filter the data
• Can use Hive, with its SQL-like query language, to query
the data

demo
Querying Hive via Hue and Tableau
to understand Air Traffic patterns

Connecting to Hive using Hue - can query using HiveQL, a SQL-like query language

Install the Cloudera Hive driver, choose Cloudera Hadoop as the connection in Tableau,
fill in the connection details, and you’re connected to Hive

Connecting Tableau to Hive may take a very long time in Live mode

Instead, choose Extract, which brings the data across from Hive so that your
queries run against the extract inside Tableau. Note that the extraction will take a long time too!

Now that the data is in Tableau, I can pivot, slice, and filter at the speed of thought!

Can quickly switch to map mode and determine where most itineraries are from in 2013

If you’re expecting Hadoop
or Hive to be fast…

Evolution of Hive
• Hive, originally built by Facebook, places
a SQL-like query language in front of
Hadoop MapReduce.
• Has its flexibility but also its overhead
and complexity
• The Apache community is working on the
Stinger project to advance Hive,
including a DAG scheduler, an optimized
columnar format, and improved engine
semantics

demo
Querying Impala via Hue and Tableau
to understand Air Departure Delays

Query airport information using Impala, sort of looks like Hive so far…

But notice that the query runs significantly faster in Impala!

Not just LIMIT 10 types of queries, but also ones that involve more complicated
WHERE clauses

And quickly chart out the results - e.g. highest airport in Taiwan is
Sun Moon Lake

Or even quickly map out the airport locations on a map to see that Sun Moon
Lake Airport is in the center of Taiwan

And using Impala is not just for Hue
- it’s even better on Tableau

Now I can connect to my data live and have fast queries returned to Tableau

After quickly modifying the data within Tableau, can see the number of flight
delays to Seattle, and note that San Jose has the fewest delays

Why Impala?
• Focus is to speed up BI queries
• Analogous to relational BI tools except
now I can do this against a distributed
cluster
• As with relational BI tools, being
special-purpose means it can apply a lot of
optimizations to improve speed
• But note this demo ran against the
same Hive table, with the data stored in
Hadoop

demo
Leveraging AtScale to build models on
Impala and slicing them in Tableau

Using AtScale to build up a dimensional model based on the data that is
stored within Impala / Hive

Slice and filter the Impala model using Tableau
For more info, check out: http://atscale.com/

Data Extraction
How to query multiple endpoints or multiple data sources?
Set up a whole bunch of VMs and have someone connect to
each one and execute GET commands?

Optimizing Data Extraction
Use Hadoop Streaming to execute a Python script that performs the GET calls
Hadoop will generate tasks for the API GET calls and then execute
them in parallel across all the nodes in the cluster
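
The idea above can be sketched as a Hadoop Streaming mapper. This is a minimal sketch, not the script used at Concur: it assumes the job input is a text file with one endpoint URL per line (a hypothetical layout), and the names `fetch` and `run_mapper` are illustrative. Hadoop Streaming passes each mapper its share of the input lines on stdin and collects tab-separated key/value lines from stdout.

```python
"""Hadoop Streaming mapper sketch: one HTTP GET per input line."""
import sys
import urllib.request


def fetch(url, timeout=30):
    """Perform a GET and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_mapper(lines, out, fetcher=fetch):
    """Write 'url<TAB>body' to `out` for every non-empty input line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            body = fetcher(url)
        except Exception as exc:  # keep the task alive on a bad endpoint
            body = "ERROR: %s" % exc
        # Streaming output is line-oriented, so flatten tabs and newlines.
        flat = body.replace("\t", " ").replace("\r", " ").replace("\n", " ")
        out.write("%s\t%s\n" % (url, flat))
```

To use this as the mapper script, the file would end with a call to `run_mapper(sys.stdin, sys.stdout)` and be passed to the streaming job through its `-mapper` option. How many GET calls actually run in parallel depends on how Hadoop splits the input file across map tasks.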

Quick Primer on Apache Spark

What is Apache Spark?
Fast and general cluster computing system
interoperable with Hadoop
Improves efficiency through:
»In-memory computing primitives
»General computation graphs
Improves usability through:
»Rich APIs in Scala, Java, Python
»Interactive shell
Up to 100× faster
(2-10× on disk)
2-5× less code

Project History
Started in 2009, open sourced 2010
30+ companies now contributing code
»Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo,
…
One of the largest communities in big data

A General Stack
Spark, with components built on top of it:
• Spark Streaming: real-time
• Shark: SQL
• GraphX: graph
• MLlib: machine learning
• …

demo
Applying Spark for Recommendations

Starbucks Store #3313
601 108th Ave NE
Bellevue, WA (425) 646-9602
-------------------------------
Chk 713452
05/14/2014 11:04 AM
1961558 Drawer: 1 Reg: 1
-------------------------------
Bacon Art Brkfst 3.45
Warmed
T1 Latte 2.70
Triple 1.50
Soy 0.60
Gr Vanilla Mac 4.15
Reload Card 50.00
AMEX $50.00
XXXXXXXXXXXXXXXXXX1004
SBUX Card $13.56
SUBTOTAL $62.40
New Caffe Espresso
Frappuccino(R) Blended beverage
Our Signature
Frappuccino(R) roast coffee and
fresh milk, blended with ice.
Topped with our new espresso
whipped cream and new
Italian roast drizzle

Expense Categorization
• Above is one of my receipts, which I had OCRed
• One of the issues we’re trying to solve is auto-categorizing
receipts like this, so how can we do it?
• Below is a simplistic solution using WordCount
• Note: a real solution should involve machine learning algorithms
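
One way word counts could feed a categorization is to score each category by how often its keywords appear in the receipt. The sketch below is plain Python and only illustrates that idea: the category names and keyword lists are hypothetical, and, as noted above, a real solution would use machine learning rather than hand-picked keywords.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would learn these from data.
CATEGORY_KEYWORDS = {
    "Meals": {"latte", "coffee", "espresso", "frappuccino", "brkfst"},
    "Lodging": {"hotel", "room", "nights"},
    "Ground transport": {"taxi", "fare", "parking"},
}


def categorize(receipt_text):
    """Return the category whose keywords occur most often, or None."""
    # Lowercase the text and keep alphabetic runs only, so that
    # "Frappuccino(R)" contributes the token "frappuccino".
    words = Counter(re.findall(r"[a-z]+", receipt_text.lower()))
    scores = {
        category: sum(words[k] for k in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these keyword lists, a line such as "T1 Latte 2.70" counts toward "Meals", while a receipt with no matching keyword returns `None` rather than a guess.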

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt")
receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
scala> receipt.count
res0: Long = 30

scala> val words = receipt.flatMap(_.split(" "))
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at
<console>:14
scala> words.count
res1: Long = 161
scala> words.distinct.count
res2: Long = 72
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
scala> wordCounts.take(12)
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
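
In the output above, the most frequent token is the empty string (82 occurrences). Splitting on a single space produces an empty token for every extra space in a run, and OCRed receipts are full of padding. Python's `str.split` behaves the same way when given an explicit `" "` separator, so the effect can be sketched without Spark. The sample line below uses assumed spacing, since the real spacing of the OCR output is not shown.

```python
from collections import Counter

# Sample receipt line with assumed spacing (three spaces before the price).
line = "T1 Latte   2.70"

naive = line.split(" ")  # explicit separator: runs of spaces yield "" tokens
clean = line.split()     # no separator: splits on any run of whitespace


def word_counts(lines):
    """Count words across lines, most frequent first, with no empty tokens."""
    counts = Counter(word for text in lines for word in text.split())
    return counts.most_common()
```

In the Spark shell, the analogous change would be to split on a whitespace pattern such as `"\\s+"` and drop empty strings with `filter(_.nonEmpty)` before counting.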

Still beta, but can connect from Tableau to SparkSQL using Shark driver

Can / will be able to connect to this SparkSQL live

Quick view of Android vs. iOS mobile sessions

SparkSQL - What’s Next?
• Currently makes use of Hive code-base
• Major focus for 1.2
• Pluggable external datasources
• Easier access through pure SQL
interface
• Access things like JSON tables
through SQL?

Consolidate, Visualize, Insight, Recommend

Invite
• Pacific Northwest Cloudera User Group
• http://bit.ly/1uFD6vJ
• Doug Cutting, Hadoop Co-Creator, will be speaking at
Disney on 9/24
• Seattle Spark Meetup
• http://bit.ly/1q4Z0Ke
• Next sessions:
• Deep Dive into Spark and Mesos Internals
• Unlocking your Hadoop data with Apache Spark
and CDH5
