New Developments in Spark



Slide 0

New Developments in Spark
Matei Zaharia, August 18, 2015


Slide 1

About Databricks
• Founded by the creators of Spark in 2013; remains the top contributor to the project
• End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs


Slide 2

Our Goal for Spark
A unified engine across data workloads and platforms
[Diagram: Streaming, SQL, ML, Graph, Batch, and other workloads running on one engine]


Slide 3

Past 2 Years
Fast growth in libraries and integration points:
• New library for SQL + DataFrames
• 10x growth of the ML library
• Pluggable data source API
• R language support
Result: very diverse use of Spark
• Only 40% of users run on Hadoop YARN
• Most users use at least 2 of Spark's built-in libraries
• 98% of Databricks customers use SQL, 60% use Python


Slide 4

Beyond Libraries
The best thing about basing Spark's libraries on a high-level API is that we can also make big changes underneath them.
Now working on some of the largest changes to Spark Core since the project began.


Slide 5

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 6

Hardware Trends
• Storage
• Network
• CPU


Slide 7

Hardware Trends (2010)
• Storage: 50+ MB/s (HDD)
• Network: 1 Gbps
• CPU: ~3 GHz


Slide 8

Hardware Trends (2010 → 2015)
• Storage: 50+ MB/s (HDD) → 500+ MB/s (SSD)
• Network: 1 Gbps → 10 Gbps
• CPU: ~3 GHz → ~3 GHz


Slide 9

Hardware Trends (2010 → 2015)
• Storage: 50+ MB/s (HDD) → 500+ MB/s (SSD): 10x
• Network: 1 Gbps → 10 Gbps: 10x
• CPU: ~3 GHz → ~3 GHz: no improvement :(


Slide 10

Tungsten: Preparing Spark for the Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms


Slide 11

Interfaces to Tungsten
[Diagram: frontends (Spark SQL, DataFrames in Python/Java/Scala/R, RDDs, …) hand a data schema + query plan to Tungsten backends (JVM, LLVM, GPU, NVRAM, …)]


Slide 12

DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
[Chart: Google Trends interest over time for "data frame"]


Slide 13

DataFrame: lingua franca for "small data"

head(flights)
#> Source: local data frame [6 x 16]
#>
#>     year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1   2013     1   1      517         2      830        11      UA  N14228
#> 2   2013     1   1      533         4      850        20      UA  N24211
#> 3   2013     1   1      542         2      923        33      AA  N619AA
#> 4   2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..   ...   ... ...      ...       ...      ...       ...     ...     ...


Slide 14

Spark DataFrames
• DataFrame = RDD + schema: structured data collections with an API similar to R/Python data frames
• Capture many operations as expressions in a DSL
• Enables rich optimizations

df = jsonFile("tweets.json")
df.filter(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")

[Chart: running time of the same aggregation written with the Python RDD API, the Scala RDD API, and the DataFrame API]
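
For concreteness, here is a minimal runnable Scala version of the slide's snippet against the DataFrame API of that era. The file name and column names come from the slide; an existing SparkContext `sc` is assumed:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` and a tweets.json file
// with `user`, `date`, and `retweets` fields.
val sqlContext = new SQLContext(sc)
val tweets = sqlContext.read.json("tweets.json")

// Each call builds an expression tree rather than executing eagerly,
// which is what lets the optimizer rearrange and fuse the plan.
val result = tweets.filter(tweets("user") === "matei")
  .groupBy("date")
  .sum("retweets")

result.show()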


Slide 15

How does Tungsten help?


Slide 16

1. Off-Heap Memory Management
Store data outside the JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: a binary format we compute on directly
2-10x space saving, especially for strings and nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
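
As a minimal illustration of the general technique (not Spark's actual internals), a direct ByteBuffer keeps values off the JVM heap as raw bytes, with no per-object headers and nothing for the garbage collector to trace:

import java.nio.ByteBuffer

// Illustrative sketch only: pack values as raw bytes in off-heap
// (direct) memory instead of boxed JVM objects. A boxed java.lang.Long
// costs ~16-24 bytes plus a pointer; here each value is exactly 8 bytes
// and invisible to the GC.
val numRecords = 1000
val buf = ByteBuffer.allocateDirect(numRecords * 8) // off-heap allocation

// Write values directly in binary form.
for (i <- 0 until numRecords) buf.putLong(i.toLong * 2)

// Compute on the binary format directly, without deserializing to objects.
var sum = 0L
for (i <- 0 until numRecords) sum += buf.getLong(i * 8)
println(sum)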


Slide 17

2. Runtime Code Generation
Generate Java code for DataFrame and SQL expressions requested by the user
• Avoids virtual calls and generics/boxing
Can do the same in core, ML, and graph
• Code-gen serializers, fused functions, math expressions
[Chart: evaluating "SELECT a+a+a" (time in seconds): interpreted projection 36.7, code gen 9.4, hand-written 9.3]
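
To make the virtual-call and boxing point concrete, a toy contrast in Scala (illustrative only; Spark's real codegen emits Java source and compiles it at runtime): the interpreted path walks an expression tree, while the "generated" path is one flat, monomorphic function.

// Interpreted evaluation pays a virtual call and boxing at every node:
sealed trait Expr { def eval(row: Array[Any]): Any }
case class Col(i: Int) extends Expr {
  def eval(row: Array[Any]): Any = row(i)
}
case class Add(l: Expr, r: Expr) extends Expr {
  def eval(row: Array[Any]): Any =
    l.eval(row).asInstanceOf[Long] + r.eval(row).asInstanceOf[Long]
}

// "SELECT a+a+a" as an interpreted tree:
val tree = Add(Add(Col(0), Col(0)), Col(0))

// What code generation collapses this to: no virtual dispatch,
// no boxing on the hot path.
val generated: Array[Long] => Long = row => row(0) + row(0) + row(0)

val row = Array[Any](7L)
assert(tree.eval(row) == generated(Array(7L)))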


Slide 18

3. Cache-Aware Algorithms
Use a custom memory layout to better leverage the CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of the sort key inside the pointer array
• Compare prefixes to avoid full record fetches + comparisons
[Diagram: naïve layout (pointer → record) vs. cache-friendly layout (pointer + key prefix → record)]
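
A minimal Scala sketch of the prefix-sort idea (illustrative; Spark's actual implementation packs prefixes and pointers into flat arrays): comparisons touch only an 8-byte prefix stored next to the record index, and the full record is fetched only to break ties.

case class Entry(prefix: Long, recordIndex: Int)

// First 8 bytes of the key, big-endian, so unsigned numeric order on the
// prefix matches lexicographic order on the key.
def prefixOf(key: String): Long = {
  var p = 0L
  var i = 0
  while (i < 8) {
    val b = if (i < key.length) key.charAt(i).toLong & 0xFF else 0L
    p = (p << 8) | b
    i += 1
  }
  p
}

def sortByPrefix(records: Array[String]): Array[String] = {
  val entries = records.indices.map(i => Entry(prefixOf(records(i)), i)).toArray
  // Compare prefixes first; only on a tie fetch the full records.
  val sorted = entries.sortWith { (a, b) =>
    if (a.prefix != b.prefix) java.lang.Long.compareUnsigned(a.prefix, b.prefix) < 0
    else records(a.recordIndex) < records(b.recordIndex)
  }
  sorted.map(e => records(e.recordIndex))
}

println(sortByPrefix(Array("banana", "apple", "applesauce")).mkString(", "))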


Slide 19

Tungsten Performance Results
[Chart: run time (seconds, 0-1200) vs. relative data set size (1x-16x) for four configurations: Default, Code Gen, Tungsten on-heap, Tungsten off-heap]


Slide 20

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 21

Motivation
Network and storage speeds have improved 10x, but this speed isn't always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across the cluster
• Doing all this on many cores and many disks


Slide 22

Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software
Participated in the largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner
Set a new world record (tied with UCSD)
• Saturated 8 SSDs and a 10 Gbps network per node
• First time a public cloud + open source entry won


Slide 23

On-Disk Sort Record
Time to sort 100 TB:
• 2013 record: Hadoop, 2100 machines, 72 minutes
• 2014 record: Spark, 207 machines, 23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org


Slide 24

Saturating the Network
1.1 GB/sec per node


Slide 25

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 26

Motivation
Query planning is crucial to performance in a distributed setting:
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)
Hard to do well for big data, even with cost-based optimization:
• Unindexed data => no statistics available
• User-defined functions => hard to predict
Solution: let Spark change the query plan adaptively


Slide 27

Traditional Spark Scheduling

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: the full DAG of map → reduce → sort stages, planned up front]
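
For reference, the slide's pipeline as a runnable job, assuming an existing SparkContext `sc` and a hypothetical words.txt with one word per line:

// In traditional scheduling, the whole DAG (map, reduce, sort) is
// planned before any stage runs.
val counts = sc.textFile("words.txt")
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortByKey()

counts.take(10).foreach(println)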


Slide 28

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map stage]


Slide 29

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map stage]


Slide 30

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map and reduce stages]


Slide 31

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map and reduce stages]


Slide 32

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map and reduce stages]


Slide 33

Adaptive Planning

file.map(word => (word, 1)).reduceByKey(_ + _).sortByKey()

[Diagram: map, reduce, and sort stages]
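
To give a feel for the kind of decision adaptive planning makes, here is a hypothetical sketch (the function name and the 64 MB target are invented for illustration, not taken from SPARK-9850): run the map stage first, observe the real output sizes, then size the next stage accordingly.

// Hypothetical, not the actual Spark implementation: instead of fixing
// the number of reduce tasks up front, derive it from the measured map
// output, targeting a fixed amount of data per task.
def chooseReduceParallelism(mapOutputBytes: Seq[Long],
                            targetBytesPerTask: Long = 64L * 1024 * 1024): Int = {
  val totalBytes = mapOutputBytes.sum
  math.max(1, math.ceil(totalBytes.toDouble / targetBytesPerTask).toInt)
}

// E.g. 100 map tasks that each produced ~32 MB => 50 reduce tasks.
println(chooseReduceParallelism(Seq.fill(100)(32L * 1024 * 1024)))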


Slide 34

Advanced Example: Join
Goal: bring together data items with the same key


Slide 35

Advanced Example: Join
Goal: bring together data items with the same key
Shuffle join (good if both datasets are large)


Slide 36

Advanced Example: Join
Goal: bring together data items with the same key
Broadcast join (good if the top dataset is small)
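
As a usage note, Spark SQL lets users force this strategy by hand via the broadcast hint in org.apache.spark.sql.functions; adaptive planning aims to make the choice automatically from runtime statistics. The DataFrames `largeDF` and `smallDF` below are assumptions for illustration:

import org.apache.spark.sql.functions.broadcast

// The small table is shipped to every node, so the large table never
// has to shuffle. Assumes existing DataFrames `largeDF` and `smallDF`
// sharing a `key` column.
val joined = largeDF.join(broadcast(smallDF), "key")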


Slide 37

Advanced Example: Join
Goal: bring together data items with the same key
Hybrid join (broadcast popular keys, shuffle the rest)


Slide 38

Advanced Example: Join
Goal: bring together data items with the same key
Hybrid join (broadcast popular keys, shuffle the rest)
More details: SPARK-9850


Slide 39

Impact of Adaptive Planning
• Level of parallelism: 2-3x
• Choice of join algorithm: as much as 10x
Follow it at SPARK-9850


Slide 40

Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the Spark components get faster:
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => sped up all communication-intensive libraries
• Tungsten => DataFrames, SQL, and parts of ML
The same applies to other changes in core, e.g. debugging tools


Slide 41

Conclusion
Spark has grown a lot, but it remains the most active open source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels


Slide 42

Learn More: sparkhub.databricks.com

