Apache Druid 0.17.0 released with over 250 new features and performance improvements

Apache Druid is an open-source distributed OLAP data store for time-series business-intelligence analytics. It is widely used at Internet companies; Tencent Advertising, Alibaba's e-commerce business, and other such companies run it in production. Apache Druid 0.17.0 was released on 2020-01-27, containing over 250 new features, performance enhancements, and optimizations. Druid supports both single-server and clustered deployments; a single-server deployment is claimed to outperform a single MySQL server by more than 8x. For single-server hardware requirements and usage, please refer to the documentation below.

Apache Druid 0.17.0 contains over 250 new features, performance enhancements, bug fixes, and major documentation improvements from 52 contributors. Check out the complete list of changes and everything tagged to the milestone.


Highlights

Batch ingestion improvements

Druid 0.17.0 includes a significant update to the native batch ingestion system. This update adds the internal framework to support non-text binary formats, with initial support for ORC and Parquet. Additionally, native batch tasks can now read data from HDFS.

This rework changes how the ingestion source and data format are specified in the ingestion task. To use the new features, please refer to the documentation on InputSources and InputFormats.

Please see the following documentation for details:
https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#input-format
https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#input-sources
https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#partitionsspec
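As a hedged sketch of the new model (the HDFS path and values are illustrative, not from the release notes), a native batch ioConfig now pairs an inputSource with an inputFormat:

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "hdfs",
    "paths": "hdfs://namenode:8020/druid/example-data/"
  },
  "inputFormat": {
    "type": "parquet"
  }
}
```

Under the old model, the source and format were entangled in the firehose and parseSpec; separating them is what enables the new binary formats and HDFS support.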

#8812

Single dimension range partitioning for parallel native batch ingestion

The parallel index task now supports the single_dim type partitions spec, which allows for range-based partitioning on a single dimension.

Please see https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html for details.
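A partitionsSpec of this type might look like the following sketch (the dimension name and row target are illustrative; confirm the exact field names against the linked documentation):

```json
"partitionsSpec": {
  "type": "single_dim",
  "partitionDimension": "country",
  "targetNumRowsPerSegment": 5000000
}
```

Range partitioning on a frequently filtered dimension can improve query pruning, at the cost of a pre-pass to determine partition boundaries.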

Compaction changes

Parallel index task split hints

The parallel indexing task now has a new configuration, splitHintSpec, in the tuningConfig to allow for operators to provide hints to control the amount of data that each first phase subtask reads. There is currently one split hint spec type, SegmentsSplitHintSpec, used for re-ingesting Druid segments.
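A minimal sketch of this configuration, assuming the segments-type split hint spec described above (the byte value is illustrative):

```json
"tuningConfig": {
  "type": "index_parallel",
  "splitHintSpec": {
    "type": "segments",
    "maxInputSegmentBytesPerTask": 500000000
  }
}
```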

Parallel auto-compaction

Auto-compaction can now use the parallel indexing task, allowing for greater compaction throughput.

To control the level of parallelism, the auto-compaction tuningConfig has new parameters, maxNumConcurrentSubTasks and splitHintSpec.

Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#compaction-dynamic-configuration for details.
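A hedged sketch of an auto-compaction dynamic configuration using the new parallelism knobs (the datasource name and all values are illustrative):

```json
{
  "dataSource": "wikipedia",
  "maxRowsPerSegment": 5000000,
  "tuningConfig": {
    "maxNumConcurrentSubTasks": 4,
    "splitHintSpec": {
      "type": "segments",
      "maxInputSegmentBytesPerTask": 500000000
    }
  }
}
```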

#8570

Stateful auto-compaction

Auto-compaction now uses the partitionSpec to track changes made by previous compaction tasks, allowing the coordinator to reduce redundant compaction operations.

Please see #8489 for details.

If you have auto-compaction enabled, please see the information under “Stateful auto-compaction changes” in the “Upgrading to Druid 0.17.0” section before upgrading.

Parallel query merging on brokers

The Druid broker can now opportunistically merge query results in parallel using multiple threads.

Please see druid.processing.merge.useParallelMergePool in the Broker section of the configuration reference for details on how to configure this new feature.

Parallel merging is enabled by default (controlled by the druid.processing.merge.useParallelMergePool property), and most users should not have to change any of the advanced configuration properties described in the configuration reference.

Additionally, merge parallelism can be controlled on a per-query basis using the query context. Information about the new query context parameters can be found at https://druid.apache.org/docs/0.17.0/querying/query-context.html.
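For example, a native query could adjust merge parallelism via its context, as in this sketch (the query itself and the parameter values are illustrative; the parameter names are taken from the 0.17.0 query-context documentation):

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"],
  "granularity": "hour",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": {
    "enableParallelMerge": true,
    "parallelMergeParallelism": 4
  }
}
```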

#8578

SQL-compatible null handling

In 0.17.0, we have added official documentation for Druid’s SQL-compatible null handling mode.

Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#sql-compatible-null-handling and https://druid.apache.org/docs/0.17.0/design/segments.html#sql-compatible-null-handling for details.

Several bugs that existed in this previously undocumented mode have been fixed, particularly around null handling in numeric columns. We recommend that users begin to consider transitioning their clusters to this new mode after upgrading to 0.17.0.

The full list of null handling bugs fixed in 0.17.0 can be found at https://github.com/apache/druid/issues?utf8=%E2%9C%93&q=label%3A%22Area+-+Null+Handling%22+milestone%3A0.17.0+

LDAP extension

Druid now supports LDAP authentication. Authorization using LDAP groups is also supported by mapping LDAP groups to Druid roles.

  • LDAP authentication is handled by specifying an LDAP-type credentials validator.
  • Authorization using LDAP is handled by specifying an LDAP-type role provider, and defining LDAP group->Druid role mappings within Druid.

LDAP integration requires the druid-basic-security core extension. Please see https://druid.apache.org/docs/0.17.0/development/extensions-core/druid-basic-security.html for details.

As this is the first release with LDAP support, and there are a large variety of LDAP ecosystems, some LDAP use cases and features may not be supported yet. Please file an issue if you need enhancements to this new functionality.
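As a rough sketch of an LDAP-type credentials validator (the authenticator name, host, DNs, and credentials are placeholders; confirm the exact property names against the basic-security extension documentation):

```properties
druid.auth.authenticator.MyAuthenticator.type=basic
druid.auth.authenticator.MyAuthenticator.credentialsValidator.type=ldap
druid.auth.authenticator.MyAuthenticator.credentialsValidator.url=ldap://ldap.example.com:389
druid.auth.authenticator.MyAuthenticator.credentialsValidator.bindUser=cn=admin,dc=example,dc=com
druid.auth.authenticator.MyAuthenticator.credentialsValidator.bindPassword=password
druid.auth.authenticator.MyAuthenticator.credentialsValidator.baseDn=ou=Users,dc=example,dc=com
druid.auth.authenticator.MyAuthenticator.credentialsValidator.userAttribute=uid
```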

#6972

Dropwizard emitter

A new Dropwizard metrics emitter has been added as a contrib extension.

The currently supported Dropwizard metrics types are counter, gauge, meter, timer and histogram. These metrics can be emitted using either a Console or JMX reporter.

Please see https://druid.apache.org/docs/0.17.0/design/extensions-contrib/dropwizard.html for details.

#7363

Self-discovery resource

A new pair of endpoints has been added to all Druid services. They report whether the service has received confirmation from the central service discovery mechanism (currently ZooKeeper) that it has been added to the cluster. These endpoints can be useful as health/readiness checks.

The new endpoints are:

  • /status/selfDiscovered/status
  • /status/selfDiscovered

Please see the Druid API reference for details.

#6702
#9005

Supervisors system table

Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table.

Please see https://druid.apache.org/docs/0.17.0/querying/sql.html#supervisors-table for details.
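For instance, a Druid SQL query against the new table might look like the following sketch (column names per the 0.17.0 SQL documentation; the healthy filter is illustrative):

```sql
SELECT supervisor_id, type, state, healthy
FROM sys.supervisors
WHERE healthy = 0
```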

#8547

Fast historical start with lazy loading

A new boolean configuration property for historicals, druid.segmentCache.lazyLoadOnStart, has been added.

This new property allows historicals to defer loading of a segment until the first time that segment is queried, which can significantly decrease historical startup times for clusters with a large number of segments.

Please see the configuration reference for details.
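Enabling lazy loading is a one-line change in the historical runtime.properties:

```properties
druid.segmentCache.lazyLoadOnStart=true
```

Note that deferring segment loading trades faster startup for slower first-query latency on cold segments.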

#6988

Historical segment cache distribution change

A new historical property, druid.segmentCache.locationSelectorStrategy, has been added.

If there are multiple segment storage locations specified in druid.segmentCache.locations, the new locationSelectorStrategy property allows the user to specify what strategy is used to fill the locations. Currently supported options are roundRobin and leastBytesUsed.

Please see the configuration reference for details.
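A hedged sketch of a historical with two cache locations using the leastBytesUsed strategy (paths and sizes are illustrative; the exact serialization of the strategy property should be confirmed against the configuration reference):

```properties
druid.segmentCache.locations=[{"path":"/mnt/disk1/druid","maxSize":500000000000},{"path":"/mnt/disk2/druid","maxSize":500000000000}]
druid.segmentCache.locationSelectorStrategy={"type":"leastBytesUsed"}
```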

#8038

New readiness endpoints

A new Broker endpoint has been added: /druid/broker/v1/readiness.

A new Historical endpoint has been added: /druid/historical/v1/readiness.

These endpoints are similar to the existing /druid/broker/v1/loadstatus and /druid/historical/v1/loadstatus endpoints.

They differ in that they do not require authentication/authorization checks, and instead of a JSON body they only return a 200 success or 503 HTTP response code.

#8841

Support task assignment based on MiddleManager categories

It is now possible to define a “category” name property for each MiddleManager. New worker select strategies that are category-aware have been added, allowing the user to control how tasks are assigned to MiddleManagers based on the configured categories.

Please see the documentation for druid.worker.category in the configuration reference, and the following links, for more details:
https://druid.apache.org/docs/0.17.0/configuration/index.html#Equal-Distribution-With-Category-Spec
https://druid.apache.org/docs/0.17.0/configuration/index.html#Fill-Capacity-With-Category-Spec
https://druid.apache.org/docs/0.17.0/configuration/index.html#WorkerCategorySpec
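Each MiddleManager declares its category in runtime.properties (e.g., druid.worker.category=streaming); the overlord's dynamic worker configuration can then map task types to categories. A hedged sketch of a category-aware select strategy (the task type and category name are illustrative):

```json
"selectStrategy": {
  "type": "equalDistributionWithCategorySpec",
  "workerCategorySpec": {
    "strong": false,
    "categoryMap": {
      "index_kafka": {
        "defaultCategory": "streaming"
      }
    }
  }
}
```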

#7066

Security vulnerability updates

A large number of dependencies have been updated to newer versions to address security vulnerabilities.

Please see the associated PRs for details.

Upgrading to Druid 0.17.0

Select native query has been replaced

The deprecated Select native query type has been replaced in 0.17.0.

Please use the Scan native query type instead (https://druid.apache.org/docs/0.17.0/querying/scan-query.html).

For Druid SQL queries, no user action is needed; the SQL planner will create Scan queries instead of Select queries when applicable.
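For reference, a minimal Scan query equivalent in spirit to a former Select query (the datasource, interval, and columns are illustrative):

```json
{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-01-02"],
  "columns": ["__time", "page", "user"],
  "limit": 100
}
```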

#8739

Old consoles have been removed

The legacy coordinator and overlord consoles have been removed, and they have been replaced with the new web console on the coordinator and overlord.

#8838

Calcite 1.21 upgrade, Druid SQL null handling

Druid 0.17.0 updates Calcite to version 1.21. This newer version of Calcite can make additional optimizations that assume SQL-compliant null handling behavior when planning queries.

If you use Druid SQL and rely on null handling behavior, please read the information at https://druid.apache.org/docs/0.17.0/configuration/index.html#sql-compatible-null-handling and ensure that your Druid cluster is running in the SQL-compliant null handling mode before upgrading.

#8566

Logging adjustments

Druid 0.17.0 has tidied up its lifecycle, querying, and ingestion logging.

Please see #8889 for a detailed list of changes. If you relied on specific log messages for external integrations, please review the new logging changes before upgrading.

The full set of log messages can still be seen when logging is set to DEBUG level. Template log4j2 configuration files that show how to enable per-package DEBUG logging are provided in the _common configuration folder in the example clusters under conf/druid.

Stateful auto-compaction changes

The auto-compaction scheduling logic in 0.17.0 tracks additional segment partitioning information in Druid’s metadata store that is not present in older versions. This information is used to determine whether a set of segments has already been compacted under the cluster’s current auto-compaction configurations.

When this new metadata is not present, a set of segments will always be scheduled for an initial compaction and this new metadata will be created after they are compacted, allowing the scheduler to skip them later if auto-compaction config is unchanged.

Since this additional segment partitioning metadata is not present before 0.17.0, the auto-compaction scheduling logic will re-compact all segments within a datasource once after the upgrade to 0.17.0.

This re-compaction on the entire set of segments for each datasource that has auto-compaction enabled means that:

  • There will be a transition period after the upgrade where more total compaction tasks will be queued than under normal conditions
  • The deep storage usage will increase as the entire segment set is re-compacted (the old set of segments is still kept in deep storage unless explicitly removed).

Users are advised to be aware of the temporary increase in scheduled compaction tasks and the impact on deep storage usage. Documentation on removing old segments is located at https://druid.apache.org/docs/0.17.0/ingestion/data-management.html#deleting-data

targetCompactionSizeBytes property removed

The targetCompactionSizeBytes property has been removed from the compaction task and auto-compaction configuration. For auto-compaction, maxRowsPerSegment is now a mandatory configuration. For non-auto compaction tasks, any partitionsSpec can be used.

#8573

Compaction task tuningConfig

Due to the parallel auto-compaction changes introduced by #8570, any manually submitted compaction task specs need to be updated to use an index_parallel type for the tuningConfig section instead of index. These spec changes should be applied after the cluster is upgraded to 0.17.0.

Existing auto-compaction configs can remain unchanged after the update; the auto-compaction will create non-parallel compaction tasks until the auto-compaction configs are updated to use parallelism post-upgrade.

To control the level of parallelism, the auto-compaction tuningConfig has new parameters, maxNumConcurrentSubTasks and splitHintSpec.

Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#compaction-dynamic-configuration for details.

Compaction task ioConfig

The compaction task now requires an ioConfig in the task spec.

Please see https://druid.apache.org/docs/0.17.0/ingestion/data-management.html#compaction-ioconfig for details.

An ioConfig does not have to be added to existing auto-compaction configurations; after the upgrade, the coordinator will automatically create task specs with ioConfig sections.
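A hedged sketch of a manually submitted compaction task spec including the now-required ioConfig (the datasource and interval are illustrative; confirm the exact ioConfig shape against the linked documentation):

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "interval": "2020-01-01/2020-02-01"
  },
  "tuningConfig": {
    "type": "index_parallel"
  }
}
```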

#8571

Renamed partition spec fields

The targetPartitionSize and maxSegmentSize fields in the partition specs have been deprecated. They have been renamed to targetNumRowsPerSegment and maxRowsPerSegment respectively.

#8507

Cache metrics are off by default

Cache metrics are now disabled by default. To enable cache metrics, add "org.apache.druid.client.cache.CacheMonitor" to the druid.monitoring.monitors property.
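For example, in runtime.properties:

```properties
druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor"]
```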

#8561

Supervisor API has changed to be consistent with task API

Supervisor task specs should now put the dataSchema, tuningConfig, and ioConfig sections as subfields of a spec field. Please see #8810 for examples.

The old format is still accepted in 0.17.0.
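As a sketch of the new layout (sections heavily abbreviated; topic, datasource, and broker address are illustrative), a Kafka supervisor spec now nests its sections under a spec field:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": { "dataSource": "metrics" },
    "ioConfig": {
      "topic": "metrics",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```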

Segments API semantics change

The /datasources/{dataSourceName}/segments endpoint on the Coordinator now returns all used segments (including overshadowed) on the specified intervals, rather than only visible ones.

#8564

Password provider for basic authentication of HttpEmitterConfig

The druid.emitter.http.basicAuthentication property now accepts a password provider. We recommend updating your configurations to use a password provider if using the HTTP emitter.
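For example, using Druid's environment-variable password provider (the environment variable name is illustrative; it would hold a user:password value):

```properties
druid.emitter.http.basicAuthentication={"type":"environment","variable":"HTTP_EMITTER_BASIC_AUTH"}
```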

#8618

Multivalue expression transformation change

Reusing multi-valued columns in expressions will no longer result in unnecessary cartesian explosions. Please see the following links for details.

#8947
#8957

Kafka/Kinesis ingestion during rolling upgrades

During a rolling upgrade, if 0.17.0 tasks are running under overlords on an older version, and a task makes progress reading its stream but rejects every record it sees (e.g., because all of them are unparseable), the older overlords will throw NullPointerExceptions when the task reports its current stream offsets.

Previously, there was a bug in this area (#8765) where such tasks would fail to communicate their current offsets to the overlord. The task/overlord publishing protocol has been updated to fix this, but older overlords do not recognize this protocol change.

This condition should be fairly rare.

ParseSpec.verify method removed

If you were maintaining a custom extension that provides an implementation for the ParseSpec interface, the verify method has been removed, and the @Override annotation on the method will need to be removed in any custom implementations.

#8744

Known issues

Filtering on long columns in SQL-compatible null handling mode

We are currently aware of a bug with applying certain filter types on null values from long columns when SQL-compatible null handling is enabled (#9255).

Please file an issue if you encounter any other null handling problems.

Ingestion spec preview in web console

The preview specs shown for native batch ingestion tasks created in the Data Loader of the web console are not correctly formatted and will fail if you copy them and submit them manually. Submitting these specs through the Data Loader submits a correctly formatted spec, however.

#9144

Other known issues

For a full list of open issues, please see https://github.com/apache/druid/labels/Bug

